
phone-agent skill


This skill runs a real-time voice AI on phone calls: it transcribes the caller's speech, reasons with an LLM, and speaks responses back via TTS.

npx playbooks add skill openclaw/skills --skill phone-agent


Files (8)

SKILL.md
---
name: phone-agent
description: "Run a real-time AI phone agent using Twilio, Deepgram, and ElevenLabs. Handles incoming calls, transcribes audio, generates responses via LLM, and speaks back via streaming TTS. Use when user wants to: (1) Test voice AI capabilities, (2) Handle phone calls programmatically, (3) Build a conversational voice bot."
---

# Phone Agent Skill

Runs a local FastAPI server that acts as a real-time voice bridge.

## Architecture

```
Twilio (Phone) <--> WebSocket (Audio) <--> [Local Server] <--> Deepgram (STT)
                                                  |
                                                  +--> OpenAI (LLM)
                                                  +--> ElevenLabs (TTS)
```

## Prerequisites

1.  **Twilio Account**: Phone number + TwiML App.
2.  **Deepgram API Key**: For fast speech-to-text.
3.  **OpenAI API Key**: For the conversation logic.
4.  **ElevenLabs API Key**: For realistic text-to-speech.
5.  **Ngrok** (or similar): To expose your local port 8080 to Twilio.

## Setup

1.  **Install Dependencies**:
    ```bash
    pip install -r scripts/requirements.txt
    ```

2.  **Set Environment Variables** (in `~/.moltbot/.env`, `~/.clawdbot/.env`, or export):
    ```bash
    export DEEPGRAM_API_KEY="your_key"
    export OPENAI_API_KEY="your_key"
    export ELEVENLABS_API_KEY="your_key"
    export TWILIO_ACCOUNT_SID="your_sid"
    export TWILIO_AUTH_TOKEN="your_token"
    export PORT=8080
    ```

3.  **Start the Server**:
    ```bash
    python3 scripts/server.py
    ```

4.  **Expose to Internet**:
    ```bash
    ngrok http 8080
    ```

5.  **Configure Twilio**:
    - Go to your Phone Number settings.
    - Set "Voice & Fax" -> "A Call Comes In" to **Webhook**.
    - URL: `https://<your-ngrok-url>.ngrok.io/incoming`
    - Method: `POST`
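
When Twilio hits the `/incoming` webhook, the server must answer with TwiML that upgrades the call to a bidirectional media stream. A minimal sketch of what that response could look like, assuming the server exposes its WebSocket endpoint at `/media` (the path and helper name are illustrative, not taken from `scripts/server.py`):

```python
# Hypothetical sketch: build the TwiML that the /incoming webhook would
# return so Twilio opens a bidirectional WebSocket media stream back to
# the server. The "/media" path is an assumption for illustration.
def incoming_call_twiml(public_host: str) -> str:
    """Build the TwiML response for Twilio's 'A Call Comes In' webhook."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        "<Connect>"
        f'<Stream url="wss://{public_host}/media" />'
        "</Connect>"
        "</Response>"
    )

print(incoming_call_twiml("abc123.ngrok.io"))
```

Note that the `<Stream>` URL uses `wss://`, not `https://`; Twilio connects to it as a WebSocket after the webhook returns.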

## Usage

Call your Twilio number. The agent should answer, transcribe your speech, think, and reply in a natural voice.

## Customization

- **System Prompt**: Edit `SYSTEM_PROMPT` in `scripts/server.py` to change the persona.
- **Voice**: Change `ELEVENLABS_VOICE_ID` to use different voices.
- **Model**: Switch `gpt-4o-mini` to `gpt-4` for smarter (but slower) responses.
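
As a rough sketch, the tunables above might sit near the top of `scripts/server.py` like this (only `SYSTEM_PROMPT` and `ELEVENLABS_VOICE_ID` are named by this skill; the other constant name and the values are illustrative assumptions, not the file's actual contents):

```python
# Hypothetical configuration block; adjust to taste.
SYSTEM_PROMPT = "You are a friendly receptionist. Keep replies under two sentences."
ELEVENLABS_VOICE_ID = "21m00Tcm4TlvDq8ikWAM"  # example ElevenLabs voice ID
LLM_MODEL = "gpt-4o-mini"  # swap to "gpt-4" for smarter but slower replies
```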

Overview

This skill runs a real-time AI phone agent that connects Twilio calls to a local FastAPI server, transcribes audio with Deepgram, generates conversational responses with an LLM, and returns spoken audio via ElevenLabs streaming TTS. It is designed for rapid testing and prototyping of voice AI, or for programmatic handling of phone calls. The setup uses ngrok (or similar) to expose a local port to Twilio for live phone integration.

How this skill works

Incoming Twilio calls are routed to the FastAPI server over a webhook and upgraded to a WebSocket audio stream. The server forwards audio to Deepgram for speech-to-text, passes transcriptions to an LLM for intent and response generation, and streams ElevenLabs TTS audio back to the caller. Configuration is driven by environment variables for API keys, Twilio credentials, and voice/model choices.

When to use it

  • Prototype and test conversational voice agents before production deployment
  • Programmatically answer and handle inbound phone calls with AI-driven logic
  • Evaluate speech-to-text and streaming TTS integration in a single pipeline
  • Build a natural-voice customer support or IVR prototype
  • Experiment with different LLMs, system prompts, or TTS voices quickly

Best practices

  • Run the server locally and expose only the required port with ngrok or a secure tunnel
  • Keep API keys and sensitive configs in environment variables, not in source
  • Start with a lightweight model for iteration, then move to larger models for production
  • Limit call concurrency and add logging to monitor transcription and TTS quality
  • Customize the SYSTEM_PROMPT and voice ID to match your desired persona and tone

Example use cases

  • A developer testing a voice assistant’s conversational flow on a real phone number
  • A small business deploying an AI receptionist to answer routine inbound calls
  • A product team benchmarking Deepgram STT vs. alternative transcribers in live calls
  • Rapidly iterating on TTS voice and prompt design for a customer service bot
  • Integrating an LLM-driven IVR that can hand off to a human when needed

FAQ

What external services do I need?

You need Twilio (phone + TwiML app), Deepgram (STT), an LLM provider (OpenAI or similar), and ElevenLabs (TTS).

Can I run this in production?

This setup is intended for prototyping. For production, replace ngrok with a stable HTTPS endpoint, add authentication, scale handling, and robust error/retry logic.
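
One concrete piece of that hardening is verifying Twilio's `X-Twilio-Signature` header so only genuine Twilio requests reach the webhook. Twilio documents the scheme as HMAC-SHA1 over the full request URL plus the POST parameters concatenated in sorted key order, keyed by your auth token. A minimal sketch (the Twilio helper library's `RequestValidator` does the same thing; the function names here are illustrative):

```python
import base64
import hashlib
import hmac

def twilio_signature(auth_token: str, url: str, params: dict[str, str]) -> str:
    """Compute the expected X-Twilio-Signature for a webhook request."""
    # Full URL, then each POST key+value pair sorted alphabetically by key
    data = url + "".join(k + v for k, v in sorted(params.items()))
    digest = hmac.new(auth_token.encode(), data.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

def is_valid_request(auth_token: str, url: str,
                     params: dict[str, str], header_signature: str) -> bool:
    """Constant-time comparison against the header Twilio sent."""
    expected = twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, header_signature)
```

Rejecting requests that fail this check (HTTP 403) closes off the most obvious abuse vector once the endpoint is on a stable public URL.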