VOICEVOX TTS MCP Server
Provides VOICEVOX-based text-to-speech via MCP with multi-speaker control, streaming playback, and cross-platform support.
Configuration
{
  "mcpServers": {
    "kajidog-mcp-tts-voicevox": {
      "url": "http://localhost:3000/mcp",
      "headers": {
        "VOICEVOX_URL": "http://localhost:50021",
        "MCP_HTTP_HOST": "0.0.0.0",
        "MCP_HTTP_MODE": "true",
        "MCP_HTTP_PORT": "3000",
        "MCP_ALLOWED_HOSTS": "localhost,127.0.0.1",
        "VOICEVOX_USE_STREAMING": "false",
        "VOICEVOX_DEFAULT_SPEAKER": "1",
        "VOICEVOX_DEFAULT_IMMEDIATE": "true",
        "VOICEVOX_RESTRICT_IMMEDIATE": "false",
        "VOICEVOX_DEFAULT_SPEED_SCALE": "1.0",
        "VOICEVOX_DEFAULT_WAIT_FOR_END": "false",
        "VOICEVOX_RESTRICT_WAIT_FOR_END": "false",
        "VOICEVOX_DEFAULT_WAIT_FOR_START": "false",
        "VOICEVOX_RESTRICT_WAIT_FOR_START": "false"
      }
    }
  }
}
You run a VOICEVOX-based text-to-speech MCP server that lets clients such as Claude Desktop speak in multiple voices, control playback, and stream audio on Windows, macOS, and Linux.
Connect to the MCP server from your client and call the speak tool to convert text into speech. You can switch speakers within a single request by assigning a different speaker ID to each text segment. Playback can start immediately or be queued, with options to wait for playback to start or to finish, and you can enable streaming playback when your environment supports it. A single call can therefore deliver a multi-segment, multi-speaker dialogue.
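As a rough illustration of per-segment speaker selection, the sketch below joins a list of segments into one text payload for the speak tool. The `Segment` type and `buildSpeakText` helper are hypothetical names introduced here, and the `speakerId:text` line format is an assumption about the tool's input, so check the schema your server actually reports before relying on it.

```typescript
// Hypothetical segment shape, used only for this sketch.
interface Segment {
  speaker: number; // VOICEVOX speaker ID
  text: string;    // text spoken by that speaker
}

// Assumption: the speak tool accepts one `text` string in which each line
// may carry a "<speakerId>:" prefix. Verify this against the tool schema.
function buildSpeakText(segments: Segment[]): string {
  return segments
    // Collapse embedded newlines so every segment stays on its own line.
    .map((s) => `${s.speaker}:${s.text.replace(/\s*\n\s*/g, " ")}`)
    .join("\n");
}

const dialogue = buildSpeakText([
  { speaker: 1, text: "Hello" },
  { speaker: 3, text: "Nice to meet you" },
]);
// dialogue === "1:Hello\n3:Nice to meet you"
```

The resulting string would be passed as the text argument of a single speak call, which keeps a whole dialogue in one request.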
Before installing, you need Node.js 18.0.0 or higher and a running VOICEVOX Engine. Installing ffplay is also recommended for low-latency streaming playback.
Install the MCP server and required tooling with these steps.
# 1) Install Node.js (if you don’t have it)
# Use a version manager or download from nodejs.org
# 2) Start VOICEVOX Engine (must be running)
# 3) Install dependencies and start the MCP server
# This example uses npm via npx to run the server from the package
npx -y @kajidog/mcp-tts-voicevox
Configure your client to reach the MCP server either over HTTP or through a local stdio process. HTTP mode suits remote access, while stdio mode lets the client start the server itself from its configuration.
{
  "mcpServers": {
    "tts-http": {
      "type": "http",
      "url": "http://localhost:3000/mcp",
      "args": []
    },
    "tts-stdio": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@kajidog/mcp-tts-voicevox"]
    }
  }
}
If audio does not play, verify that the VOICEVOX Engine is running and that the playback tools are available on your platform (ffplay is optional but enables streaming). If the MCP client does not recognize the server, confirm that the command or URL is correct and that you restarted the client after changing its configuration.
Environment variables control how the MCP server operates. Common settings include the VOICEVOX Engine URL, the default speaker, and streaming options. You can also restrict certain playback options for safety and consistency.
# Example environment variables
export VOICEVOX_URL=http://localhost:50021
export VOICEVOX_DEFAULT_SPEAKER=1
export VOICEVOX_DEFAULT_SPEED_SCALE=1.0
export VOICEVOX_USE_STREAMING=false
export VOICEVOX_DEFAULT_IMMEDIATE=true
export VOICEVOX_DEFAULT_WAIT_FOR_START=false
export VOICEVOX_DEFAULT_WAIT_FOR_END=false
When using HTTP mode, limit access to the MCP server by configuring allowed hosts and origins (for example, MCP_ALLOWED_HOSTS). If you expose the server on a network, ensure that only trusted clients can reach it.
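To make the settings above concrete, here is a minimal sketch of how a process might read these variables and apply a host allow-list. It is not the server's actual implementation: `loadConfig`, `isHostAllowed`, and the `TtsConfig` shape are hypothetical, and the fallback values simply mirror the example values shown in this document rather than confirmed defaults.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical config shape for this sketch.
interface TtsConfig {
  voicevoxUrl: string;
  defaultSpeaker: number;
  defaultSpeedScale: number;
  useStreaming: boolean;
  allowedHosts: string[];
}

// Accept only "true" or "false"; anything else falls back to the default.
function parseBool(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined) return fallback;
  const v = value.trim().toLowerCase();
  if (v === "true") return true;
  if (v === "false") return false;
  return fallback;
}

// Fall back when the value is missing, empty, or not a finite number.
function parseNumber(value: string | undefined, fallback: number): number {
  if (value === undefined || value.trim() === "") return fallback;
  const n = Number(value);
  return isFinite(n) ? n : fallback;
}

// Fallbacks mirror the example values in this document (an assumption).
function loadConfig(env: Env): TtsConfig {
  return {
    voicevoxUrl: env.VOICEVOX_URL ?? "http://localhost:50021",
    defaultSpeaker: parseNumber(env.VOICEVOX_DEFAULT_SPEAKER, 1),
    defaultSpeedScale: parseNumber(env.VOICEVOX_DEFAULT_SPEED_SCALE, 1.0),
    useStreaming: parseBool(env.VOICEVOX_USE_STREAMING, false),
    allowedHosts: (env.MCP_ALLOWED_HOSTS ?? "localhost,127.0.0.1")
      .split(",")
      .map((h) => h.trim().toLowerCase())
      .filter((h) => h.length > 0),
  };
}

// Compare a Host header value (with any port removed) against the list.
function isHostAllowed(hostHeader: string, allowed: string[]): boolean {
  const host = hostHeader.trim().toLowerCase().replace(/:\d+$/, "");
  return allowed.indexOf(host) !== -1;
}

const config = loadConfig({
  VOICEVOX_DEFAULT_SPEAKER: "3",
  VOICEVOX_USE_STREAMING: "true",
});
```

Passing the environment in as a parameter, rather than reading a global, keeps the parsing easy to test. In a Node.js process you would typically call it with `process.env`.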
The server exposes a set of tools to perform synthesis and manage playback from MCP clients. The primary tool is speak, which converts text to speech. Other useful tools include ping_voicevox, get_speakers, get_speaker_detail, stop_speaker, generate_query, and synthesize_file.
speak: Text-to-speech synthesis callable from MCP clients to convert text into audio. Supports per-segment speaker selection, speed control, and playback options.
ping_voicevox: Check connectivity to the VOICEVOX Engine.
get_speakers: Retrieve the list of available VOICEVOX speakers.
get_speaker_detail: Fetch detailed information for a specific speaker.
stop_speaker: Stop current playback and clear the queue.
generate_query: Create a speech synthesis query for debugging or inspection.
synthesize_file: Generate an audio file from text input.
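MCP clients invoke tools like these through JSON-RPC `tools/call` requests. The sketch below shows roughly what that envelope looks like for the tools named above. `buildToolCall` is a hypothetical helper, and the `text` argument name is an assumption. In practice an MCP client library builds this message for you.

```typescript
// Tool names as listed in this document.
type ToolName =
  | "speak"
  | "ping_voicevox"
  | "get_speakers"
  | "get_speaker_detail"
  | "stop_speaker"
  | "generate_query"
  | "synthesize_file";

// Shape of an MCP tools/call request (JSON-RPC 2.0 envelope).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: ToolName; arguments: Record<string, unknown> };
}

let nextId = 1; // simple incrementing request ID for this sketch

function buildToolCall(
  name: ToolName,
  args: Record<string, unknown> = {}
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

const ping = buildToolCall("ping_voicevox");
// `text` is an assumed argument name; consult the server's tool schema.
const speakCall = buildToolCall("speak", { text: "Hello" });
```

Typing the tool name as a union means a misspelled name such as "get_speaker" fails at compile time instead of at the server.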