Whisper MCP Server
Provides audio management and transcription capabilities via MCP using Whisper and GPT-4o models.
Configuration
{
  "mcpServers": {
    "arcaputo3-mcp-server-whisper": {
      "command": "uv",
      "args": [
        "run",
        "mcp-server-whisper"
      ],
      "env": {
        "OPENAI_API_KEY": "${OPENAI_API_KEY}",
        "AUDIO_FILES_PATH": "${AUDIO_FILES_PATH}"
      }
    }
  }
}
MCP Server Whisper processes audio files using OpenAI Whisper and GPT-4o models. It exposes a structured, type-safe set of tools for locating, converting, compressing, transcribing, and enhancing audio, and it uses parallelism and caching to speed up repeated tasks.
Use an MCP client to call the available tools for audio management and transcription. You can search for audio files with filters, convert formats, compress large files, and transcribe audio with several OpenAI models. You can also chat about audio content and generate text-to-speech output. All results are returned in typed, self-describing formats to simplify integration with AI assistants.
# Clone the server and enter the project
git clone https://github.com/arcaputo3/mcp-server-whisper.git
cd mcp-server-whisper
# Install dependencies with uv, which the project uses for development
# (if uv is not installed, follow the project’s setup steps for Python and MCP tooling)
# Make sure OPENAI_API_KEY and AUDIO_FILES_PATH are set in the environment at runtime
# Start the MCP server with the same command used in the configuration snippet
uv run mcp-server-whisper
{
  "mcpServers": {
    "whisper": {
      "command": "uv",
      "args": ["run", "mcp-server-whisper"],
      "env": {
        "OPENAI_API_KEY": "${OPENAI_API_KEY}",
        "AUDIO_FILES_PATH": "${AUDIO_FILES_PATH}"
      }
    }
  }
}
The environment variables OPENAI_API_KEY and AUDIO_FILES_PATH must be defined in your runtime environment. You can keep them in a .env file and load it in whatever way your local setup requires.
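As a minimal sketch of that setup, the Python snippet below parses a simple KEY=VALUE .env file and reports which of the two required variables are still missing. The helper names (`parse_env_file`, `missing_vars`) are illustrative and not part of the server, and the parser only handles plain KEY=VALUE lines; a dedicated library such as python-dotenv covers more cases.

```python
import os
from pathlib import Path

# The two variables the configuration snippet passes to the server.
REQUIRED_VARS = ("OPENAI_API_KEY", "AUDIO_FILES_PATH")


def parse_env_file(path):
    """Parse plain KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    env = dict(os.environ)
    if Path(".env").exists():
        # Values already present in the real environment take precedence.
        env = {**parse_env_file(".env"), **env}
    missing = missing_vars(env)
    print("missing:", ", ".join(missing) if missing else "none")
```

Running a check like this before starting the server makes a missing key show up as a clear message rather than as a failed API call later on.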
The server exposes a comprehensive set of tools for audio file management and processing as described below. Each tool returns a strongly typed result to ensure reliable integration with MCP clients.
list_audio_files — Lists audio files with comprehensive filtering and sorting options, returning full metadata.
get_latest_audio — Retrieves the most recently modified audio file with model support information.
convert_audio — Converts audio files to supported formats (mp3 or wav) and returns the output path.
compress_audio — Compresses audio files that exceed size limits and returns the output path.
transcribe_audio — Transcribes audio using OpenAI models such as whisper-1, gpt-4o-transcribe, or gpt-4o-mini-transcribe with optional prompts and timestamps.
chat_with_audio — Interactive audio analysis using GPT-4o audio models and prompts, returning conversational responses.
transcribe_with_enhancement — Enhanced transcription with templates like detailed, storytelling, professional, and analytical, returning enhanced transcripts.
create_audio — Text-to-speech generation using gpt-4o-mini-tts with multiple voices and adjustable speed.
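MCP clients invoke tools like these with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for `transcribe_audio` in Python to show the message shape. The argument names (`file_path`, `model`) are assumptions made for illustration; the real input schema is whatever the server reports in response to `tools/list`.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical arguments: check the tool's actual schema before relying on them.
request = build_tool_call(
    1,
    "transcribe_audio",
    {"file_path": "meeting.mp3", "model": "gpt-4o-mini-transcribe"},
)
print(json.dumps(request, indent=2))
```

In practice an MCP client library constructs and sends this message for you; the sketch is only meant to make the request structure concrete.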
Files larger than 25 MB are automatically compressed to meet API limits. Transcriptions can include timestamps and structured outputs for easy downstream use.
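To anticipate which files would trigger that compression step, you could check sizes up front. The sketch below is not the server's code: `files_over_limit` is a hypothetical helper, the extension list is an illustrative subset, and it assumes the 25 MB limit means 25 × 1024 × 1024 bytes.

```python
from pathlib import Path

SIZE_LIMIT_BYTES = 25 * 1024 * 1024  # assumption: 25 MB counted in binary megabytes
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}  # illustrative subset


def files_over_limit(directory, limit=SIZE_LIMIT_BYTES):
    """Return audio files directly inside directory that are larger than limit."""
    return sorted(
        p
        for p in Path(directory).iterdir()
        if p.is_file()
        and p.suffix.lower() in AUDIO_EXTENSIONS
        and p.stat().st_size > limit
    )
```

For example, `files_over_limit(os.environ["AUDIO_FILES_PATH"])` (after `import os`) would list the files in the configured audio directory that exceed the assumed limit.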