Baidu XiLing Digital Human MCP Server
Configuration
```json
{
  "mcpServers": {
    "baidu-xiling-mcp": {
      "command": "uvx",
      "args": [
        "${path/to/dh-mcp-server}"
      ],
      "env": {
        "DH_API_AK": "${API Key}",
        "DH_API_SK": "${Secret Key}"
      }
    }
  }
}
```

You can access Baidu Xiling Digital Human MCP capabilities through an MCP client to generate digital portraits, synthesize videos, perform voice cloning, and generate audio, all via MCP-compliant interfaces. The server exposes a set of tools for quickly integrating digital human services into your models and applications, enabling end-to-end workflows from portrait creation to video and audio production.
Connect your MCP-enabled agent or client to the server using the provided MCP configuration. Start by listing the available tools, then invoke the specific ones for 2D portrait creation, video synthesis, 123 digital human video production, speech synthesis, file uploads, voice queries, figure queries, and voice cloning. Use the appropriate tool for your scenario (e.g., create a 2D portrait from a real-person video, generate a digital human video from an existing portrait and timbre, or synthesize audio from text). Each tool returns task or figure identifiers that you can poll for status until the final artifact (video or audio) is available.
Typical usage pattern:
1. Upload any required media assets (videos, audio) for your chosen workflow.
2. Choose a digital portrait or let the system generate one from a source video.
3. Submit a synthesis or cloning task and poll for status using the task or figure IDs.
4. Retrieve the resulting video or audio URL when the task succeeds.
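The polling step above can be sketched as a simple loop. This is a minimal illustration with a stubbed status function standing in for the real MCP status tool; the status values (`RUNNING`, `SUCCESS`, `FAILED`) and field names are assumptions for the sketch, not the server's documented schema.

```python
import time

# Stub standing in for the MCP status tool; a real call would query
# the server with the task ID returned by the synthesis tool.
def query_task_status(task_id, _state={"calls": 0}):
    _state["calls"] += 1
    if _state["calls"] < 3:
        return {"status": "RUNNING"}
    return {"status": "SUCCESS", "videoUrl": "https://example.com/out.mp4"}

def poll_until_done(task_id, interval_s=0.01, timeout_s=10.0):
    """Poll the status tool until the task finishes or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = query_task_status(task_id)
        if result["status"] in ("SUCCESS", "FAILED"):
            return result
        time.sleep(interval_s)
    raise TimeoutError(f"task {task_id} did not finish in {timeout_s}s")

result = poll_until_done("task-123")
print(result["status"])  # SUCCESS once the stub reports completion
```

In a real client, `query_task_status` would be a call to the relevant MCP status tool; the backoff interval and timeout should match your task's expected duration.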
Prerequisites before installing the MCP server:
- Python 3.12 or higher
- API Key and Secret Key from the Xiling Open Platform
- Internet access to install packages and reach MCP services
Install the required tooling and the MCP server package, then test the server locally with the MCP Inspector.
Configuration and start-up rely on an MCP runtime that can run in a local development environment or integrate into your existing toolchain. The server is accessible from any MCP client and supports both local (stdio) and remote (HTTP) connection patterns. Provide your API credentials as environment variables when starting the server, and keep them out of source control.
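Reading the credentials from environment variables can look like the following sketch. The variable names `DH_API_AK` and `DH_API_SK` come from the configuration in this document; the helper function name is illustrative, and the placeholder values are injected only so the example runs standalone.

```python
import os

def load_dh_credentials():
    """Read the Xiling API credentials from the environment, failing fast if missing."""
    ak = os.environ.get("DH_API_AK")
    sk = os.environ.get("DH_API_SK")
    if not ak or not sk:
        raise RuntimeError("Set DH_API_AK and DH_API_SK before starting the MCP server")
    return ak, sk

# Demo only: inject placeholder values so the example is self-contained.
# In real use these come from your shell or secret manager, never from code.
os.environ.setdefault("DH_API_AK", "your-api-key")
os.environ.setdefault("DH_API_SK", "your-secret-key")
ak, sk = load_dh_credentials()
```

Failing fast on missing credentials keeps misconfiguration errors at start-up rather than surfacing as opaque authentication failures mid-task.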
Sample local start configuration demonstrates how to wire the MCP server into your development environment. The following configuration shows two stdio connections: one for a local digital human MCP wrapper and another for the Baidu Digital Human MCP Server package. Use your actual API credentials where placeholders appear.
```json
{
  "mcpServers": {
    "DH-STDIO": {
      "timeout": 60,
      "type": "stdio",
      "command": "uvx",
      "args": [
        "${path/to/dh-mcp-server}"
      ],
      "env": {
        "DH_API_AK": "${API Key}",
        "DH_API_SK": "${Secret Key}"
      }
    },
    "baidu_dh": {
      "timeout": 60,
      "type": "stdio",
      "command": "uvx",
      "args": [
        "mcp-server-baidu-digitalhuman"
      ],
      "env": {
        "DH_API_AK": "${API Key}",
        "DH_API_SK": "${Secret Key}"
      }
    }
  }
}
```

Keep API keys and secret keys secure. Do not expose credentials in client-side code or logs. Use environment variables and secret-management practices to protect sensitive information. When deploying, run the MCP server behind authentication and access controls to prevent unauthorized usage.
If a task fails, review the provided failedCode and failedMessage to identify the root cause. Check that required inputs (files, portrait IDs, or text content) meet the specified limits, and ensure the network path to the MCP server is reachable from your client. Use the status endpoints to poll for progress and retrieve the final outputs once a status of SUCCESS is reported.
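A defensive handler for task results might look like the following sketch. The `failedCode` and `failedMessage` fields follow the names mentioned above; the surrounding result shape and URL field names are assumptions for illustration.

```python
def handle_task_result(result):
    """Return the output URL on success, raise with failure details, or None if still running."""
    status = result.get("status")
    if status == "SUCCESS":
        # Video tasks return a video URL, audio tasks an audio URL; accept either.
        return result.get("videoUrl") or result.get("audioUrl")
    if status == "FAILED":
        code = result.get("failedCode", "unknown")
        message = result.get("failedMessage", "no message provided")
        raise RuntimeError(f"task failed ({code}): {message}")
    return None  # still running; keep polling

url = handle_task_result({"status": "SUCCESS", "videoUrl": "https://example.com/out.mp4"})
```

Surfacing `failedCode` and `failedMessage` directly in the raised error makes the root cause visible at the call site instead of being lost in generic exception text.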
Example workflows you can build with MCP:
- Create a 2D digital portrait from a video, then generate a digital human video using a chosen timbre.
- Generate a short (10-second to 4-minute) live video using a pre-existing portrait and a selected voice model.
- Synthesize speech from text for audio-only output using a chosen timbre.
- Upload media assets once and reuse them across subsequent tasks such as video production and voice cloning.
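The first workflow (portrait from video, then video synthesis) can be sketched end-to-end. The client class below is an in-memory stand-in for a real MCP client; the tool names, parameters, and returned field names are illustrative assumptions, not the server's exact interface.

```python
class FakeXilingClient:
    """In-memory stand-in for an MCP client talking to the digital human server."""

    def upload_file(self, path):
        # Real tool: upload a media asset, returning a file identifier.
        return {"fileId": f"file-{path}"}

    def create_2d_portrait(self, file_id):
        # Real tool: generate a 2D portrait from the uploaded video.
        return {"figureId": f"figure-from-{file_id}"}

    def create_video(self, figure_id, timbre_id, text):
        # Real tool: submit a video synthesis task for the portrait and timbre.
        return {"taskId": f"task-{figure_id}-{timbre_id}"}

def produce_video(client, source_video, timbre_id, script):
    """Upload a source video, build a portrait from it, then submit a synthesis task."""
    file_id = client.upload_file(source_video)["fileId"]
    figure_id = client.create_2d_portrait(file_id)["figureId"]
    task_id = client.create_video(figure_id, timbre_id, script)["taskId"]
    return task_id  # poll this ID until the task reports SUCCESS

task_id = produce_video(FakeXilingClient(), "intro.mp4", "timbre-01", "Hello!")
```

The point of the sketch is the chaining: each step's identifier (`fileId`, `figureId`, `taskId`) feeds the next tool call, and only the final task ID needs to be polled.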
The server exposes the following tools:
- Generate a digital portrait from an uploaded real-person video for basic video production with a universal lip drive.
- Query the progress of digital portrait generation and list available system portraits.
- Create a digital human video from a selected portrait and timbre, with options for driving type, resolution, and optional subtitles.
- Poll the status of a digital human video synthesis task and retrieve the video URL on success.
- Produce a digital human video directly from a sample video and timbre without portrait generation.
- Query the status of a 123 digital human video synthesis task and retrieve the output URL.
- Synthesize audio from text with a chosen timbre, without generating video.
- Query the status of text-to-audio synthesis and retrieve the audio URL.
- Upload media files required by subsequent digital human services such as cloning or video production.
- Query available system and cloned voices for selection.
- Query available 2D digital portrait figures.
- Create timbres by cloning uploaded audio for use in synthesis and video production.
- Check the status and results of a voice clone task.