Official DINO-X Model Context Protocol (MCP) server that empowers LLMs with real-world visual perception through image object detection, localization, and captioning APIs.
Configuration

```json
{
  "mcpServers": {
    "idea-research-dino-x-mcp": {
      "url": "https://mcp.deepdataspace.com/mcp?key=your-api-key",
      "headers": {
        "DINOX_API_KEY": "YOUR_API_KEY_PLACEHOLDER",
        "IMAGE_STORAGE_DIRECTORY": "/path/to/your/image/directory"
      }
    }
  }
}
```

DINO-X MCP Server enables fine-grained object detection and image understanding in multimodal applications. It can run as a local stdio service or be accessed via a hosted streamable HTTP MCP endpoint, letting you build end-to-end visual agents and automation pipelines with structured outputs such as object categories, counts, locations, and attributes.
You connect your MCP client to one or more DINO-X MCP Server endpoints and then send image inputs to trigger detection, segmentation, and reasoning tasks. Run the server locally in stdio mode for fast development, or use the hosted streamable HTTP endpoint for cloud-ready deployment. The server exposes tools for full-scene object detection, text-prompted detection, human pose estimation, and visualization of results.
Prerequisites:
1) Install Node.js (LTS) or use your preferred Node.js setup. You will either run the MCP server locally or connect to the hosted endpoint.
2) If you plan to run in stdio mode, prepare your environment variables and a storage directory for annotated outputs.
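For stdio runs, the environment variables shown in the configuration can be set in your shell before launching the server. A minimal sketch (both values are placeholders; substitute your own API key and path):

```shell
# Placeholder values: replace with your real API key and preferred directory.
export DINOX_API_KEY="your-api-key"
export IMAGE_STORAGE_DIRECTORY="$HOME/dinox-outputs"

# Create the output directory so annotated images can be saved there.
mkdir -p "$IMAGE_STORAGE_DIRECTORY"
```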
MCP connection options are available in two forms: a hosted HTTP endpoint and local stdio configurations. The HTTP option lets you connect to a remote MCP server using a single URL. The local stdio options run the MCP server on your machine using a command and arguments you provide.
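A local stdio configuration typically replaces the `url` with a `command` and `args`. A sketch, assuming the server is distributed as an npm package run via `npx` (the package name below is a placeholder; check the official docs for the real command):

```json
{
  "mcpServers": {
    "dino-x-mcp": {
      "command": "npx",
      "args": ["-y", "@your-scope/dinox-mcp"],
      "env": {
        "DINOX_API_KEY": "YOUR_API_KEY_PLACEHOLDER",
        "IMAGE_STORAGE_DIRECTORY": "/path/to/your/image/directory"
      }
    }
  }
}
```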
- Full-scene object detection: returns category labels, bounding boxes, and optional captions for every object in the scene.
- Text-prompted object detection: targets objects named by English nouns and returns bounding boxes plus optional captions.
- Human pose estimation: provides 17 keypoints per person, a bounding box, and optional captions.
- Visualization: generates an annotated image from the input and detection results, saving it to a local path.
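The structured outputs above (categories, bounding boxes, counts) are straightforward to post-process in an agent pipeline. A sketch in Python, assuming a hypothetical detection payload shape; the real DINO-X field names may differ, so treat this as illustrative only:

```python
from collections import Counter

# Hypothetical detection result shape; actual DINO-X fields may differ.
detections = [
    {"category": "person", "bbox": [12, 30, 140, 220], "score": 0.93},
    {"category": "dog",    "bbox": [200, 80, 310, 190], "score": 0.88},
    {"category": "person", "bbox": [320, 25, 450, 230], "score": 0.81},
]

# Per-category counts: one of the structured outputs an agent can act on.
counts = Counter(d["category"] for d in detections)

# Approximate object locations as bounding-box centers (x, y).
centers = [
    ((x0 + x1) / 2, (y0 + y1) / 2)
    for x0, y0, x1, y1 in (d["bbox"] for d in detections)
]

print(counts)      # e.g. Counter({'person': 2, 'dog': 1})
print(centers[0])  # center of the first detection
```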