
DINO-X Official MCP Server

The official DINO-X Model Context Protocol (MCP) server gives LLMs real-world visual perception through image object detection, localization, and captioning APIs.

Installation
Add the following to your MCP client configuration file.

Configuration

{
  "mcpServers": {
    "idea-research-dino-x-mcp": {
      "url": "https://mcp.deepdataspace.com/mcp?key=your-api-key",
      "headers": {
        "DINOX_API_KEY": "YOUR_API_KEY_PLACEHOLDER",
        "IMAGE_STORAGE_DIRECTORY": "/path/to/your/image/directory"
      }
    }
  }
}

DINO-X MCP Server enables fine-grained object detection and image understanding in multimodal applications. It can run as a local stdio service or be accessed via a hosted streamable HTTP MCP endpoint, letting you build end-to-end visual agents and automation pipelines with structured outputs like object categories, counts, locations, and attributes.

How to use

Connect your MCP client to the DINO-X MCP Server, then send image inputs to trigger detection tasks. Run the server locally in stdio mode for fast development, or use the hosted streamable HTTP endpoint for deployment. The server exposes tools for full-scene object detection, text-prompted detection, human pose estimation, and visualization of results.

How to install

Prerequisites:

1) Install Node.js (an LTS release is a safe choice). It is needed to run the MCP server locally; the hosted endpoint is reached over HTTP instead.

2) If you plan to run in stdio mode, set your environment variables and choose a storage directory for annotated output images.
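
Before launching a local stdio server, it can help to confirm the environment is ready. The following is a minimal sketch, not part of the server itself: the variable names DINOX_API_KEY and IMAGE_STORAGE_DIRECTORY are taken from the configuration example above, while the function name and the choice to treat both as required are assumptions made for illustration.

```python
import os
from pathlib import Path


def check_stdio_environment(env):
    """Return a list of human-readable problems found in an environment mapping.

    Assumption: both variables are treated as required here; the server's own
    rules may be more lenient.
    """
    problems = []
    if not env.get("DINOX_API_KEY"):
        problems.append("DINOX_API_KEY is not set")
    storage = env.get("IMAGE_STORAGE_DIRECTORY")
    if not storage:
        problems.append("IMAGE_STORAGE_DIRECTORY is not set")
    elif not Path(storage).is_dir():
        problems.append(f"IMAGE_STORAGE_DIRECTORY does not exist: {storage}")
    return problems


if __name__ == "__main__":
    # Report every problem instead of stopping at the first one.
    for problem in check_stdio_environment(os.environ):
        print("warning:", problem)
```

An empty list means both values are present and the storage directory exists.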

Additional sections

MCP connection options are available in two forms: a hosted HTTP endpoint and local stdio configurations. The HTTP option lets you connect to a remote MCP server using a single URL. The local stdio options run the MCP server on your machine using a command and arguments you provide.
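
The two connection forms can be sketched as configuration entries generated from Python. This is an illustration under stated assumptions: the `command`, `args`, and `env` keys follow the convention used by many MCP clients and may differ in yours, and the `node` command with `/path/to/server/entry.js` is a hypothetical placeholder, not the server's actual launch command.

```python
import json


def http_config(server_name, url, headers=None):
    """Build a config entry for a hosted streamable HTTP endpoint."""
    entry = {"url": url}
    if headers:
        entry["headers"] = dict(headers)
    return {"mcpServers": {server_name: entry}}


def stdio_config(server_name, command, args, env):
    """Build a config entry for a server launched locally over stdio."""
    entry = {"command": command, "args": list(args), "env": dict(env)}
    return {"mcpServers": {server_name: entry}}


if __name__ == "__main__":
    remote = http_config(
        "idea-research-dino-x-mcp",
        "https://mcp.deepdataspace.com/mcp?key=your-api-key",
    )
    local = stdio_config(
        "idea-research-dino-x-mcp",
        "node",                          # placeholder command (assumption)
        ["/path/to/server/entry.js"],    # placeholder script path (assumption)
        {
            "DINOX_API_KEY": "your-api-key",
            "IMAGE_STORAGE_DIRECTORY": "/path/to/your/image/directory",
        },
    )
    print(json.dumps(remote, indent=2))
    print(json.dumps(local, indent=2))
```

The HTTP entry needs only a URL, while the stdio entry carries everything required to start the process on your machine.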

Available tools

detect-all-objects

Full-scene object detection that returns each object's category and bounding box, with optional captions.

detect-objects-by-text

Text-prompted object detection: supply English nouns naming the target objects, and the tool returns their bounding boxes plus optional captions.

detect-human-pose-keypoints

Human pose estimation providing 17 keypoints, a bounding box, and optional captions.

visualize-detection-result

Visualization tool that generates an annotated image from an input image and its detection results, saving it to a local path.
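
MCP clients invoke these tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request and tallies categories from a detection result. The tool names come from the list above; the argument names (`imageFileUri`, `textPrompt`) and the result shape (a list of objects with a `category` field) are assumptions for illustration only. Check the input schema each tool reports through `tools/list` before relying on them.

```python
import json
from collections import Counter
from itertools import count

_request_ids = count(1)  # simple incrementing JSON-RPC request ids


def build_tool_call(tool_name, arguments):
    """Build an MCP `tools/call` JSON-RPC 2.0 request as a dict."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": dict(arguments)},
    }


def count_categories(objects):
    """Count detected objects per category (hypothetical result shape)."""
    return Counter(obj["category"] for obj in objects)


if __name__ == "__main__":
    request = build_tool_call(
        "detect-objects-by-text",
        {
            # Argument names below are assumptions, not the documented schema.
            "imageFileUri": "https://example.com/street.jpg",
            "textPrompt": "person.car",
        },
    )
    print(json.dumps(request, indent=2))

    # Made-up sample data standing in for a parsed detection result.
    sample = [{"category": "car"}, {"category": "car"}, {"category": "person"}]
    print(dict(count_categories(sample)))
```

Most MCP client libraries build and send this request for you; seeing the raw shape is mainly useful when debugging or logging traffic.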