---
name: datadog
description: Query Datadog APM traces, logs, metrics, SLOs, security signals, and service catalog. Create monitors, dashboards, and synthetic tests. Manage incidents and workflows. Use when investigating performance issues, searching logs, checking SLOs, analyzing costs, or automating Datadog operations.
---

# Datadog Operations Skill

Comprehensive Datadog automation: query APM/logs/metrics/RUM/database, create monitors/dashboards/synthetics, manage incidents, trigger workflows, analyze costs and LLM usage.

## Quick Setup

### macOS
```bash
./setup.sh
```

### Linux
```bash
./setup-linux.sh
```

### Windows (PowerShell as Administrator)
```powershell
.\setup-windows.ps1
```

## Platform-Specific Setup

### macOS

**Install jq** (required for JSON processing):
```bash
# Homebrew (recommended)
brew install jq

# MacPorts
sudo port install jq

# Direct binary (Apple Silicon) — ensure ~/bin exists and is on your PATH
mkdir -p ~/bin
curl -L -o ~/bin/jq https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-macos-arm64
chmod +x ~/bin/jq

# Direct binary (Intel)
mkdir -p ~/bin
curl -L -o ~/bin/jq https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-macos-amd64
chmod +x ~/bin/jq
```

**Set environment variables** in `~/.zshrc`:
```bash
export DD_API_KEY="your_api_key"
export DD_APP_KEY="your_application_key"
export DD_SITE="datadoghq.com"
```

### Linux

**Install dependencies:**
```bash
# Debian/Ubuntu
sudo apt-get install -y jq curl bc

# Fedora/RHEL
sudo dnf install -y jq curl bc

# Arch
sudo pacman -S jq curl bc
```

**Set environment variables** in `~/.bashrc`:
```bash
export DD_API_KEY="your_api_key"
export DD_APP_KEY="your_application_key"
export DD_SITE="datadoghq.com"
```

### Windows

**Option 1: WSL (Recommended)**
```powershell
wsl --install
```
Then run the Linux setup inside the WSL shell:
```bash
./setup-linux.sh
```

**Option 2: Git Bash**
Install Git for Windows, then run `./setup.sh` in Git Bash.

**Option 3: Native PowerShell**
```powershell
# Install jq
winget install jqlang.jq
# Or: choco install jq

# Set environment variables (User scope; takes effect in new sessions)
[Environment]::SetEnvironmentVariable("DD_API_KEY", "your_api_key", "User")
[Environment]::SetEnvironmentVariable("DD_APP_KEY", "your_application_key", "User")
[Environment]::SetEnvironmentVariable("DD_SITE", "datadoghq.com", "User")

Use Python scripts directly: `python python\script_name.py`

## Required Permissions

Get keys from Datadog → Organization Settings → API Keys / Application Keys

**For workflow automation (`trigger-workflow.sh`):**
Enable "Actions API Access" on your app key:
Datadog → Organization Settings → Application Keys → click the key → enable "Actions API Access"
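
Once the keys are exported, a quick sanity check is Datadog's key-validation endpoint. This is a minimal sketch: `dd_api_url` and `validate_dd_api_key` are hypothetical helpers (not part of the skill's scripts), and `/api/v1/validate` checks only the API key, not the application key.

```shell
# Build the base API URL from DD_SITE (defaults to datadoghq.com)
dd_api_url() {
  local site="${1:-${DD_SITE:-datadoghq.com}}"
  printf 'https://api.%s' "$site"
}

# Validate the API key against /api/v1/validate; prints "true" or "false"
validate_dd_api_key() {
  curl -s -H "DD-API-KEY: ${DD_API_KEY}" \
    "$(dd_api_url)/api/v1/validate" | jq -r '.valid'
}
```

If `validate_dd_api_key` prints `false`, fix the API key before debugging anything else; application-key problems surface later as 403s on endpoints that require both headers.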

## Quick Reference

### Investigation Scripts

| Script | Purpose | Example |
|--------|---------|---------|
| `query-apm.sh` | Find slow endpoints | `bash scripts/query-apm.sh --service my-service --duration 1h` |
| `search-logs.sh` | Search logs for errors | `bash scripts/search-logs.sh --query "status:error" --duration 1h` |
| `query-security-signals.sh` | Find security threats | `bash scripts/query-security-signals.sh --severity critical` |
| `query-watchdog.sh` | Anomaly detection | `bash scripts/query-watchdog.sh --service my-service` |
| `query-metrics.sh` | Fetch metrics data | `bash scripts/query-metrics.sh --metric "system.cpu.user"` |
| `analyze-usage-cost.sh` | FinOps cost analysis | `bash scripts/analyze-usage-cost.sh --duration 30d` |
| `analyze-llm.sh` | LLM observability | `bash scripts/analyze-llm.sh --service my-llm-app` |
| `query-slos.sh` | SLO status | `bash scripts/query-slos.sh --service payment-api` |
| `query-service-catalog.sh` | Service metadata | `bash scripts/query-service-catalog.sh list` |
| `query-database.sh` | DB performance | `bash scripts/query-database.sh --host postgres-prod` |
| `query-rum.sh` | Frontend performance | `bash scripts/query-rum.sh --application abc-123` |
| `query-kubernetes.sh` | K8s workloads | `bash scripts/query-kubernetes.sh --cluster prod` |
| `query-containers.sh` | Container metrics | `bash scripts/query-containers.sh --duration 1h` |
| `query-network.sh` | Network monitoring | `bash scripts/query-network.sh --duration 1h` |

### Automation Scripts

| Script | Purpose | Example |
|--------|---------|---------|
| `manage-monitors.sh` | Create/mute monitors | `bash scripts/manage-monitors.sh list` |
| `create-dashboard.sh` | Generate dashboards | `bash scripts/create-dashboard.sh --service my-service` |
| `trigger-workflow.sh` | Execute workflows | `bash scripts/trigger-workflow.sh list` |
| `manage-incidents.sh` | Incident management | `bash scripts/manage-incidents.sh list` |
| `manage-synthetics.sh` | Synthetic tests | `bash scripts/manage-synthetics.sh list` |
| `manage-on-call.sh` | On-call scheduling | `bash scripts/manage-on-call.sh list` |
| `manage-status-pages.sh` | Status pages | `bash scripts/manage-status-pages.sh list` |
| `verify-setup.sh` | Validate config | `bash scripts/verify-setup.sh` |

## Workflows

### Investigate Production Issue
```bash
bash scripts/query-watchdog.sh --service affected-service --duration 24h
bash scripts/query-apm.sh --service affected-service --duration 1h
bash scripts/search-logs.sh --service affected-service --status error --duration 1h
```

### Check Cost & Usage
```bash
bash scripts/analyze-usage-cost.sh --duration 30d --product all
bash scripts/analyze-usage-cost.sh --duration 30d | jq '.recommendations[] | select(.priority == "high")'
```

### Monitor LLM Application
```bash
bash scripts/analyze-llm.sh --service my-genai-app --duration 24h
```

## Output Format

All scripts write structured JSON to stdout and status messages to stderr:
```bash
bash scripts/query-apm.sh --service my-service 2>/dev/null | jq '.summary'
```
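
The same summary-extraction pattern works offline against a fixture. This sketch uses an inline payload whose shape is illustrative: the `summary` field name follows the example above, but the inner fields (`p95_ms`, `error_rate`) are assumptions, not the scripts' actual schema.

```shell
# Illustrative payload shaped like a script's stdout (fixture, not real output)
payload='{"summary":{"p95_ms":412,"error_rate":0.02},"traces":[]}'

# Extract just the summary, as in the pipeline above
echo "$payload" | jq -c '.summary'
# → {"p95_ms":412,"error_rate":0.02}
```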

## Python Alternative

Python versions available in `python/` directory:
```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -r python/requirements.txt
python python/query_apm.py --service my-service --json
```

## Go CLI Alternative

Single-binary CLI available in `../dd-skill-test-go/`:
```bash
# Native on all platforms - no dependencies
datadog-cli apm --service my-service --duration 1h
datadog-cli logs --query "status:error"
datadog-cli monitors list
datadog-cli health
```

See `../dd-skill-test-go/README.md` for installation.

## Notes

- **Advanced Scripts:** Some scripts require Python. Run `./setup.sh` first.
- **Windows:** Use WSL, Git Bash, or Python scripts directly.
- **Go CLI:** Recommended for Windows - native binary, no dependencies.

## Overview

This skill automates Datadog operations: querying APM traces, logs, metrics, RUM, SLOs, security signals, and the service catalog, and creating and managing monitors, dashboards, synthetic tests, incidents, and workflows for fast investigation and routine automation. The toolset includes Bash scripts, Python alternatives, and a cross-platform Go CLI distributed as a single native binary.

## How This Skill Works

Scripts and CLI calls query the Datadog APIs and return structured JSON to stdout for easy parsing. Common workflows include searching traces and logs, fetching metric time series, checking SLO status, analyzing costs and LLM usage, and executing Datadog actions such as creating monitors or triggering workflows. Setup scripts bootstrap dependencies and environment variables on macOS, Linux, and Windows (including WSL and Git Bash).
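
As a sketch of what such a wrapped call looks like, the following queries a metric time series via the v1 metrics query endpoint; the specific endpoints each script uses are an assumption here, and the `curl` call is guarded so it only runs when keys are present.

```shell
# Query a metric time series over the last hour (skipped if keys are unset)
now=$(date +%s)
from=$((now - 3600))  # one-hour window in epoch seconds

if [ -n "${DD_API_KEY:-}" ] && [ -n "${DD_APP_KEY:-}" ]; then
  curl -s -G "https://api.${DD_SITE:-datadoghq.com}/api/v1/query" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
    --data-urlencode "query=avg:system.cpu.user{*}" \
    --data-urlencode "from=${from}" \
    --data-urlencode "to=${now}"
fi
```

The response is JSON, so it composes with `jq` exactly like the skill's own scripts do.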

## When to Use It

- Investigating production performance issues across APM, RUM, and logs
- Automating the creation and maintenance of monitors, dashboards, and synthetic tests
- Managing incidents and triggering operational workflows programmatically
- Auditing SLOs and service catalog state for reliability checks
- Running FinOps and LLM usage/cost analysis over a time window

## Best Practices

- Store `DD_API_KEY` and `DD_APP_KEY` as environment variables and scope keys to least privilege
- Use the Go CLI binary for native cross-platform runs on Windows to avoid extra dependencies
- Pipe JSON output to `jq` or programmatic parsers to extract summaries and automate alerts
- Enable "Actions API Access" on your application key to run workflow-triggering scripts
- Run `verify-setup.sh` after initial install to validate connectivity and permissions

## Example Use Cases

- Run `query-apm.sh` to find slow endpoints for a service over the last hour and export a summary
- Search logs for error spikes with `search-logs.sh` and correlate the results with CPU or latency anomalies from `query-metrics.sh`
- Create a dashboard and monitors for a newly onboarded service with `create-dashboard.sh` and `manage-monitors.sh`
- Run an incident-triage sequence that queries Watchdog, APM, and logs, then triggers a remediation workflow
- Run `analyze-usage-cost.sh` to produce FinOps recommendations and filter high-priority items with `jq`

## FAQ

**What keys and permissions are required?**

A Datadog API key and an application key. To trigger workflows, also enable "Actions API Access" on the application key.

**Which platform should I use on Windows?**

WSL or Git Bash is recommended; alternatively, use the Go CLI binary for native execution without extra dependencies.

**How do scripts return data for automation?**

All scripts print structured JSON to stdout and status messages to stderr, so consume stdout with `jq` or another JSON parser.
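
The stream separation can be demonstrated with a stand-in command. This is a sketch: `fake_script` simulates a script's output and is not part of the skill.

```shell
# Simulate a script: JSON on stdout, progress message on stderr
fake_script() {
  echo "querying Datadog..." >&2
  echo '{"summary":{"errors":3}}'
}

# Capture only the JSON; the status line stays on stderr and is discarded
data=$(fake_script 2>/dev/null)
echo "$data"
# → {"summary":{"errors":3}}
```

Because the status chatter never lands on stdout, `$data` is always valid JSON and safe to feed to a parser.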