
This skill helps you monitor resource usage and performance for Railway services by querying CPU, memory, and network metrics.

npx playbooks add skill railwayapp/railway-skills --skill metrics

---
name: metrics
description: This skill should be used when the user asks about resource usage, CPU, memory, network, disk, or service performance. Covers questions like "how much memory is my service using" or "is my service slow".
allowed-tools: Bash(railway:*)
---

# Service Metrics

Query resource usage metrics for Railway services.

## When to Use

- User asks "how much memory is my service using?"
- User asks about CPU usage, network traffic, disk usage
- User wants to debug performance issues
- User asks "is my service healthy?" (combine with `service` skill)

## Prerequisites

Get `environmentId` and `serviceId` from the linked project:

```bash
railway status --json
```

Extract:
- `environment.id` → environmentId
- `service.id` → serviceId (optional - omit to get all services)
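
The extraction step can be sketched with `jq`. This is a minimal example that assumes `jq` is installed and that the JSON carries the `environment.id` and `service.id` fields listed above; the `extract_ids` helper name is hypothetical and not part of the Railway CLI.

```bash
# Hypothetical helper: read `railway status --json` output on stdin and
# print "ENV_ID SERVICE_ID" on one line. Missing fields become empty strings.
extract_ids() {
  jq -r '[.environment.id // "", .service.id // ""] | join(" ")'
}

# Usage (requires a linked project):
#   read -r ENV_ID SERVICE_ID < <(railway status --json | extract_ids)
```

If no service is linked, `SERVICE_ID` ends up empty, which matches the "omit to get all services" case.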

## MetricMeasurement Values

| Measurement | Description |
|-------------|-------------|
| CPU_USAGE | CPU usage (cores) |
| CPU_LIMIT | CPU limit (cores) |
| MEMORY_USAGE_GB | Memory usage in GB |
| MEMORY_LIMIT_GB | Memory limit in GB |
| NETWORK_RX_GB | Network received in GB |
| NETWORK_TX_GB | Network transmitted in GB |
| DISK_USAGE_GB | Disk usage in GB |
| EPHEMERAL_DISK_USAGE_GB | Ephemeral disk usage in GB |
| BACKUP_USAGE_GB | Backup usage in GB |

## MetricTag Values (for groupBy)

| Tag | Description |
|-----|-------------|
| DEPLOYMENT_ID | Group by deployment |
| DEPLOYMENT_INSTANCE_ID | Group by instance |
| REGION | Group by region |
| SERVICE_ID | Group by service |

## Query

```graphql
query metrics(
  $environmentId: String!
  $serviceId: String
  $startDate: DateTime!
  $endDate: DateTime
  $sampleRateSeconds: Int
  $averagingWindowSeconds: Int
  $groupBy: [MetricTag!]
  $measurements: [MetricMeasurement!]!
) {
  metrics(
    environmentId: $environmentId
    serviceId: $serviceId
    startDate: $startDate
    endDate: $endDate
    sampleRateSeconds: $sampleRateSeconds
    averagingWindowSeconds: $averagingWindowSeconds
    groupBy: $groupBy
    measurements: $measurements
  ) {
    measurement
    tags {
      deploymentInstanceId
      deploymentId
      serviceId
      region
    }
    values {
      ts
      value
    }
  }
}
```

## Example: Last Hour CPU and Memory

Use a heredoc to avoid shell escaping issues:

```bash
bash <<'SCRIPT'
START_DATE=$(date -u -v-1H +"%Y-%m-%dT%H:%M:%SZ" 2>/dev/null || date -u -d "1 hour ago" +"%Y-%m-%dT%H:%M:%SZ")
ENV_ID="your-environment-id"
SERVICE_ID="your-service-id"

VARS=$(jq -n \
  --arg env "$ENV_ID" \
  --arg svc "$SERVICE_ID" \
  --arg start "$START_DATE" \
  '{environmentId: $env, serviceId: $svc, startDate: $start, measurements: ["CPU_USAGE", "MEMORY_USAGE_GB"]}')

scripts/railway-api.sh \
  'query metrics($environmentId: String!, $serviceId: String, $startDate: DateTime!, $measurements: [MetricMeasurement!]!) {
    metrics(environmentId: $environmentId, serviceId: $serviceId, startDate: $startDate, measurements: $measurements) {
      measurement
      tags { deploymentId region serviceId }
      values { ts value }
    }
  }' \
  "$VARS"
SCRIPT
```

## Example: All Services in Environment

Omit serviceId and use groupBy to get metrics for all services:

```bash
bash <<'SCRIPT'
START_DATE=$(date -u -v-1H +"%Y-%m-%dT%H:%M:%SZ" 2>/dev/null || date -u -d "1 hour ago" +"%Y-%m-%dT%H:%M:%SZ")
ENV_ID="your-environment-id"

VARS=$(jq -n \
  --arg env "$ENV_ID" \
  --arg start "$START_DATE" \
  '{environmentId: $env, startDate: $start, measurements: ["CPU_USAGE", "MEMORY_USAGE_GB"], groupBy: ["SERVICE_ID"]}')

scripts/railway-api.sh \
  'query metrics($environmentId: String!, $startDate: DateTime!, $measurements: [MetricMeasurement!]!, $groupBy: [MetricTag!]) {
    metrics(environmentId: $environmentId, startDate: $startDate, measurements: $measurements, groupBy: $groupBy) {
      measurement
      tags { serviceId region }
      values { ts value }
    }
  }' \
  "$VARS"
SCRIPT
```

## Time Parameters

| Parameter | Description |
|-----------|-------------|
| startDate | Required. ISO 8601 format (e.g., `2024-01-01T00:00:00Z`) |
| endDate | Optional. Defaults to now |
| sampleRateSeconds | Sample interval (e.g., 60 for 1-minute samples) |
| averagingWindowSeconds | Averaging window for smoothing |

**Tip:** For last hour, calculate startDate as `now - 1 hour` in ISO format.
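
As a sketch of how the time parameters fit together, the helper below builds a variables object for the last N hours with an explicit sample rate and averaging window. The `build_metrics_vars` name and its argument order are assumptions for illustration. The one detail worth noting is `--argjson`: the query declares `sampleRateSeconds` and `averagingWindowSeconds` as `Int`, so they should be passed as JSON numbers rather than strings.

```bash
# Hypothetical helper: build the variables object for a query window of the
# last N hours, sampled every RATE seconds and averaged over WINDOW seconds.
build_metrics_vars() {
  local env_id=$1 hours=$2 rate=$3 window=$4 start
  # Try BSD date (macOS) first, then fall back to GNU date.
  start=$(date -u -v-"${hours}"H +"%Y-%m-%dT%H:%M:%SZ" 2>/dev/null \
    || date -u -d "${hours} hours ago" +"%Y-%m-%dT%H:%M:%SZ")
  # --argjson keeps rate and window as JSON numbers (the schema type is Int).
  jq -n --arg env "$env_id" --arg start "$start" \
    --argjson rate "$rate" --argjson window "$window" \
    '{environmentId: $env, startDate: $start,
      sampleRateSeconds: $rate, averagingWindowSeconds: $window,
      measurements: ["CPU_USAGE", "MEMORY_USAGE_GB"]}'
}

# Example: last 24 hours, 60-second samples, 300-second averaging window.
VARS=$(build_metrics_vars "your-environment-id" 24 60 300)
```

The resulting `$VARS` can be passed as the second argument to `scripts/railway-api.sh`, as in the examples above.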

## Output Interpretation

```json
{
  "data": {
    "metrics": [
      {
        "measurement": "CPU_USAGE",
        "tags": { "deploymentId": "...", "serviceId": "...", "region": "us-west1" },
        "values": [
          { "ts": "2024-01-01T00:00:00Z", "value": 0.25 },
          { "ts": "2024-01-01T00:01:00Z", "value": 0.30 }
        ]
      }
    ]
  }
}
```

- `ts` - timestamp in ISO format
- `value` - metric value (cores for CPU, GB for memory/disk/network)
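
To turn a response like the one above into something readable, one option is to reduce each series to its peak and average. This is a minimal sketch that assumes the response shape shown in this section; `summarize_metrics` is a hypothetical helper name.

```bash
# Hypothetical helper: read a metrics response on stdin and print one
# tab-separated line per non-empty series:
#   measurement <TAB> serviceId <TAB> peak <TAB> average
summarize_metrics() {
  jq -r '.data.metrics[]?
    | select((.values // []) | length > 0)
    | [ .measurement,
        (.tags.serviceId // "-"),
        ([.values[].value] | max),
        ([.values[].value] | add / length) ]
    | @tsv'
}

# Usage:
#   scripts/railway-api.sh "$QUERY" "$VARS" | summarize_metrics
```

Series with empty or missing `values` are skipped, so the helper prints nothing for services that have no data in the requested window.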

## Composability

- **Get IDs**: Use `status` skill or `railway status --json`
- **Check service health**: Use `service` skill for deployment status
- **View logs**: Use `deployment` skill if metrics show issues
- **Scale service**: Use `environment` skill to adjust resources

## Error Handling

### Empty/Null Metrics

Services without active deployments return empty metrics arrays. When processing with jq, handle nulls:

```bash
# Safe iteration - skip nulls
jq -r '.data.metrics[]? | select(.values != null and (.values | length) > 0) | ...'

# Check if metrics exist before processing
jq -e '.data.metrics | length > 0' response.json && echo "has metrics"
```

### No Metrics Data

The service may be new or have no traffic. Check that:
- The service has an active deployment (stopped services have no metrics)
- The time range includes the deployment period

### Invalid Service/Environment ID

Verify IDs with `railway status --json`.

### Permission Denied

The user needs access to the project to query metrics.

Overview

This skill queries resource usage and performance metrics for Railway services. It returns CPU, memory, network, disk, and related measurements over a specified time range to help diagnose performance or capacity issues. Use it alongside service and deployment checks to assess health and root causes.

How this skill works

It calls the Railway metrics GraphQL endpoint with environmentId (and optional serviceId), a startDate (and optional endDate), requested measurements, and grouping tags. The response includes timestamped values per measurement and optional tags such as serviceId, deploymentId, and region. Clients typically run a small shell wrapper that builds the JSON variables and posts the GraphQL query.
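
As a rough illustration of the "builds the JSON variables and posts the GraphQL query" step, the sketch below assembles only the request body. It follows the usual GraphQL-over-HTTP convention of a JSON object with `query` and `variables` keys. The `build_graphql_body` helper is hypothetical, and the endpoint URL and authentication are deliberately left out because they are handled by the skill's own wrapper script, which is not shown here.

```bash
# Hypothetical helper: wrap a GraphQL query string and a variables object
# into the JSON body that a GraphQL-over-HTTP POST carries.
build_graphql_body() {
  local query=$1 vars=$2
  jq -n --arg query "$query" --argjson variables "$vars" \
    '{query: $query, variables: $variables}'
}

QUERY='query metrics($environmentId: String!, $startDate: DateTime!, $measurements: [MetricMeasurement!]!) {
  metrics(environmentId: $environmentId, startDate: $startDate, measurements: $measurements) {
    measurement
    values { ts value }
  }
}'
VARS='{"environmentId": "your-environment-id", "startDate": "2024-01-01T00:00:00Z", "measurements": ["CPU_USAGE"]}'

# Print the body that would be posted.
build_graphql_body "$QUERY" "$VARS"
```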

When to use it

  • You want current or historical CPU, memory, disk, or network usage for a service.
  • You need to determine if a service is slow or under resource pressure.
  • You want per-deployment or per-instance metrics by grouping tags.
  • You are validating capacity needs before scaling a service.
  • You need to correlate logs and deployments with resource spikes.

Best practices

  • Obtain environmentId and serviceId from railway status --json before querying.
  • Always set startDate in ISO 8601 format; omit endDate to use now.
  • Use sensible sampleRateSeconds and averagingWindowSeconds to reduce noise (e.g., 60s sample, 300s window).
  • Group by SERVICE_ID or DEPLOYMENT_INSTANCE_ID for targeted troubleshooting.
  • Handle empty or null metrics safely: services without active deployments return no data.

Example use cases

  • Get last hour CPU and memory for a single service to check recent spikes.
  • Query all services in an environment grouped by SERVICE_ID to find the top resource consumers.
  • Fetch per-instance CPU_USAGE to identify a noisy deployment instance.
  • Compare MEMORY_USAGE_GB to MEMORY_LIMIT_GB to detect memory pressure before OOMs.
  • Collect NETWORK_RX_GB and NETWORK_TX_GB to investigate traffic anomalies.
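
The memory-pressure use case can be sketched as follows. The example assumes a single-service response that requested both `MEMORY_USAGE_GB` and `MEMORY_LIMIT_GB`, and it compares only the most recent sample of each. `memory_pct` is a hypothetical helper, and what counts as a worrying percentage is left to the reader.

```bash
# Hypothetical helper: read a metrics response on stdin and print the latest
# memory usage as a whole-number percentage of the latest memory limit.
# Prints "n/a" when either series is missing or the limit is zero.
memory_pct() {
  jq -r '
    (.data.metrics // []) as $m
    | ($m | map(select(.measurement == "MEMORY_USAGE_GB")) | first | .values | last | .value) as $used
    | ($m | map(select(.measurement == "MEMORY_LIMIT_GB")) | first | .values | last | .value) as $limit
    | if $used == null or $limit == null or $limit == 0 then "n/a"
      else ($used / $limit * 100 | floor | tostring) + "%"
      end'
}

# Usage:
#   scripts/railway-api.sh "$QUERY" "$VARS" | memory_pct
```

With `groupBy` in play, the response holds several series per measurement. This sketch only looks at the first one, so it would need a per-tag loop for multi-service queries.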

FAQ

What measurements are available?

Common measurements include CPU_USAGE, CPU_LIMIT, MEMORY_USAGE_GB, MEMORY_LIMIT_GB, NETWORK_RX_GB, NETWORK_TX_GB, DISK_USAGE_GB, EPHEMERAL_DISK_USAGE_GB, and BACKUP_USAGE_GB.

How do I get environmentId and serviceId?

Run railway status --json and extract environment.id and service.id (omit serviceId to query all services).

Why am I getting no metrics?

An empty result can mean the service has no active deployment, the time range excludes the deployment's lifetime, or you lack project permissions. Verify the deployment status and the IDs.