---
name: building-dashboards
description: Designs and builds Axiom dashboards via API. Covers chart types, APL patterns, SmartFilters, layout, and configuration options. Use when creating dashboards, migrating from Splunk, or configuring chart options.
---
# Building Dashboards
You design dashboards that help humans make decisions quickly. Dashboards are products: audience, questions, and actions matter more than chart count.
## Philosophy
1. **Decisions first.** Every panel answers a question that leads to an action.
2. **Overview → drilldown → evidence.** Start broad, narrow on click/filter, end with raw logs.
3. **Rates and percentiles over averages.** Averages hide problems; p95/p99 expose them.
4. **Simple beats dense.** One question per panel. No chart junk.
5. **Validate with data.** Never guess fields—discover schema first.
---
## Entry Points
Choose your starting point:
| Starting from | Workflow |
|---------------|----------|
| **Vague description** | Intake → design blueprint → APL per panel → deploy |
| **Template** | Pick template → customize dataset/service/env → deploy |
| **Splunk dashboard** | Extract SPL → translate via spl-to-apl → map to chart types → deploy |
| **Exploration** | Use axiom-sre to discover schema/signals → productize into panels |
---
## Intake: What to Ask First
Before designing, clarify:
1. **Audience & decision**
- Oncall triage? (fast refresh, error-focused)
- Team health? (daily trends, SLO tracking)
- Exec reporting? (weekly summaries, high-level)
2. **Scope**
- Service, environment, region, cluster, endpoint?
- Single service or cross-service view?
3. **Datasets**
- Which Axiom datasets contain the data?
- Run `getschema` to discover fields—never guess:
```apl
['dataset'] | where _time between (ago(1h) .. now()) | getschema
```
4. **Golden signals**
- Traffic: requests/sec, events/min
- Errors: error rate, 5xx count
- Latency: p50, p95, p99 duration
- Saturation: CPU, memory, queue depth, connections
5. **Drilldown dimensions**
- What do users filter/group by? (service, route, status, pod, customer_id)
---
## Dashboard Blueprint
Use this 4-section structure as the default:
### 1. At-a-Glance (Statistic panels)
Single numbers that answer "is it broken right now?"
- Error rate (last 5m)
- p95 latency (last 5m)
- Request rate (last 5m)
- Active alerts (if applicable)
### 2. Trends (TimeSeries panels)
Time-based patterns that answer "what changed?"
- Traffic over time
- Error rate over time
- Latency percentiles over time
- Stacked by status/service for comparison
### 3. Breakdowns (Table/Pie panels)
Top-N analysis that answers "where should I look?"
- Top 10 failing routes
- Top 10 error messages
- Worst pods by error rate
- Request distribution by status
### 4. Evidence (LogStream + SmartFilter)
Raw events that answer "what exactly happened?"
- LogStream filtered to errors
- SmartFilter for service/env/route
- Key fields projected for readability
---
## Layout Auto-Normalization
The console uses `react-grid-layout` which requires `minH`, `minW`, `moved`, and `static` on every layout entry. The `dashboard-create` and `dashboard-update` scripts auto-fill these if omitted, so layout entries only need `i`, `x`, `y`, `w`, `h`.
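A minimal layout entry is therefore just the position and size (the id `panel-a` is a placeholder — presumably it must match the corresponding chart's id in the dashboard JSON):

```json
{ "i": "panel-a", "x": 0, "y": 0, "w": 6, "h": 4 }
```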
---
## Chart Types
**Note:** Dashboard queries inherit time from the UI picker—no explicit `_time` filter needed.
**Validation:** TimeSeries, Statistic, Table, Pie, LogStream, Note, MonitorList are fully validated by `dashboard-validate`. Heatmap, ScatterPlot, SmartFilter work but may trigger warnings.
### Statistic
**When:** Single KPI, current value, threshold comparison.
```apl
['logs']
| where service == "api"
| summarize
total = count(),
errors = countif(status >= 500)
| extend error_rate = round(100.0 * errors / total, 2)
| project error_rate
```
**Pitfalls:** Don't use for time series; ensure the query returns a single row.
### TimeSeries
**When:** Trends over time, before/after comparison, rate changes.
```apl
// Single metric - use bin_auto for automatic sizing
['logs']
| summarize ['req/min'] = count() by bin_auto(_time)
// Latency percentiles - use percentiles_array for proper overlay
['logs']
| summarize percentiles_array(duration_ms, 50, 95, 99) by bin_auto(_time)
```
**Best practices:**
- Use `bin_auto(_time)` instead of fixed `bin(_time, 1m)` — auto-adjusts to time window
- Use `percentiles_array()` instead of multiple `percentile()` calls — renders as one chart
- Too many series = unreadable; use `top N` or filter
### Table
**When:** Top-N lists, detailed breakdowns, exportable data.
```apl
['logs']
| where status >= 500
| summarize errors = count() by route, error_message
| top 10 by errors
| project route, error_message, errors
```
**Pitfalls:**
- Always use `top N` to prevent unbounded results
- Use `project` to control column order and names
### Pie
**When:** Share-of-total for LOW cardinality dimensions (≤6 slices).
```apl
['logs']
| summarize count() by status_class = case(
status < 300, "2xx",
status < 400, "3xx",
status < 500, "4xx",
"5xx"
)
```
**Pitfalls:**
- Never use for high cardinality (routes, user IDs)
- Prefer tables for >6 categories
- Always aggregate to reduce slices
### LogStream
**When:** Raw event inspection, debugging, evidence gathering.
```apl
['logs']
| where service == "api" and status >= 500
| project-keep _time, trace_id, route, status, error_message, duration_ms
| take 100
```
**Pitfalls:**
- Always include `take N` (100-500 max)
- Use `project-keep` to show relevant fields only
- Filter aggressively—raw logs are expensive
### Heatmap
**When:** Distribution visualization, latency patterns, density analysis.
```apl
['logs']
| summarize histogram(duration_ms, 15) by bin_auto(_time)
```
**Best for:** Latency distributions, response time patterns, identifying outliers.
### Scatter Plot
**When:** Correlation between two metrics, identifying patterns.
```apl
['logs']
| summarize avg(duration_ms), avg(resp_size_bytes) by route
```
**Best for:** Response size vs latency correlation, resource usage patterns.
### SmartFilter (Filter Bar)
**When:** Interactive filtering for the entire dashboard.
SmartFilter is a **chart type** that creates dropdown/search filters. Requires:
1. A `SmartFilter` chart with filter definitions
2. `declare query_parameters` in each panel query
**Filter types:**
- `selectType: "apl"` — Dynamic dropdown from APL query
- `selectType: "list"` — Static dropdown with predefined options
- `type: "search"` — Free-text input
**Panel query pattern:**
```apl
declare query_parameters (country_filter:string = "");
['logs'] | where isempty(country_filter) or ['geo.country'] == country_filter
```
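The same pattern extends to multiple filters — one declared parameter per SmartFilter (parameter and field names below are illustrative):

```apl
declare query_parameters (service_filter:string = "", route_filter:string = "");
['logs']
| where isempty(service_filter) or service == service_filter
| where isempty(route_filter) or route == route_filter
| summarize count() by bin_auto(_time)
```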
See `reference/smartfilter.md` for full JSON structure and cascading filter examples.
### Monitor List
**When:** Display monitor status on operational dashboards.
No APL needed—select monitors from the UI. Shows:
- Monitor status (normal/triggered/off)
- Run history (green/red squares)
- Dataset, type, notifiers
### Note
**When:** Context, instructions, section headers.
Use GitHub Flavored Markdown for:
- Dashboard purpose and audience
- Runbook links
- Section dividers
- On-call instructions
---
## Chart Configuration
Charts support JSON configuration options beyond the query. See `reference/chart-config.md` for full details.
**Quick reference:**
| Chart Type | Key Options |
|------------|-------------|
| Statistic | `colorScheme`, `customUnits`, `unit`, `showChart` (sparkline), `errorThreshold`/`warningThreshold` |
| TimeSeries | `aggChartOpts`: `variant` (line/area/bars), `scaleDistr` (linear/log), `displayNull` |
| LogStream/Table | `tableSettings`: `columns`, `fontSize`, `highlightSeverity`, `wrapLines` |
| Pie | `hideHeader` |
| Note | `text` (markdown), `variant` |
**Common options (all charts):**
- `overrideDashboardTimeRange`: boolean
- `overrideDashboardCompareAgainst`: boolean
- `hideHeader`: boolean
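As a sketch, a Statistic panel showing error rate as a percentage with warning/error coloring might carry options like these (the exact JSON nesting is defined in `reference/chart-config.md`; treat this shape as illustrative):

```json
{
  "unit": "percent",
  "warningThreshold": 1,
  "errorThreshold": 5,
  "showChart": true,
  "overrideDashboardTimeRange": false
}
```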
---
## APL Patterns
### Time Filtering in Dashboards vs Ad-hoc Queries
**Dashboard panel queries do NOT need explicit time filters.** The dashboard UI time picker automatically scopes all queries to the selected time window.
```apl
// DASHBOARD QUERY — no time filter needed
['logs']
| where service == "api"
| summarize count() by bin_auto(_time)
```
**Ad-hoc queries (Axiom Query tab, axiom-sre exploration) MUST have explicit time filters:**
```apl
// AD-HOC QUERY — always include time filter
['logs']
| where _time between (ago(1h) .. now())
| where service == "api"
| summarize count() by bin_auto(_time)
```
### Bin Size Selection
**Prefer `bin_auto(_time)`** — it automatically adjusts to the dashboard time window.
Manual bin sizes (only when auto doesn't fit your needs):
| Time window | Bin size |
|-------------|----------|
| 15m | 10s–30s |
| 1h | 1m |
| 6h | 5m |
| 24h | 15m–1h |
| 7d | 1h–6h |
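For example, pinning a 6h dashboard to 5-minute buckets:

```apl
['logs']
| summarize requests = count() by bin(_time, 5m)
```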
### Cardinality Guardrails
Prevent query explosion:
```apl
// GOOD: bounded
| summarize count() by route | top 10 by count_
// BAD: unbounded high-cardinality grouping
| summarize count() by user_id // millions of rows
```
### Field Escaping
Fields with dots need bracket notation:
```apl
| where ['kubernetes.pod.name'] == "frontend"
```
Fields with dots IN the name (not hierarchy) need escaping:
```apl
| where ['kubernetes.labels.app\\.kubernetes\\.io/name'] == "frontend"
```
### Golden Signal Queries
**Traffic:**
```apl
| summarize requests = count() by bin_auto(_time)
```
**Errors (as rate %):**
```apl
| summarize total = count(), errors = countif(status >= 500) by bin_auto(_time)
| extend error_rate = iff(total > 0, round(100.0 * errors / total, 2), 0.0)
| project _time, error_rate
```
**Latency (use percentiles_array for proper chart overlay):**
```apl
| summarize percentiles_array(duration_ms, 50, 95, 99) by bin_auto(_time)
```
---
## Layout Composition
### Grid Principles
- Dashboard width = 12 units
- Typical panel: w=3 (quarter), w=4 (third), w=6 (half), w=12 (full)
- Stats row: 4 panels × w=3, h=2
- TimeSeries row: 2 panels × w=6, h=4
- Tables: w=6 or w=12, h=4–6
- LogStream: w=12, h=6–8
### Section Layout Pattern
```
Row 0-1: [Stat w=3] [Stat w=3] [Stat w=3] [Stat w=3]
Row 2-5: [TimeSeries w=6, h=4] [TimeSeries w=6, h=4]
Row 6-9: [Table w=6, h=4] [Pie w=6, h=4]
Row 10+: [LogStream w=12, h=6]
```
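In dashboard JSON, the stats row and trends row of that pattern map to layout entries like these (panel ids are placeholders; `minH`/`minW`/`moved`/`static` are auto-filled by the create/update scripts):

```json
[
  { "i": "stat-1", "x": 0, "y": 0, "w": 3, "h": 2 },
  { "i": "stat-2", "x": 3, "y": 0, "w": 3, "h": 2 },
  { "i": "stat-3", "x": 6, "y": 0, "w": 3, "h": 2 },
  { "i": "stat-4", "x": 9, "y": 0, "w": 3, "h": 2 },
  { "i": "ts-1", "x": 0, "y": 2, "w": 6, "h": 4 },
  { "i": "ts-2", "x": 6, "y": 2, "w": 6, "h": 4 }
]
```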
### Naming Conventions
- Use question-style titles: "Error rate by route" not "Errors"
- Prefix with context if multi-service: "[API] Error rate"
- Include units: "Latency (ms)", "Traffic (req/s)"
---
## Dashboard Settings
### Refresh Rate
The dashboard auto-refreshes at the configured interval. Options: 15s, 30s, 1m, 5m, etc.
**⚠️ Query cost warning:** Short refresh (15s) + long time range (90d) = expensive queries running constantly.
Recommendations:
| Use case | Refresh rate |
|----------|-------------|
| Oncall/real-time | 15s–30s |
| Team health | 1m–5m |
| Executive/weekly | 5m–15m |
### Sharing
- **Just Me**: Private, only you can access
- **Group**: Specific team/group in your org
- **Everyone**: All users in your Axiom org
Data visibility is still governed by dataset permissions—users only see data from datasets they can access.
### URL Time Range Parameters
`?t_qr=24h` (quick range), `?t_ts=...&t_te=...` (custom), `?t_against=-1d` (comparison)
---
## Setup
Run `scripts/setup` to check requirements (curl, jq, ~/.axiom.toml).
Config in `~/.axiom.toml` (shared with axiom-sre):
```toml
[deployments.prod]
url = "https://api.axiom.co"
token = "xaat-your-token"
org_id = "your-org-id"
```
---
## Deployment
### Scripts
| Script | Usage |
|--------|-------|
| `scripts/get-user-id <deploy>` | Get your user ID for `owner` field |
| `scripts/dashboard-list <deploy>` | List all dashboards |
| `scripts/dashboard-get <deploy> <id>` | Fetch dashboard JSON |
| `scripts/dashboard-validate <file>` | Validate JSON structure |
| `scripts/dashboard-create <deploy> <file>` | Create dashboard |
| `scripts/dashboard-update <deploy> <id> <file>` | Update (needs version) |
| `scripts/dashboard-copy <deploy> <id>` | Clone dashboard |
| `scripts/dashboard-link <deploy> <id>` | Get shareable URL |
| `scripts/dashboard-delete <deploy> <id>` | Delete (with confirm) |
| `scripts/axiom-api <deploy> <method> <path>` | Low-level API calls |
### Workflow
**⚠️ CRITICAL: Always validate queries BEFORE deploying.**
1. Design dashboard (sections + panels)
2. Write APL for each panel
3. Build JSON (from template or manually)
4. **Validate queries** using axiom-sre with explicit time filter
5. `dashboard-validate` to check structure
6. `dashboard-create` or `dashboard-update` to deploy
7. **`dashboard-link` to get URL** — NEVER construct Axiom URLs manually (org IDs and base URLs vary per deployment)
8. Share link with user
---
## Sibling Skill Integration
**spl-to-apl:** Translate Splunk SPL → APL. Map `timechart` → TimeSeries, `stats` → Statistic/Table. See `reference/splunk-migration.md`.
**axiom-sre:** Discover schema with `getschema`, explore baselines, identify dimensions, then productize into panels.
---
## Templates
Pre-built templates in `reference/templates/`:
| Template | Use case |
|----------|----------|
| `service-overview.json` | Single service oncall dashboard with Heatmap |
| `service-overview-with-filters.json` | Same with SmartFilter (route/status dropdowns) |
| `api-health.json` | HTTP API with traffic/errors/latency |
| `blank.json` | Minimal skeleton |
**Placeholders:** `{{owner_id}}`, `{{service}}`, `{{dataset}}`
**Usage:**
```bash
USER_ID=$(scripts/get-user-id prod)
scripts/dashboard-from-template service-overview "my-service" "$USER_ID" "my-dataset" ./dashboard.json
scripts/dashboard-validate ./dashboard.json
scripts/dashboard-create prod ./dashboard.json
```
**⚠️ Templates assume field names** (`service`, `status`, `route`, `duration_ms`). Discover your schema first and use `sed` to fix mismatches.
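For instance, if schema discovery shows your dataset uses `latency_ms` and `http_status` instead of the template's `duration_ms` and `status` (those target names are hypothetical), rewrite the file before validating:

```shell
# A stand-in dashboard.json is created here so the example is self-contained;
# in practice you would run sed against the file generated from the template.
printf '%s\n' '{"apl": "where status >= 500 | summarize p95 = percentile(duration_ms, 95)"}' > dashboard.json

# Rewrite template field names to match the discovered schema.
sed -e 's/duration_ms/latency_ms/g' -e 's/status/http_status/g' dashboard.json > dashboard.fixed.json
cat dashboard.fixed.json
```

Then run `dashboard-validate` on the rewritten file as usual.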
---
## Common Pitfalls
| Problem | Cause | Solution |
|---------|-------|----------|
| "unable to find dataset" errors | Dataset name doesn't exist in your org | Check available datasets in Axiom UI |
| "creating dashboards for other users" 403 | Owner ID doesn't match your token | Use `scripts/get-user-id prod` to get your UUID |
| All panels show errors | Field names don't match your schema | Discover schema first, use sed to fix field names |
| Dashboard shows no data | Service filter too restrictive | Remove or adjust `where service == 'x'` filters |
| Queries time out | Missing time filter or too broad | Dashboard inherits time from picker; ad-hoc queries need explicit time filter |
| Wrong org in dashboard URL | Manually constructed URL | **Always use `dashboard-link <deploy> <id>`** — never guess org IDs or base URLs |
---
## Reference
- `reference/chart-config.md` — All chart configuration options (JSON)
- `reference/smartfilter.md` — SmartFilter/FilterBar full configuration
- `reference/chart-cookbook.md` — APL patterns per chart type
- `reference/layout-recipes.md` — Grid layouts and section blueprints
- `reference/splunk-migration.md` — Splunk panel → Axiom mapping
- `reference/design-playbook.md` — Decision-first design principles
- `reference/templates/` — Ready-to-use dashboard JSON files
For APL syntax: https://axiom.co/docs/apl/introduction