This skill helps you design and optimize Grafana dashboards, data sources, and alerts for reliable observability and production operations.
---
name: grafana-expert
version: 1.0.0
description: Expert-level Grafana dashboards, visualization, data sources, alerting, and production operations
category: devops
author: PCL Team
license: Apache-2.0
tags:
- grafana
- dashboards
- visualization
- monitoring
- observability
- alerting
allowed-tools:
- Read
- Write
- Edit
- Bash(kubectl:*, grafana-cli:*)
- Glob
- Grep
requirements:
  grafana: ">=10.0"
  kubernetes: ">=1.28"
---
# Grafana Expert
You are an expert in Grafana with deep knowledge of dashboard creation, panel types, data sources, templating, alerting, and production operations. You design and manage comprehensive visualization and observability systems following Grafana best practices.
## Core Expertise
### Grafana Architecture
**Components:**
```
Grafana Stack:
├── Grafana Server (UI/API)
├── Data Sources (Prometheus, Loki, etc.)
├── Dashboards (visualizations)
├── Alerts (alerting engine)
├── Plugins (extensions)
└── Users & Teams (RBAC)
```
### Installation on Kubernetes
**Install with Helm:**
```bash
# Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
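# Install Grafana
# GRAFANA_ADMIN_PASSWORD is a placeholder variable name; export it first instead of hardcoding a password
helm install grafana grafana/grafana \
  --namespace monitoring \
  --create-namespace \
  --set persistence.enabled=true \
  --set persistence.size=10Gi \
  --set adminPassword="${GRAFANA_ADMIN_PASSWORD:?export GRAFANA_ADMIN_PASSWORD first}" \
  --set ingress.enabled=true \
  --set 'ingress.hosts[0]=grafana.example.com'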
# Get admin password
kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode
```
**Grafana ConfigMap:**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: monitoring
data:
  grafana.ini: |
    [server]
    domain = grafana.example.com
    root_url = https://grafana.example.com
    [auth]
    disable_login_form = false
    oauth_auto_login = false
    [auth.anonymous]
    # Anonymous access lets unauthenticated visitors view dashboards; disable it if the instance is publicly reachable
    enabled = true
    org_role = Viewer
    [auth.github]
    enabled = true
    allow_sign_up = true
    client_id = YOUR_GITHUB_CLIENT_ID
    client_secret = YOUR_GITHUB_CLIENT_SECRET
    scopes = user:email,read:org
    auth_url = https://github.com/login/oauth/authorize
    token_url = https://github.com/login/oauth/access_token
    api_url = https://api.github.com/user
    allowed_organizations = myorg
    [security]
    admin_user = admin
    admin_password = $__env{GF_SECURITY_ADMIN_PASSWORD}
    cookie_secure = true
    cookie_samesite = strict
    [users]
    allow_sign_up = false
    auto_assign_org = true
    auto_assign_org_role = Viewer
    [dashboards]
    default_home_dashboard_path = /var/lib/grafana/dashboards/home.json
    [alerting]
    # Legacy alerting must stay disabled when unified alerting is enabled; it was removed entirely in Grafana 11
    enabled = false
    [unified_alerting]
    enabled = true
```
### Data Sources
**Prometheus Data Source (JSON):**
```json
{
"name": "Prometheus",
"type": "prometheus",
"access": "proxy",
"url": "http://prometheus-server.monitoring.svc.cluster.local:9090",
"isDefault": true,
"jsonData": {
"httpMethod": "POST",
"timeInterval": "30s",
"queryTimeout": "60s"
}
}
```
**Loki Data Source:**
```json
{
"name": "Loki",
"type": "loki",
"access": "proxy",
"url": "http://loki.monitoring.svc.cluster.local:3100",
"jsonData": {
"maxLines": 1000,
"derivedFields": [
{
"datasourceUid": "jaeger",
"matcherRegex": "traceID=(\\w+)",
"name": "TraceID",
"url": "${__value.raw}"
}
]
}
}
```
**Data Source as ConfigMap:**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-server:9090
        isDefault: true
        editable: true
        jsonData:
          timeInterval: 30s
          queryTimeout: 60s
      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
        editable: true
        jsonData:
          maxLines: 1000
      - name: Tempo
        type: tempo
        access: proxy
        url: http://tempo:3100
        editable: true
```
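When the same data sources are provisioned in several environments, the manifest can be generated from a small table of per-environment URLs instead of being edited by hand. The following is a minimal sketch under stated assumptions: the environment names and service URLs in `ENVIRONMENTS` are hypothetical, and the output is JSON, which is also valid YAML flow syntax, so it should be loadable as `datasources.yaml` (worth verifying against your Grafana version).

```python
import json

# Hypothetical per-environment endpoints; replace with your own service URLs.
ENVIRONMENTS = {
    "staging": {
        "prometheus": "http://prometheus-server.staging.svc.cluster.local:9090",
        "loki": "http://loki.staging.svc.cluster.local:3100",
    },
    "production": {
        "prometheus": "http://prometheus-server.monitoring.svc.cluster.local:9090",
        "loki": "http://loki.monitoring.svc.cluster.local:3100",
    },
}


def build_datasources(env: str) -> dict:
    """Return a datasource provisioning document for one environment."""
    urls = ENVIRONMENTS[env]
    return {
        "apiVersion": 1,
        "datasources": [
            {
                "name": "Prometheus",
                "type": "prometheus",
                "access": "proxy",
                "url": urls["prometheus"],
                "isDefault": True,
                "jsonData": {"timeInterval": "30s", "queryTimeout": "60s"},
            },
            {
                "name": "Loki",
                "type": "loki",
                "access": "proxy",
                "url": urls["loki"],
                "jsonData": {"maxLines": 1000},
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_datasources("staging"), indent=2))
```

Keeping the per-environment differences in one table makes it easier to review what actually varies between staging and production.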
### Dashboard JSON
**Complete Dashboard Example:**
```json
{
"dashboard": {
"title": "Application Performance Monitoring",
"tags": ["production", "api"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-6h",
"to": "now"
},
"refresh": "30s",
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_pod_info, namespace)",
"refresh": 1,
"multi": false,
"includeAll": false
},
{
"name": "pod",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
"refresh": 2,
"multi": true,
"includeAll": true
}
]
},
"panels": [
{
"id": 1,
"type": "stat",
"title": "Request Rate",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{namespace=\"$namespace\"}[5m]))",
"legendFormat": "RPS"
}
],
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
},
"orientation": "auto",
"textMode": "auto",
"colorMode": "value",
"graphMode": "area"
},
"fieldConfig": {
"defaults": {
"unit": "reqps",
"decimals": 2,
"thresholds": {
"mode": "absolute",
"steps": [
{"value": null, "color": "green"},
{"value": 100, "color": "yellow"},
{"value": 500, "color": "red"}
]
}
}
}
},
{
"id": 2,
"type": "timeseries",
"title": "Request Rate Over Time",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
"targets": [
{
"expr": "sum(rate(http_requests_total{namespace=\"$namespace\"}[5m])) by (pod)",
"legendFormat": "{{pod}}"
}
],
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom",
"calcs": ["lastNotNull", "mean", "max"]
}
},
"fieldConfig": {
"defaults": {
"unit": "reqps",
"custom": {
"drawStyle": "line",
"lineWidth": 2,
"fillOpacity": 10,
"axisLabel": "Requests/sec"
}
}
}
},
{
"id": 3,
"type": "timeseries",
"title": "Latency (P95)",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le, pod))",
"legendFormat": "{{pod}}"
}
],
"options": {
"tooltip": {
"mode": "multi"
},
"legend": {
"displayMode": "table",
"placement": "bottom",
"calcs": ["last", "mean", "max"]
}
},
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"fillOpacity": 10,
"spanNulls": true
},
"thresholds": {
"mode": "absolute",
"steps": [
{"value": null, "color": "green"},
{"value": 0.5, "color": "yellow"},
{"value": 1, "color": "red"}
]
}
}
}
},
{
"id": 4,
"type": "heatmap",
"title": "Request Duration Heatmap",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 12},
"targets": [
{
"expr": "sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le)",
"format": "heatmap",
"legendFormat": "{{le}}"
}
],
"options": {
"calculate": false,
"cellGap": 2,
"color": {
"mode": "scheme",
"scheme": "Spectral"
},
"yAxis": {
"decimals": 2,
"unit": "s"
}
}
},
{
"id": 5,
"type": "gauge",
"title": "Error Rate",
"gridPos": {"h": 8, "w": 6, "x": 12, "y": 12},
"targets": [
{
"expr": "sum(rate(http_requests_total{namespace=\"$namespace\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{namespace=\"$namespace\"}[5m])) * 100"
}
],
"options": {
"showThresholdLabels": true,
"showThresholdMarkers": true
},
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{"value": null, "color": "green"},
{"value": 1, "color": "yellow"},
{"value": 5, "color": "red"}
]
}
}
}
},
{
"id": 6,
"type": "table",
"title": "Top Endpoints by Request Count",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 20},
"targets": [
{
"expr": "topk(10, sum(rate(http_requests_total{namespace=\"$namespace\"}[1h])) by (endpoint))",
"format": "table",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true
},
"renameByName": {
"endpoint": "Endpoint",
"Value": "Requests/sec"
}
}
}
],
"options": {
"showHeader": true,
"sortBy": [
{
"displayName": "Requests/sec",
"desc": true
}
]
}
},
{
"id": 7,
"type": "logs",
"title": "Application Logs",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 20},
"datasource": "Loki",
"targets": [
{
"expr": "{namespace=\"$namespace\", pod=~\"$pod\"} |= \"error\" or \"ERROR\"",
"refId": "A"
}
],
"options": {
"showTime": true,
"showLabels": false,
"showCommonLabels": true,
"wrapLogMessage": false,
"prettifyLogMessage": false,
"enableLogDetails": true,
"dedupStrategy": "none",
"sortOrder": "Descending"
}
}
]
}
}
```
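Hand-edited `gridPos` values are easy to get wrong. The sketch below checks a panel list for duplicate ids, panels that run past Grafana's 24-column grid, and overlapping rectangles. It is an illustrative helper, not part of Grafana; the function names are made up for this example.

```python
from itertools import combinations

GRID_COLUMNS = 24  # Grafana dashboards use a 24-column grid


def overlaps(a: dict, b: dict) -> bool:
    """True if two gridPos rectangles share any area."""
    return (
        a["x"] < b["x"] + b["w"]
        and b["x"] < a["x"] + a["w"]
        and a["y"] < b["y"] + b["h"]
        and b["y"] < a["y"] + a["h"]
    )


def check_layout(panels: list) -> list:
    """Return a list of human-readable layout problems (empty if none)."""
    problems = []
    seen_ids = set()
    for panel in panels:
        if panel["id"] in seen_ids:
            problems.append(f"duplicate panel id {panel['id']}")
        seen_ids.add(panel["id"])
        pos = panel["gridPos"]
        if pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f"panel {panel['id']} exceeds the {GRID_COLUMNS}-column grid")
    for a, b in combinations(panels, 2):
        if overlaps(a["gridPos"], b["gridPos"]):
            problems.append(f"panels {a['id']} and {b['id']} overlap")
    return problems


if __name__ == "__main__":
    # gridPos values of the first three panels in the dashboard above
    example = [
        {"id": 1, "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0}},
        {"id": 2, "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4}},
        {"id": 3, "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4}},
    ]
    print(check_layout(example))
```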
### Panel Types
**Time Series Panel:**
```json
{
"type": "timeseries",
"title": "CPU Usage",
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"$namespace\"}[5m])) by (pod) * 100"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"barAlignment": 0,
"fillOpacity": 10,
"gradientMode": "none",
"spanNulls": false,
"showPoints": "never",
"pointSize": 5,
"stacking": {
"mode": "none",
"group": "A"
}
}
}
}
}
```
**Stat Panel:**
```json
{
"type": "stat",
"title": "Total Requests",
"targets": [
{
"expr": "sum(increase(http_requests_total{namespace=\"$namespace\"}[$__range]))"
}
],
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
},
"graphMode": "area",
"colorMode": "value",
"textMode": "auto"
},
"fieldConfig": {
"defaults": {
"unit": "short",
"decimals": 0
}
}
}
```
**Gauge Panel:**
```json
{
"type": "gauge",
"title": "Memory Usage",
"targets": [
{
"expr": "sum(container_memory_working_set_bytes{namespace=\"$namespace\"}) / sum(container_spec_memory_limit_bytes{namespace=\"$namespace\"}) * 100"
}
],
"options": {
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{"value": null, "color": "green"},
{"value": 70, "color": "yellow"},
{"value": 85, "color": "red"}
]
}
}
}
}
```
**Bar Gauge:**
```json
{
"type": "bargauge",
"title": "Pod CPU by Namespace",
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace) * 100"
}
],
"options": {
"displayMode": "gradient",
"orientation": "horizontal",
"showUnfilled": true
},
"fieldConfig": {
"defaults": {
"unit": "percent"
}
}
}
```
### Variables (Templating)
**Query Variable:**
```json
{
"name": "namespace",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_pod_info, namespace)",
"regex": "",
"refresh": 1,
"multi": false,
"includeAll": false,
"allValue": ".*",
"sort": 1
}
```
**Custom Variable:**
```json
{
"name": "environment",
"type": "custom",
"query": "production,staging,development",
"multi": false,
"includeAll": false
}
```
**Interval Variable:**
```json
{
"name": "interval",
"type": "interval",
"query": "1m,5m,10m,30m,1h",
"auto": true,
"auto_count": 30,
"auto_min": "10s"
}
```
**Chained Variables:**
```json
{
"name": "pod",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
"refresh": 2,
"multi": true,
"includeAll": true
}
```
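A chained variable depends on every variable its query references, and the parent has to resolve first. The sketch below extracts those dependencies from a list of variable definitions. It only recognizes the `$name`, `${name}`, and `${name:format}` forms (the older `[[name]]` syntax is ignored), and the helper name is hypothetical.

```python
import re

# Matches $name, ${name} and ${name:format}
VAR_REF = re.compile(r"\$(?:\{(\w+)(?::[^}]*)?\}|(\w+))")


def variable_dependencies(variables: list) -> dict:
    """Map each variable name to the sorted list of variables its query references."""
    names = {v["name"] for v in variables}
    deps = {}
    for v in variables:
        query = str(v.get("query", ""))
        refs = {braced or bare for braced, bare in VAR_REF.findall(query)}
        # Keep only references to variables defined on this dashboard,
        # which drops built-ins such as $__rate_interval.
        deps[v["name"]] = sorted((refs & names) - {v["name"]})
    return deps


if __name__ == "__main__":
    variables = [
        {"name": "namespace", "query": "label_values(kube_pod_info, namespace)"},
        {"name": "pod", "query": 'label_values(kube_pod_info{namespace="$namespace"}, pod)'},
    ]
    print(variable_dependencies(variables))
```

A variable with dependencies usually needs `"refresh": 2` (on time range change) or at least `1` (on dashboard load) so that it re-queries when its parent changes.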
### Alerting
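**Alert Rule (unified alerting, provisioning API payload):**
Grafana 10+ evaluates alerts through unified alerting, which replaces the legacy per-panel `alert` block and its `notificationChannels`. Alert queries cannot reference dashboard variables such as `$namespace`, so the rule below groups by namespace and produces one alert instance per namespace. The `folderUID` and `ruleGroup` values are placeholders; the evaluation interval (formerly `frequency`) is set on the rule group.
```json
{
"title": "High CPU Usage",
"ruleGroup": "platform",
"folderUID": "platform-alerts",
"orgID": 1,
"condition": "C",
"for": "5m",
"noDataState": "NoData",
"execErrState": "Alerting",
"labels": {
"severity": "warning",
"team": "platform"
},
"annotations": {
"summary": "CPU usage is above 80% for namespace {{ $labels.namespace }}"
},
"data": [
{
"refId": "A",
"datasourceUid": "prometheus",
"relativeTimeRange": {"from": 600, "to": 0},
"model": {
"expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace) * 100",
"refId": "A"
}
},
{
"refId": "B",
"datasourceUid": "__expr__",
"model": {
"type": "reduce",
"expression": "A",
"reducer": "last",
"refId": "B"
}
},
{
"refId": "C",
"datasourceUid": "__expr__",
"model": {
"type": "threshold",
"expression": "B",
"conditions": [
{"evaluator": {"type": "gt", "params": [80]}}
],
"refId": "C"
}
}
]
}
```
**Contact Point (Slack):**
Rules reach a contact point through notification policies that match rule labels, for example `team=platform`.
```json
{
"name": "Slack Alerts",
"type": "slack",
"uid": "slack-channel",
"settings": {
"url": "https://hooks.slack.com/services/XXX/YYY/ZZZ",
"recipient": "#alerts",
"mentionUsers": "platform-team",
"mentionChannel": "here"
}
}
```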
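The `for` setting of `5m` means the condition has to hold continuously for five minutes before the alert fires; a single breach only moves the rule into a pending state. The sketch below is a simplified model of that behavior, assuming one evaluation per minute and ignoring the no-data and error states that Grafana also tracks. The function is illustrative and not a Grafana API.

```python
def alert_states(values: list, threshold: float, for_evals: int) -> list:
    """Return the rule state after each evaluation.

    for_evals is the pending period expressed in evaluation intervals,
    e.g. 5 for "for: 5m" with a 1m evaluation interval.
    """
    states = []
    consecutive_breaches = 0
    for value in values:
        if value > threshold:
            consecutive_breaches += 1
            # The first breach starts the pending period; the alert fires once
            # the condition has held for the full period after that first breach.
            if consecutive_breaches > for_evals:
                states.append("Alerting")
            else:
                states.append("Pending")
        else:
            consecutive_breaches = 0  # any recovery resets the pending period
            states.append("Normal")
    return states


if __name__ == "__main__":
    cpu = [50, 90, 90, 90, 90, 90, 90, 50]
    for value, state in zip(cpu, alert_states(cpu, threshold=80, for_evals=5)):
        print(value, state)
```

A longer pending period trades detection speed for fewer notifications on short spikes.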
### Dashboard Provisioning
**Dashboard Provider ConfigMap:**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-provider
  namespace: monitoring
data:
  dashboards.yaml: |
    apiVersion: 1
    providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        updateIntervalSeconds: 10
        allowUiUpdates: true
        options:
          path: /var/lib/grafana/dashboards
      - name: 'kubernetes'
        orgId: 1
        folder: 'Kubernetes'
        type: file
        disableDeletion: true
        updateIntervalSeconds: 10
        allowUiUpdates: false
        options:
          path: /var/lib/grafana/dashboards/kubernetes
```
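The provider above only points Grafana at a directory; the dashboard files still have to be mounted there, typically from a second ConfigMap. The sketch below builds such a manifest from in-memory dashboard models. The ConfigMap name `grafana-dashboards` is an assumption, and the helper unwraps API-style payloads because file provisioning reads the bare dashboard model rather than the `{"dashboard": ...}` wrapper used by the HTTP API.

```python
import json


def dashboard_configmap(name: str, namespace: str, dashboards: dict) -> dict:
    """Build a ConfigMap manifest from a mapping of filename -> dashboard dict."""
    data = {}
    for filename, model in dashboards.items():
        # Unwrap {"dashboard": {...}} payloads; bare models pass through unchanged.
        model = model.get("dashboard", model)
        data[filename] = json.dumps(model, indent=2, sort_keys=True)
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": data,
    }


if __name__ == "__main__":
    manifest = dashboard_configmap(
        "grafana-dashboards",  # hypothetical ConfigMap name
        "monitoring",
        {"home.json": {"dashboard": {"title": "Home", "panels": []}}},
    )
    # kubectl accepts JSON manifests, e.g.: python build.py | kubectl apply -f -
    print(json.dumps(manifest, indent=2))
```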
## Best Practices
### 1. Use Template Variables
```json
// Query with variables
{
"expr": "sum(rate(http_requests_total{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval])) by (pod)"
}
```
### 2. Set Appropriate Refresh Rates
```json
// Dashboard refresh
{
"refresh": "30s" // Production
// "refresh": "1m" // Development
}
```
### 3. Use $__rate_interval
```promql
# Better than fixed interval
rate(http_requests_total[$__rate_interval])
```
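Grafana's Prometheus data source documentation defines `$__rate_interval` as the larger of `$__interval` plus the scrape interval and four times the scrape interval, which guarantees that the range always spans enough samples for `rate()`. The arithmetic can be sketched as follows; the function name is illustrative, and the scrape interval corresponds to the data source's `timeInterval` setting.

```python
def rate_interval(interval_s: float, scrape_interval_s: float) -> float:
    """Approximate $__rate_interval in seconds.

    interval_s: the dashboard's $__interval for the current zoom level
    scrape_interval_s: the data source's configured scrape interval
    """
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


if __name__ == "__main__":
    # Zoomed in: $__interval is small, so the 4x scrape interval floor applies.
    print(rate_interval(15, 30))
    # Zoomed out: $__interval dominates.
    print(rate_interval(300, 30))
```

With a fixed `[5m]` window, zooming out far enough makes `$__interval` exceed the window and samples between steps are skipped; zooming in on a `[1m]` window with a 30s scrape interval leaves too few samples per window.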
### 4. Organize with Folders
```
Dashboards/
├── Kubernetes/
│   ├── Cluster Overview
│   └── Pod Monitoring
├── Applications/
│   ├── API Performance
│   └── Database Metrics
└── Infrastructure/
    ├── Node Metrics
    └── Network Traffic
```
### 5. Use Annotations
```json
{
"annotations": {
"list": [
{
"datasource": "Prometheus",
"enable": true,
"expr": "ALERTS{alertstate=\"firing\"}",
"iconColor": "red",
"name": "Alerts",
"tagKeys": "alertname,severity"
}
]
}
}
```
### 6. Color Thresholds
```json
{
"thresholds": {
"mode": "absolute",
"steps": [
{"value": null, "color": "green"},
{"value": 70, "color": "yellow"},
{"value": 90, "color": "red"}
]
}
}
```
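In absolute mode, each step applies from its `value` upward, and the first step (with `value: null`) acts as the base color for everything below the next step. A minimal sketch of that lookup, assuming the steps are sorted in ascending order as in the JSON above (the helper is illustrative, not Grafana code):

```python
def threshold_color(value: float, steps: list) -> str:
    """Return the color of the highest step whose value is <= the given value."""
    color = steps[0]["color"]  # base step, value is None (JSON null)
    for step in steps:
        if step["value"] is None or value >= step["value"]:
            color = step["color"]
    return color


if __name__ == "__main__":
    steps = [
        {"value": None, "color": "green"},
        {"value": 70, "color": "yellow"},
        {"value": 90, "color": "red"},
    ]
    for reading in (50, 70, 95):
        print(reading, threshold_color(reading, steps))
```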
### 7. Dashboard Links
```json
{
"links": [
{
"title": "Related Dashboard",
"url": "/d/xyz/other-dashboard?var-namespace=$namespace",
"type": "link",
"icon": "dashboard"
}
]
}
```
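Dashboard links pass variable values through `var-<name>` query parameters, and a multi-value variable repeats the parameter once per value. The helper below is a small sketch for building such URLs outside Grafana (for example in runbooks or chat notifications); the function name and the `uid`/`slug` arguments are placeholders for this example.

```python
from urllib.parse import urlencode


def dashboard_url(uid, slug, variables, time_from=None, time_to=None):
    """Build a relative dashboard URL with var-* and time range parameters."""
    params = []
    for name, value in variables.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        # Multi-value variables repeat the same var-<name> key.
        params.extend((f"var-{name}", v) for v in values)
    if time_from:
        params.append(("from", time_from))
    if time_to:
        params.append(("to", time_to))
    query = urlencode(params)
    return f"/d/{uid}/{slug}" + (f"?{query}" if query else "")


if __name__ == "__main__":
    print(dashboard_url(
        "xyz",
        "other-dashboard",
        {"namespace": "production", "pod": ["api-1", "api-2"]},
        time_from="now-6h",
        time_to="now",
    ))
```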
## Anti-Patterns
**1. Too Many Panels:**
```
# BAD: 50+ panels
# GOOD: 10-15 focused panels per dashboard
```
**2. No Variables:**
```json
// BAD: Hardcoded namespace
{
"expr": "sum(rate(http_requests_total{namespace=\"production\"}[5m]))"
}
// GOOD: Use variables
{
"expr": "sum(rate(http_requests_total{namespace=\"$namespace\"}[5m]))"
}
```
**3. Short Refresh Intervals:**
```json
// BAD: Too frequent
"refresh": "5s"
// GOOD: Reasonable rate
"refresh": "30s"
```
**4. No Units:**
```json
// BAD: No unit, so values render as bare numbers
{
"decimals": 2
}
// GOOD: Always specify units
{
"unit": "bytes",
"decimals": 2
}
```
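The anti-patterns above are mechanical enough to check in CI before a dashboard is merged. The following is a rough lint sketch, not an official tool: the limits mirror the numbers used in this section, the hardcoded-namespace check is a simple regular expression over `expr` fields, and the panel types exempted from the unit check are an assumption.

```python
import re

MAX_PANELS = 15           # upper end of the "10-15 focused panels" guideline
MIN_REFRESH_SECONDS = 30  # the "reasonable rate" used above
SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
NO_UNIT_NEEDED = {"logs", "text", "row"}  # assumption: these panels show no numeric values


def refresh_seconds(refresh):
    """Convert a refresh string such as '30s' to seconds, or None if unset or unparseable."""
    match = re.fullmatch(r"(\d+)([smhd])", refresh or "")
    return int(match.group(1)) * SECONDS[match.group(2)] if match else None


def lint_dashboard(dashboard: dict) -> list:
    """Return a list of findings for the anti-patterns described above."""
    findings = []
    panels = dashboard.get("panels", [])
    if len(panels) > MAX_PANELS:
        findings.append(f"{len(panels)} panels, more than {MAX_PANELS}")
    seconds = refresh_seconds(dashboard.get("refresh"))
    if seconds is not None and seconds < MIN_REFRESH_SECONDS:
        findings.append(f"refresh {dashboard['refresh']} is shorter than {MIN_REFRESH_SECONDS}s")
    for panel in panels:
        title = panel.get("title", panel.get("id"))
        unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
        if panel.get("type") not in NO_UNIT_NEEDED and not unit:
            findings.append(f"{title}: no unit set")
        for target in panel.get("targets", []):
            # A literal value right after namespace=" means no template variable is used.
            if re.search(r'namespace="[^$"]', target.get("expr", "")):
                findings.append(f"{title}: hardcoded namespace")
    return findings


if __name__ == "__main__":
    bad = {
        "refresh": "5s",
        "panels": [{
            "title": "RPS",
            "type": "stat",
            "targets": [{"expr": 'sum(rate(http_requests_total{namespace="production"}[5m]))'}],
        }],
    }
    print(lint_dashboard(bad))
```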
## Approach
When creating Grafana dashboards:
1. **Start with Goals**: Define what you want to monitor
2. **Use Variables**: Make dashboards reusable
3. **Golden Signals**: Latency, Traffic, Errors, Saturation
4. **Organize**: Use folders and consistent naming
5. **Test**: Verify queries and thresholds
6. **Document**: Add descriptions and links
7. **Version Control**: Store JSON in Git
8. **Provision**: Use ConfigMaps for automation
Always design dashboards that are clear, actionable, and maintainable.
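For the version-control step, exported dashboards diff more cleanly if instance-specific fields are removed and keys are written in a stable order. The sketch below shows one possible normalization; which fields to strip is a judgment call, and this example only handles the numeric `id` (specific to one Grafana instance) and `version` (incremented on every save).

```python
import json


def normalize_for_git(dashboard: dict) -> str:
    """Return a stable JSON string suitable for committing to Git."""
    # Accept either a bare model or an API-style {"dashboard": {...}} payload.
    model = dict(dashboard.get("dashboard", dashboard))
    model["id"] = None          # let the target instance assign its own id
    model.pop("version", None)  # avoid a diff on every save
    return json.dumps(model, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    exported = {"id": 12, "version": 7, "uid": "abc", "title": "API Performance"}
    print(normalize_for_git(exported), end="")
```

Keeping the `uid` intact matters, since links such as `/d/<uid>/...` and provisioning updates rely on it.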
## Resources
- Grafana Documentation: https://grafana.com/docs/
- Dashboard Best Practices: https://grafana.com/docs/grafana/latest/best-practices/
- Community Dashboards: https://grafana.com/grafana/dashboards/
- Grafana Plugins: https://grafana.com/grafana/plugins/
This skill is an expert guide for designing, deploying, and operating Grafana-based observability systems. It covers dashboard design, panel selection, data source configuration, templating, alerting, and Kubernetes installation patterns. The focus is practical: reliable dashboards, scalable data sources, and production-ready alerting.
I inspect Grafana architecture, recommended Helm installation on Kubernetes, and config patterns for grafana.ini and ConfigMaps. I map common data sources (Prometheus, Loki, Tempo) into reproducible datasource manifests and provide dashboard JSON patterns for panels, templating variables, and alert rules. I emphasize production operations: persistence, RBAC, unified alerting, and sane defaults for refresh, time ranges, and thresholds.
**How should I provision datasources for multiple environments?**
Provision datasources declaratively via ConfigMaps or the provisioning API and use environment-specific values in Helm or Kustomize overlays.
**What panel types work best for latency and errors?**
Use timeseries panels with histogram_quantile for P95 latency, heatmaps for distribution, and gauge or stat panels for error rate percentages.