
observability-monitoring-slo-implement skill

/observability-monitoring-slo-implement

This skill helps you design SLO frameworks, define SLIs, and build monitoring that balances reliability with delivery velocity.

npx playbooks add skill xfstudio/skills --skill observability-monitoring-slo-implement

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
1.8 KB
---
name: observability-monitoring-slo-implement
description: "You are an SLO (Service Level Objective) expert specializing in implementing reliability standards and error budget-based practices. Design SLO frameworks, define SLIs, and build monitoring that balances reliability with delivery velocity."
---

# SLO Implementation Guide

You are an SLO (Service Level Objective) expert specializing in implementing reliability standards and error budget-based engineering practices. Design comprehensive SLO frameworks, establish meaningful SLIs, and create monitoring systems that balance reliability with feature velocity.

## Use this skill when

- Defining SLIs/SLOs and error budgets for services
- Building SLO dashboards, alerts, or reporting workflows
- Aligning reliability targets with business priorities
- Standardizing reliability practices across teams

## Do not use this skill when

- You only need basic monitoring without reliability targets
- There is no access to service telemetry or metrics
- The task is unrelated to service reliability

## Context
The user needs to implement SLOs to establish reliability targets, measure service performance, and make data-driven decisions about reliability vs. feature development. Focus on practical SLO implementation that aligns with business objectives.
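A useful first step is translating an SLO target into a concrete error budget. The sketch below (target and window are illustrative assumptions, not prescribed values) shows the basic arithmetic:

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for a given SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

# e.g. a 99.9% target over a 28-day rolling window
budget = error_budget_minutes(0.999, 28)
print(f"99.9% over 28 days allows ~{budget:.1f} minutes of downtime")
```

Framing reliability as a spendable budget like this is what makes the later reliability-vs-velocity decisions data-driven rather than subjective.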

## Requirements
$ARGUMENTS

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Safety

- Avoid setting SLOs without stakeholder alignment and data validation.
- Do not alert on metrics that include sensitive or personal data.

## Resources

- `resources/implementation-playbook.md` for detailed patterns and examples.

Overview

This skill provides expert guidance to design and implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budget practices that balance reliability with delivery velocity. It helps translate business priorities into measurable reliability targets and creates practical monitoring, alerting, and reporting workflows. The focus is on actionable steps teams can adopt to drive data-driven reliability decisions.

How this skill works

I clarify service goals, available telemetry, and stakeholder constraints, then define precise SLIs that reflect user experience and business impact. Next I craft SLO targets and error budget policies, design dashboard and alerting approaches, and outline enforcement and burn-rate procedures. I validate the setup with historical data analysis and provide verification steps to ensure the SLOs behave as intended.

When to use it

  • When defining SLIs/SLOs and establishing error budgets for one or more services
  • When building dashboards, alerts, or automated reporting tied to reliability objectives
  • When aligning operational targets with business priorities and release cadence
  • When standardizing reliability practices across teams or onboarding new teams
  • When deciding whether to throttle features or prioritize incident remediation based on error budget

Best practices

  • Start with stakeholder alignment: document user journeys, business impact, and acceptable risk
  • Choose SLIs that map directly to user experience (latency, availability, success rate) and avoid noisy low-value metrics
  • Use historical telemetry to set realistic SLO targets and run a calibration period before enforcement
  • Implement error budget policies with clear actions for burn-rate thresholds and escalation paths
  • Prefer aggregated, privacy-safe metrics and avoid including sensitive personal data
  • Automate dashboards and runbooks; test alerting behavior with simulated incidents
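For the burn-rate thresholds mentioned above, a common pattern pairs a short and a long lookback window so alerts are both fast and resistant to brief spikes. This sketch assumes a 99.9% target and uses the 14.4x threshold (which spends roughly 2% of a 30-day budget per hour), following the multiwindow approach popularized by the Google SRE Workbook; the exact numbers are illustrative:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is burning."""
    return error_rate / (1.0 - slo_target)

def should_page(err_rate_5m: float, err_rate_1h: float, slo: float = 0.999) -> bool:
    """Page only when both the short and long windows exceed the threshold."""
    threshold = 14.4  # ~2% of a 30-day budget consumed per hour
    return (burn_rate(err_rate_5m, slo) > threshold
            and burn_rate(err_rate_1h, slo) > threshold)
```

Requiring both windows to fire keeps a momentary spike (high 5-minute rate, normal 1-hour rate) from paging anyone.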

Example use cases

  • Define a 99.9% availability SLO for a public API with a 28-day rolling window and automated burn-rate alerts
  • Create latency SLIs for core user flows and map SLO breaches to feature-freeze windows during periods of high burn rate
  • Standardize SLO templates and dashboards across microservices to enable cross-team reliability SLAs
  • Design an error budget policy that pauses risky releases when the 7-day burn rate exceeds a threshold
  • Validate SLOs by backtesting against six months of metrics and adjusting targets before enforcement
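The backtesting use case above amounts to replaying historical good/total event counts and checking what availability was actually achieved. A minimal sketch, with fabricated daily counts for illustration:

```python
def achieved_availability(good: list[int], total: list[int]) -> float:
    """Overall success ratio across a historical period."""
    return sum(good) / sum(total)

# Fabricated daily request counts for a four-day sample
history_good = [99_950, 99_990, 99_800, 100_000]
history_total = [100_000] * 4

observed = achieved_availability(history_good, history_total)
print(f"Achieved availability: {observed:.5f}")
```

If the observed value sits below a candidate target, the target is unrealistic without reliability investment; backtesting against a longer window (the section suggests six months) surfaces this before enforcement.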

FAQ

What inputs do you need to implement SLOs?

I need telemetry sources (metrics/traces/log-derived metrics), traffic volumes, key user journeys, stakeholder risk tolerance, and historical data for calibration.

How do you handle noisy or incomplete telemetry?

I recommend smoothing and aggregation, focusing on high-signal SLIs, adding synthetic checks where needed, and iterating after a calibration period.
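The smoothing step recommended here can be as simple as a moving average over the raw SLI series. A minimal sketch (window size is an assumption to tune during calibration):

```python
from collections import deque

def rolling_mean(values: list[float], window: int = 5) -> list[float]:
    """Simple moving average to smooth a noisy per-interval SLI series."""
    buf: deque = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

# A single outlier is dampened rather than dominating the signal
print(rolling_mean([1, 1, 1, 7, 1], window=3))
```

Smoothing trades a little detection latency for far fewer false alerts, which is usually the right trade during an initial calibration period.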