
observability-audit skill


This skill standardizes observability during changes by improving logs, metrics, traces, and audit entries to enhance operability.

npx playbooks add skill velcrafting/codex-skills --skill observability-audit


---
name: observability-audit
description: Ensure logging/metrics/tracing and auditability match the quality bar for changed behavior.
metadata:
  short-description: Observability pass
  layer: backend
  mode: write
  idempotent: true
---

# Skill: backend/observability-audit

## Purpose
Improve and standardize observability so the system is operable after changes:
- logs: structured, queryable, minimal but sufficient
- metrics: counters/timers/gauges for key outcomes
- traces: propagation and spans where available
- audit: records for sensitive actions when required

This skill is a pass over changed behavior, not feature work.

---

## Inputs
- Changed components/endpoints/jobs
- Risk posture (from repo docs/profile if present)
- Observability standards (from `REPO_PROFILE.json` if present)
- Correlation/id propagation standard (if any)

---

## Outputs
- Improved logging:
  - include correlation id
  - include primary identifiers (request id, user id where safe, entity id)
  - include outcome + error codes
- Metrics updates if repo uses them:
  - success/fail counters
  - latency timing
  - queue depth / retries (for jobs)
- Tracing spans or propagation fixes if repo uses tracing
- Audit entries for protected actions if applicable
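
As an illustration of the logging output described above, here is a minimal sketch using only Python's standard `logging` module. The field names (`correlation_id`, `request_id`, `entity_id`, `outcome`, `error_code`) and the `JsonFormatter` class are assumptions made for this example; a repo's own logging standard takes precedence where one exists.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so fields stay queryable."""

    # Hypothetical field names; align them with the repo's logging standard.
    FIELDS = ("correlation_id", "request_id", "entity_id", "outcome", "error_code")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Only emit fields that were supplied via `extra=...`.
        for field in self.FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "orders") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

A caller would then pass identifiers through `extra`, for example `build_logger().info("order.place.end", extra={"correlation_id": cid, "entity_id": order_id, "outcome": "success"})`, keeping payload bodies out of the log line.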

---

## Non-goals
- Implementing business logic
- Adding heavy instrumentation everywhere
- Logging secrets or sensitive payloads

---

## Workflow
1) Identify repo observability standards (profile/docs).
2) Add structured logs at critical boundaries:
   - request start/end
   - job start/end
   - external call start/end (adapter boundary)
3) Ensure errors are observable:
   - log error codes/taxonomy, not just raw stack traces
   - include retry decisions (retrying vs terminal)
4) Add metrics if the repo uses them:
   - count outcomes
   - measure latency
5) Ensure correlation ids propagate:
   - inbound request → domain → adapter → logs
6) Add audit entries where required:
   - authz-protected actions
   - fund movement / order placement / credential updates (example categories)
7) Run validations.
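
Steps 2 to 4 of the workflow can be sketched as a single boundary wrapper. This is one possible shape and not a prescribed implementation: `observed`, `RetryableError`, the error codes, and the in-memory `OUTCOMES` / `LATENCIES_MS` stores are hypothetical stand-ins for whatever logging and metrics clients the repo already uses.

```python
import logging
import time
from collections import Counter
from contextlib import contextmanager

logger = logging.getLogger("observability.sketch")

# In-memory stand-ins for a real metrics client (hypothetical).
OUTCOMES = Counter()
LATENCIES_MS = {}


class RetryableError(Exception):
    """Hypothetical marker for failures the caller is allowed to retry."""


@contextmanager
def observed(operation, correlation_id):
    """Log start/end of a boundary, count the outcome, and record latency."""
    fields = {"operation": operation, "correlation_id": correlation_id}
    logger.info("%s.start", operation, extra=fields)
    started = time.perf_counter()
    outcome, error_code, retry_decision = "success", None, None
    try:
        yield
    except RetryableError:
        outcome, error_code, retry_decision = "failure", "RETRYABLE", "retrying"
        raise
    except Exception:
        outcome, error_code, retry_decision = "failure", "UNEXPECTED", "terminal"
        raise
    finally:
        elapsed_ms = (time.perf_counter() - started) * 1000
        OUTCOMES[(operation, outcome)] += 1
        LATENCIES_MS.setdefault(operation, []).append(elapsed_ms)
        logger.info(
            "%s.end",
            operation,
            extra={
                **fields,
                "outcome": outcome,
                "error_code": error_code,
                "retry_decision": retry_decision,
                "latency_ms": round(elapsed_ms, 2),
            },
        )
```

A request handler, job, or adapter call would wrap its body in `with observed("job.sync", correlation_id): ...`. The exception is always re-raised, so the wrapper adds observability without changing control flow.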

---

## Checks
- Logs answer: what happened, to what, by whom (when safe), and why
- No secrets logged; sensitive values redacted
- Correlation id present on critical flows (where available)
- Metrics/tracing updated when repo supports them
- Changes are minimal and focused on the modified behavior
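
The redaction check above could be backed by a small helper like the following sketch. The deny-list in `SENSITIVE_KEYS` is an assumption for illustration only; the real list should come from the repo's risk posture or logging standard.

```python
REDACTED = "[REDACTED]"

# Hypothetical deny-list; replace with the repo's own sensitive field names.
SENSITIVE_KEYS = frozenset(
    {"password", "token", "authorization", "secret", "card_number"}
)


def redact(fields):
    """Return a copy of `fields` with sensitive values masked.

    Recurses into nested dicts; lists and other containers are passed
    through unchanged in this sketch.
    """
    cleaned = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            cleaned[key] = REDACTED
        elif isinstance(value, dict):
            cleaned[key] = redact(value)
        else:
            cleaned[key] = value
    return cleaned
```

Applying `redact` to the fields just before they reach the logger keeps the original data untouched, since the helper builds a new dict rather than mutating its input.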

---

## Failure modes
- No logging standard → follow conservative structured logging and document assumptions.
- High-cardinality metrics risk → avoid unbounded labels.
- Sensitive data risk → redact or omit; prefer identifiers over payloads.
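
To make the cardinality point concrete, here is a sketch of label normalization that keeps every metric label drawn from a small, fixed set. The function names, the `ALLOWED_OUTCOMES` set, and the route template syntax are assumptions for this example, not part of any particular metrics library.

```python
# Hypothetical closed set of outcome labels.
ALLOWED_OUTCOMES = frozenset({"success", "failure", "timeout"})


def status_class(status_code):
    """Collapse an HTTP status code into its class, e.g. 404 -> '4xx'."""
    return f"{status_code // 100}xx"


def metric_labels(route_template, status_code, outcome):
    """Build labels whose values are bounded by construction.

    `route_template` must be the templated route (such as '/orders/{id}'),
    never the concrete path, and no user or entity ids are used as labels.
    """
    return {
        "route": route_template,
        "status": status_class(status_code),
        "outcome": outcome if outcome in ALLOWED_OUTCOMES else "other",
    }
```

Identifiers such as user id or entity id still belong in logs and traces, where high cardinality is expected; only the metric labels are kept bounded.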

---

## Telemetry
Log:
- skill: `backend/observability-audit`
- areas: `logs | metrics | tracing | audit` (subset)
- files_touched
- outcome: `success | partial | blocked`
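
A telemetry record with the fields listed above might be assembled as in this sketch. The source does not say whether `files_touched` is a count or a list, so a list of paths is assumed here; the `telemetry_record` helper itself is hypothetical.

```python
SKILL = "backend/observability-audit"
AREAS = ("logs", "metrics", "tracing", "audit")
VALID_OUTCOMES = ("success", "partial", "blocked")


def telemetry_record(areas, files_touched, outcome):
    """Validate and build the skill's telemetry record as a plain dict."""
    unknown = set(areas) - set(AREAS)
    if unknown:
        raise ValueError(f"unknown areas: {sorted(unknown)}")
    if outcome not in VALID_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return {
        "skill": SKILL,
        # Keep areas in a stable order regardless of input order.
        "areas": [area for area in AREAS if area in areas],
        # Assumption: files_touched is a list of repo-relative paths.
        "files_touched": sorted(files_touched),
        "outcome": outcome,
    }
```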

Overview

This skill ensures logging, metrics, tracing, and auditability meet an operational quality bar for modified behavior. It favors small, targeted changes so systems remain observable and operable after code changes. The goal is deterministic, minimally invasive instrumentation that supports debugging, alerting, and compliance without adding business logic.

How this skill works

The skill inspects changed components, repository observability standards, and correlation/id propagation rules to identify gaps. It adds or adjusts structured logs at critical boundaries, updates or adds metrics and tracing spans where supported, and inserts audit records for protected actions. Finally, it runs validations to verify correlation IDs, redaction, and metric cardinality constraints.
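
The correlation id propagation mentioned above can be sketched with Python's standard `contextvars` and `logging` modules. The header name `X-Correlation-ID` and the helper names are assumptions for illustration; a repo's propagation standard, if it has one, should define the actual header and id format. Header lookup here is case-sensitive because a plain dict is assumed.

```python
import contextvars
import logging
import uuid

# Holds the id for the current request/task; "-" means "not bound yet".
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")

# Assumed header name; use whatever the repo standard specifies.
HEADER = "X-Correlation-ID"


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation id to every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id_var.get()
        return True  # never drop records, only annotate them


def bind_correlation_id(headers):
    """Reuse the inbound id if present, otherwise mint a new one."""
    cid = headers.get(HEADER) or uuid.uuid4().hex
    correlation_id_var.set(cid)
    return cid


def outbound_headers():
    """Headers an adapter would add so the id reaches downstream calls."""
    return {HEADER: correlation_id_var.get()}
```

With the filter attached to a handler (`handler.addFilter(CorrelationIdFilter())`), domain and adapter code can log without passing the id explicitly, which covers the inbound request → domain → adapter → logs path.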

When to use it

- After code changes that affect endpoints, jobs, or external adapters
- When adding or modifying behavior that impacts user or financial actions
- When observability standards in the repo are incomplete or ambiguous
- Prior to release or deploy to ensure operability
- When onboarding a new integration or external dependency

Best practices

- Keep logs structured and minimal: include correlation id, primary safe identifiers, outcome and error codes
- Avoid logging secrets or raw sensitive payloads; redact or omit high-risk fields
- Add counters and latency metrics for key outcomes; avoid unbounded label cardinality
- Instrument request/job boundaries and adapter calls to capture end-to-end flow
- Prefer adding propagation fixes over wide instrumentation; focus on changed behavior and critical paths

Example use cases

- A modified API handler needs correlation id propagation, structured start/end logs, and a success/failure counter
- A background job change requires queue-depth metric, retry counters, and latency timing
- An external payment adapter update needs tracing spans across the adapter and audit entries for fund movements
- A permission change must record audit entries for protected actions and include outcome codes in logs
- A refactor broke header propagation; add propagation fixes and validate logs show consistent correlation ids
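
For the audit-entry use cases above, a record might look like the following sketch. The `AuditEntry` fields and helper names are assumptions chosen for illustration; real audit schemas are usually dictated by compliance requirements, and where entries are stored is outside the scope of this example.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    """Immutable record of a protected action (hypothetical schema)."""

    action: str          # e.g. "funds.transfer" or "credentials.update"
    actor_id: str        # who performed it (identifier, not a payload)
    entity_id: str       # what it was performed on
    outcome: str         # e.g. "success" or "denied"
    correlation_id: str  # ties the entry back to logs and traces
    occurred_at: str     # ISO 8601 timestamp in UTC


def audit_entry(action, actor_id, entity_id, outcome, correlation_id, now=None):
    """Build an entry; `now` is injectable so tests can fix the timestamp."""
    moment = now or datetime.now(timezone.utc)
    return AuditEntry(
        action=action,
        actor_id=actor_id,
        entity_id=entity_id,
        outcome=outcome,
        correlation_id=correlation_id,
        occurred_at=moment.isoformat(),
    )


def to_json_line(entry):
    """Serialize an entry as one JSON line for an append-only sink."""
    return json.dumps(asdict(entry), sort_keys=True)
```

Recording denied attempts as well as successful ones, with the same correlation id used in logs, lets an investigator move between the audit trail and the operational logs.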

FAQ

What does this skill not do?

It does not implement business logic or add heavy instrumentation across the entire codebase. It focuses on targeted observability for changed behavior.

How are sensitive values handled?

Sensitive fields must be redacted or omitted. The skill prefers identifiers over payloads and documents assumptions if a logging standard is absent.