
This skill helps you diagnose issues on GCP through structured logging, Cloud Logging queries, and tracing, accelerating production debugging.

npx playbooks add skill fcakyon/claude-codex-settings --skill gcloud-usage

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
3.4 KB
---
name: gcloud-usage
description: This skill should be used when user asks about "GCloud logs", "Cloud Logging queries", "Google Cloud metrics", "GCP observability", "trace analysis", or "debugging production issues on GCP".
---

# GCP Observability Best Practices

## Structured Logging

### JSON Log Format

Use structured JSON logging for better queryability:

```json
{
  "severity": "ERROR",
  "message": "Payment failed",
  "httpRequest": { "requestMethod": "POST", "requestUrl": "/api/payment" },
  "labels": { "user_id": "123", "transaction_id": "abc" },
  "timestamp": "2025-01-15T10:30:00Z"
}
```
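On runtimes whose logging agent parses JSON lines on stdout (Cloud Run, GKE, App Engine), a plain print of a JSON object is enough to produce entries like the one above. A minimal Python sketch; the helper name and example fields are illustrative, not part of any Google library:

```python
import json
import sys
from datetime import datetime, timezone


def log_structured(severity: str, message: str, **fields) -> None:
    """Emit one JSON log line to stdout for the logging agent to parse."""
    entry = {
        "severity": severity,
        "message": message,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    print(json.dumps(entry), file=sys.stdout, flush=True)


# Hypothetical call site matching the payment-failure example above.
log_structured(
    "ERROR",
    "Payment failed",
    httpRequest={"requestMethod": "POST", "requestUrl": "/api/payment"},
    labels={"user_id": "123", "transaction_id": "abc"},
)
```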

### Severity Levels

Use appropriate severity for filtering:

- **DEBUG:** Detailed diagnostic info
- **INFO:** Normal operations, milestones
- **NOTICE:** Normal but significant events
- **WARNING:** Potential issues, degraded performance
- **ERROR:** Failures that don't stop the service
- **CRITICAL:** Failures requiring immediate action
- **ALERT:** Person must take action immediately
- **EMERGENCY:** System is unusable
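If a service logs through Python's stdlib `logging`, the standard levels only cover part of this scale; NOTICE, ALERT, and EMERGENCY have no stdlib equivalent and must be assigned explicitly. A small, unofficial mapping sketch:

```python
import logging

# Common convention for mapping stdlib levels to Cloud Logging severities.
# NOTICE, ALERT, and EMERGENCY are not represented in stdlib logging.
PYTHON_TO_GCP_SEVERITY = {
    logging.DEBUG: "DEBUG",
    logging.INFO: "INFO",
    logging.WARNING: "WARNING",
    logging.ERROR: "ERROR",
    logging.CRITICAL: "CRITICAL",
}


def gcp_severity(levelno: int) -> str:
    """Return the Cloud Logging severity string for a stdlib log level."""
    return PYTHON_TO_GCP_SEVERITY.get(levelno, "DEFAULT")
```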

## Log Filtering Queries

### Common Filters

```
# By severity
severity >= WARNING

# By resource
resource.type="cloud_run_revision"
resource.labels.service_name="my-service"

# By time
timestamp >= "2025-01-15T00:00:00Z"

# By text content
textPayload =~ "error.*timeout"

# By JSON field
jsonPayload.user_id = "123"

# Combined
severity >= ERROR AND resource.labels.service_name="api"
```
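The same filter syntax works in the Logs Explorer, `gcloud logging read`, and the client libraries. A hedged sketch with the google-cloud-logging Python client, assuming Application Default Credentials are configured; the project ID and filter are placeholders:

```python
from google.cloud import logging  # pip install google-cloud-logging

client = logging.Client(project="my-project")  # placeholder project ID

# Same filter syntax as in the Logs Explorer.
FILTER = 'severity>=ERROR AND resource.labels.service_name="api"'

for entry in client.list_entries(
    filter_=FILTER,
    order_by=logging.DESCENDING,  # newest entries first
    max_results=20,
):
    print(entry.timestamp, entry.severity, entry.payload)
```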

### Advanced Queries

```
# Regex matching
textPayload =~ "status=[45][0-9]{2}"

# Substring search
textPayload : "connection refused"

# Multiple values
severity = (ERROR OR CRITICAL)
```

## Metrics vs Logs vs Traces

### When to Use Each

**Metrics:** Aggregated numeric data over time

- Request counts, latency percentiles
- Resource utilization (CPU, memory)
- Business KPIs (orders/minute)

**Logs:** Detailed event records

- Error details and stack traces
- Audit trails
- Debugging specific requests

**Traces:** Request flow across services

- Latency breakdown by service
- Identifying bottlenecks
- Distributed system debugging
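One common way to get request flow into Cloud Trace from a Python service is OpenTelemetry with the GCP trace exporter. A sketch under the assumption that `opentelemetry-sdk` and `opentelemetry-exporter-gcp-trace` are installed; the handler and attribute names are invented for illustration:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

# Export spans to Cloud Trace in batches.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


def charge_card(amount_cents: int) -> None:  # hypothetical handler
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("payment.amount_cents", amount_cents)
        # ... call the downstream payment service here ...
```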

## Alert Policy Design

### Alert Best Practices

- **Avoid alert fatigue:** Only alert on actionable issues
- **Use multi-condition alerts:** Reduce noise from transient spikes
- **Set appropriate windows:** 5-15 min for most metrics
- **Include runbook links:** Help responders act quickly

### Common Alert Patterns

**Error rate:**

- Condition: Error rate > 1% for 5 minutes
- Good for: Service health monitoring

**Latency:**

- Condition: P99 latency > 2s for 10 minutes
- Good for: Performance degradation detection

**Resource exhaustion:**

- Condition: Memory > 90% for 5 minutes
- Good for: Capacity planning triggers
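The error-rate pattern above can also be created through the Cloud Monitoring API rather than the console. A rough sketch with the google-cloud-monitoring client; the metric filter, threshold, display names, and runbook URL are placeholders, and a real policy would also attach notification channels:

```python
from google.cloud import monitoring_v3  # pip install google-cloud-monitoring
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()
project = "projects/my-project"  # placeholder project

policy = monitoring_v3.AlertPolicy(
    display_name="API 5xx responses (placeholder)",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="5xx rate above threshold for 5 minutes",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                # Placeholder filter: adjust to the metric you actually monitor.
                filter=(
                    'metric.type="run.googleapis.com/request_count" '
                    'AND resource.type="cloud_run_revision" '
                    'AND metric.labels.response_code_class="5xx"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=10,  # placeholder threshold
                duration=duration_pb2.Duration(seconds=300),
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period=duration_pb2.Duration(seconds=300),
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
                    )
                ],
            ),
        )
    ],
    documentation=monitoring_v3.AlertPolicy.Documentation(
        content="Runbook: https://example.com/runbooks/api-5xx",  # placeholder link
    ),
)

created = client.create_alert_policy(name=project, alert_policy=policy)
print(created.name)
```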

## Cost Optimization

### Reducing Log Costs

- **Exclusion filters:** Drop verbose logs at ingestion
- **Sampling:** Log only percentage of high-volume events
- **Shorter retention:** Reduce default 30-day retention
- **Route to cheaper destinations:** Use log sinks to send low-value logs to Cloud Storage instead of log buckets

### Exclusion Filter Examples

```
# Exclude health checks
resource.type="cloud_run_revision" AND httpRequest.requestUrl="/health"

# Exclude debug logs in production
severity = DEBUG
```
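These exclusions can be attached in the Logs Router UI or through the Logging API. A tentative sketch using the generated config client from google-cloud-logging; treat the import paths and the call as assumptions to verify against the library docs:

```python
from google.cloud.logging_v2.services.config_service_v2 import ConfigServiceV2Client
from google.cloud.logging_v2.types import LogExclusion

config_client = ConfigServiceV2Client()

exclusion = LogExclusion(
    name="exclude-health-checks",  # placeholder name
    description="Drop Cloud Run health-check requests at ingestion",
    filter='resource.type="cloud_run_revision" AND httpRequest.requestUrl="/health"',
)

created = config_client.create_exclusion(
    parent="projects/my-project",  # placeholder project
    exclusion=exclusion,
)
print(created.name)
```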

## Debugging Workflow

1. **Start with metrics:** Identify when issues started
2. **Correlate with logs:** Filter logs around problem time
3. **Use traces:** Follow specific requests across services
4. **Check resource logs:** Look for infrastructure issues
5. **Compare baselines:** Check against known-good periods
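For step 2, scripting the time window avoids hand-writing timestamps. A small sketch; the incident time and service name are placeholders:

```python
from datetime import datetime, timedelta, timezone

# Placeholder: the moment the metrics dashboard showed the spike.
incident_start = datetime(2025, 1, 15, 10, 30, tzinfo=timezone.utc)

# Look a few minutes before and after the spike.
window_filter = (
    'resource.labels.service_name="api"'
    " AND severity>=WARNING"
    f' AND timestamp>="{(incident_start - timedelta(minutes=5)).isoformat()}"'
    f' AND timestamp<="{(incident_start + timedelta(minutes=15)).isoformat()}"'
)
print(window_filter)  # paste into the Logs Explorer or pass to list_entries()
```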

Overview

This skill helps engineers troubleshoot Google Cloud services through their observability signals: logs, metrics, and traces. It provides practical guidance for building structured logs, composing Cloud Logging queries, designing alert policies, and optimizing log costs. Use it to streamline debugging and improve signal quality across GCP services.

How this skill works

The skill covers common GCP observability patterns and recommends concrete actions: JSON structured logging, severity use, log query examples, and alert conditions. It outlines a step-by-step debugging workflow that starts with metrics, narrows with logs, and resolves with traces and infrastructure checks. It also suggests cost controls such as exclusion filters and sampling.

When to use it

  • When you need precise Cloud Logging queries for debugging production incidents
  • When designing or refining alerting policies to reduce noise
  • When you want to implement structured JSON logs for easier querying
  • When investigating latency or request-flow issues with traces
  • When reducing logging costs through exclusions and sampling

Best practices

  • Emit structured JSON logs with clear severity, timestamp, labels, and request context
  • Choose severity levels intentionally: DEBUG → EMERGENCY for filtering and escalation
  • Start investigations with metrics, then correlate timestamps with logs and traces
  • Use multi-condition alerts and reasonable windows (5–15 minutes) to avoid fatigue
  • Apply exclusion filters and sampling to high-volume, low-value logs to save cost

Example use cases

  • Find all recent production errors: severity >= ERROR AND resource.labels.service_name="api"
  • Track a single request across services using traces to identify a latency hotspot
  • Create an alert: P99 latency > 2s for 10 minutes to detect performance regressions
  • Exclude health check traffic with an ingestion filter: requestUrl="/health"
  • Reduce debug log ingestion by sampling or routing DEBUG severity to cheaper storage

FAQ

When should I use metrics vs logs vs traces?

Use metrics for aggregated numeric trends (latency, error rate), logs for detailed events and stack traces, and traces to follow request flow and identify service bottlenecks.

How do I avoid alert fatigue?

Alert only on actionable conditions, combine multiple conditions, use appropriate evaluation windows, and include runbook links so responders can act quickly.