home / skills / acedergren / oci-agent-skills / monitoring-operations

monitoring-operations skill

/skills/monitoring-operations

This skill helps you configure OCI Monitoring alarms, troubleshoot missing data, and optimize log collection within a Landing Zone for reliable observability.

npx playbooks add skill acedergren/oci-agent-skills --skill monitoring-operations

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
6.1 KB
---
name: monitoring-operations
description: Use when setting up metrics, alarms, or troubleshooting missing data in OCI Monitoring. Covers metric namespace confusion, alarm threshold gotchas, log collection setup, and common monitoring gaps.
license: MIT
metadata:
  author: alexander-cedergren
  version: "2.0.0"
---

# OCI Monitoring and Observability - Expert Knowledge

## 🏗️ Use OCI Landing Zone Terraform Modules

**Don't reinvent the wheel.** Use [oracle-terraform-modules/landing-zone](https://github.com/oracle-terraform-modules/terraform-oci-landing-zones) for observability stack.

**Landing Zone solves:**
- ❌ Bad Practice #10: No logging, monitoring, notifications (Landing Zone deploys complete observability)
- ❌ Bad Practice #7: Limited security services (Landing Zone integrates Cloud Guard, VSS, OSMS)

**This skill provides**: Metrics, alarms, and troubleshooting for monitoring deployed WITHIN a Landing Zone.

---

## ⚠️ OCI CLI/API Knowledge Gap

**You don't know OCI CLI commands or OCI API structure.**

Your training data has limited and outdated knowledge of:
- OCI CLI syntax and parameters (updates monthly)
- OCI API endpoints and request/response formats
- Monitoring service CLI operations (`oci monitoring alarm`, `oci monitoring metric`)
- Metric namespaces and MQL (Monitoring Query Language)
- Latest Logging and Service Connector features

**When OCI operations are needed:**
1. Use exact CLI commands from this skill's references
2. Do NOT guess metric namespace names
3. Do NOT assume AWS CloudWatch patterns work in OCI
4. Load reference files for detailed MQL documentation

**What you DO know:**
- General observability concepts
- Alerting and threshold design principles
- Log aggregation patterns

This skill bridges the gap by providing current OCI-specific monitoring patterns and gotchas.

---

## NEVER Do This

❌ **NEVER assume metrics are instant (10-15 minute lag)**
- Metrics published every 1-5 minutes
- Processing delay: 5-10 minutes
- **Total lag**: 10-15 minutes from event to visible metric
- Don't debug "missing metrics" within first 15 minutes of resource creation

❌ **NEVER use `=` for alarm thresholds with sparse metrics**
```
# WRONG - alarm never fires if metric has gaps
MetricName[1m].mean() = 0

# RIGHT - handle missing data
MetricName[1m]{dataMissing=zero}.mean() > 0
```

❌ **NEVER forget metric dimensions (causes "no data")**
```
# WRONG - missing required dimension
CPUUtilization[1m].mean()

# RIGHT - include resourceId dimension
CPUUtilization[1m]{resourceId="<instance-ocid>"}.mean()
```

❌ **NEVER set alarm thresholds without trigger delay (alert fatigue)**
```
# BAD - fires on every CPU spike
CPUUtilization[1m].mean() > 80

# BETTER - sustained high CPU
CPUUtilization[5m].mean() > 80
Trigger delay: 5 minutes (fires after 5 consecutive breaches)
```

❌ **NEVER create alarms without notification channels**
```
# WRONG - alarm fires but nobody knows
oci monitoring alarm create ... --destinations '[]'

# RIGHT - always link to notification topic
oci monitoring alarm create ... --destinations '["<notification-topic-ocid>"]'
```
Cost impact: Undetected outages cost $5,000-50,000/hour in production

❌ **NEVER ignore Cloud Guard findings (security audit failure)**
- Cloud Guard detects misconfigurations BEFORE they become incidents
- Integrate Cloud Guard → Notifications → Email/Slack/PagerDuty
- Cost impact: $100,000+ per security breach vs $0 for proactive remediation

## Metric Namespace Gotchas

**OCI Metrics Use Service-Specific Namespaces:**

| Service | Namespace | Example Metric |
|---------|-----------|----------------|
| Compute | `oci_computeagent` | `CPUUtilization`, `MemoryUtilization` |
| Autonomous DB | `oci_autonomous_database` | `CpuUtilization`, `StorageUtilization` |
| Load Balancer | `oci_lbaas` | `HttpRequests`, `UnHealthyBackendServers` |
| Object Storage | `oci_objectstorage` | `ObjectCount`, `BytesUploaded` |

**Common Mistake**: Using wrong namespace (`oci_compute` vs `oci_computeagent`)

## Alarm Missing Data Handling

| Setting | Behavior | Use When |
|---------|----------|----------|
| `treatMissingDataAsBreaching` | Alarm fires if no data | Critical services (outage = breach) |
| `treatMissingDataAsNotBreaching` | Alarm silent if no data | Optional monitoring |
| `{dataMissing=zero}` | Treat missing as 0 | Counters (requests/sec) |

## Log Collection Common Gaps

**Problem**: Logs not showing in Log Analytics

```
Logs not appearing?
├─ Is log enabled on resource?
│  └─ Compute: oci-compute-agent must be running
│  └─ Function: Logging enabled in function config
│
├─ Is Service Connector configured?
│  └─ Source: Log Group → Target: Log Analytics
│  └─ Check: Service Connector status = ACTIVE
│
├─ IAM policy for Service Connector?
│  └─ "Allow any-user to use log-content in tenancy"
│  └─ "Allow service loganalytics to READ logcontent in tenancy"
│
└─ 10-15 minute ingestion lag?
   └─ Wait before debugging
```

## Metric Query Optimization

**Expensive** (slow):
```
# Queries ALL instances
CPUUtilization[1m].mean()
```

**Optimized** (filter by dimension):
```
# Query specific instance
CPUUtilization[1m]{resourceId='<instance-ocid>'}.mean()
```

**Cost**: Queries free, but rate limited (1000 req/min)

## Progressive Loading References

### OCI Monitoring Reference (Official Oracle Documentation)

**WHEN TO LOAD** [`oci-monitoring-reference.md`](references/oci-monitoring-reference.md):
- Need comprehensive list of all OCI service metrics
- Understanding MQL (Monitoring Query Language) in depth
- Implementing complex alarm conditions and composites
- Need official Oracle guidance on Logging and Service Connector
- Setting up Log Analytics and APM integration

**Do NOT load** for:
- Quick alarm setup (examples in this skill)
- Common metric patterns (tables above)
- Troubleshooting decision trees (covered above)

---

## When to Use This Skill

- Alarms: threshold configuration, missing data handling, trigger delay
- Troubleshooting: metrics not showing, alarms not firing, namespace errors
- Log collection: Service Connector, IAM policies, missing logs
- Performance: query optimization, dimension filtering

Overview

This skill helps engineers set up, tune, and troubleshoot Oracle Cloud Infrastructure (OCI) Monitoring and Observability. It focuses on metric namespaces, alarm threshold and missing-data behavior, log collection gaps, and query optimization to reduce noisy alerts and blind spots. Use it to ensure reliable alerts and complete log ingestion across OCI resources.

How this skill works

I inspect common configuration mistakes and runtime gaps: wrong metric namespaces, missing metric dimensions, alarm MQL pitfalls, and Service Connector or IAM issues that block log flow. The guidance explains how OCI publishes metrics (collection cadence and lag), how to handle sparse or missing data in alarms, and how to filter metrics by dimensions to improve accuracy and performance. I also point to when to load the official OCI monitoring reference for deep MQL and metric details.

When to use it

  • Creating or reviewing OCI alarms and notification bindings
  • Troubleshooting metrics that show “no data” or delayed values
  • Configuring log collection to Log Analytics via Service Connector
  • Tuning alarm thresholds to avoid alert fatigue or missed outages
  • Optimizing metric queries to limit scope and rate usage

Best practices

  • Allow 10–15 minutes after resource creation before debugging missing metrics due to collection and processing lag
  • Always include required metric dimensions (for example resourceId) in queries and alarms
  • Handle missing data explicitly in MQL: use dataMissing=zero or treatMissingData settings as appropriate
  • Use sustained-window thresholds and trigger delays (e.g., 5m mean with a 5-minute trigger) to reduce noisy alerts
  • Always attach alarms to notification topics or channels and validate delivery during runbooks

Example use cases

  • Create an alarm for instance CPU that uses resourceId filtering and a 5-minute sustained threshold to avoid spike alerts
  • Troubleshoot missing Compute or Function logs by verifying agent/service connector status and IAM policies
  • Replace a failing alarm that used equality (=) on sparse metrics with a dataMissing-aware MQL expression
  • Optimize a dashboard query by restricting CPUUtilization to specific instance OCIDs to reduce rate and improve relevance

FAQ

Why aren’t my metrics showing immediately?

OCI metrics are published every 1–5 minutes and can take 5–10 minutes to process; allow ~10–15 minutes before troubleshooting.

My alarm never fires—what’s a common cause?

Often the alarm query omits required dimensions or uses equality on sparse data. Include resourceId and treat missing data explicitly (dataMissing=zero or treatMissingDataAsBreaching) depending on intent.