This skill helps you collect infrastructure metrics across compute, storage, network, and databases, configure agents, and build dashboards for real-time monitoring.

npx playbooks add skill jeremylongshore/claude-code-plugins-plus-skills --skill infrastructure-metrics-collector

SKILL.md
---
name: collecting-infrastructure-metrics
description: |
  This skill enables Claude to collect comprehensive infrastructure performance metrics across compute, storage, network, containers, load balancers, and databases. It is triggered when the user requests "collect infrastructure metrics", "monitor server performance", "set up performance dashboards", or needs to analyze system resource utilization. The skill configures metrics collection, sets up aggregation, and helps create infrastructure dashboards for health monitoring and capacity tracking. It supports configuration for Prometheus, Datadog, and CloudWatch.
---

## Overview

This skill automates setting up infrastructure metrics collection. It identifies key performance indicators (KPIs) for each infrastructure layer, configures agents to collect them, and assists in setting up central aggregation and visualization.

## How It Works

1. **Identify Infrastructure Layers**: Determines the infrastructure layers to monitor (compute, storage, network, containers, load balancers, databases).
2. **Configure Metrics Collection**: Sets up agents (Prometheus, Datadog, CloudWatch) to collect metrics from the identified layers.
3. **Aggregate Metrics**: Configures central aggregation of the collected metrics for analysis and visualization.
4. **Create Dashboards**: Generates infrastructure dashboards for health monitoring, performance analysis, and capacity tracking.
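
The four steps above can be sketched as a small planning function. This is only an illustration: `LAYER_METRICS`, `CollectionPlan`, `build_plan`, and the metric and dashboard names are hypothetical and are not part of the skill or of any monitoring tool's API.

```python
from dataclasses import dataclass

# Hypothetical catalog mapping each layer to a few example metrics.
LAYER_METRICS = {
    "compute": ["cpu_utilization", "memory_used_bytes"],
    "storage": ["disk_io_ops", "disk_used_bytes"],
    "network": ["bandwidth_bytes_per_second"],
    "containers": ["container_cpu_utilization", "container_restarts"],
    "load_balancers": ["request_count", "backend_latency_seconds"],
    "databases": ["connection_pool_usage", "replication_lag_seconds", "cache_hit_rate"],
}

SUPPORTED_BACKENDS = ("prometheus", "datadog", "cloudwatch")


@dataclass
class CollectionPlan:
    backend: str            # step 2: which tool collects the metrics
    metrics_by_layer: dict  # steps 1 and 3: what is collected and aggregated
    dashboards: list        # step 4: one dashboard name per layer


def build_plan(layers, backend):
    """Return a plan covering the requested layers for one backend."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend}")
    unknown = [layer for layer in layers if layer not in LAYER_METRICS]
    if unknown:
        raise ValueError(f"unknown layers: {unknown}")
    metrics_by_layer = {layer: list(LAYER_METRICS[layer]) for layer in layers}
    dashboards = [f"{layer}-health" for layer in layers]
    return CollectionPlan(backend, metrics_by_layer, dashboards)


plan = build_plan(["compute", "databases"], "prometheus")
print(plan.dashboards)
```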

## When to Use This Skill

This skill activates when you need to:
- Monitor the performance of your infrastructure.
- Identify bottlenecks in your system.
- Set up dashboards for real-time monitoring.

## Examples

### Example 1: Setting up basic monitoring

User request: "Collect infrastructure metrics for my web server."

The skill will:
1. Identify compute, storage, and network layers relevant to the web server.
2. Configure Prometheus to collect CPU, memory, disk I/O, and network bandwidth metrics.
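
To give a feel for what Prometheus would scrape in this scenario, here is a minimal sketch that gathers a few host values with the Python standard library and renders them in the Prometheus text exposition format. The `demo_` metric names and both function names are hypothetical. A real setup would more likely rely on an existing exporter such as node_exporter, which also covers the memory, disk I/O, and network metrics that the standard library does not expose.

```python
import os
import shutil


def collect_host_metrics(path="/"):
    """Collect a few host-level gauges using only the standard library."""
    usage = shutil.disk_usage(path)
    metrics = {
        "demo_disk_total_bytes": float(usage.total),
        "demo_disk_used_bytes": float(usage.used),
        "demo_cpu_count": float(os.cpu_count() or 0),
    }
    # Load average is not available on every platform, so treat it as optional.
    try:
        metrics["demo_load1"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass
    return metrics


def render_exposition(metrics):
    """Render gauges as Prometheus text exposition lines."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


print(render_exposition(collect_host_metrics(".")))
```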

### Example 2: Troubleshooting database performance

User request: "I'm seeing slow database queries. Can you help me monitor the database performance?"

The skill will:
1. Identify the database layer and relevant metrics such as connection pool usage, replication lag, and cache hit rates.
2. Configure Datadog to collect these metrics and create a dashboard to visualize performance trends.
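
The database metrics named above are usually derived from raw counters and timestamps. The sketch below shows one plausible way to compute them. The function names and input values are hypothetical, and real databases expose these figures under their own names.

```python
def cache_hit_rate(hits, misses):
    """Fraction of lookups served from cache; 0.0 when there was no traffic."""
    total = hits + misses
    return hits / total if total else 0.0


def pool_usage(active_connections, max_connections):
    """Fraction of the connection pool currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active_connections / max_connections


def replication_lag_seconds(primary_commit_ts, replica_apply_ts):
    """Seconds the replica trails the primary, clamped at zero."""
    return max(0.0, primary_commit_ts - replica_apply_ts)


# Hypothetical readings for one database instance.
print(cache_hit_rate(hits=9_000, misses=1_000))
print(pool_usage(active_connections=45, max_connections=50))
print(replication_lag_seconds(1_700_000_012.5, 1_700_000_010.0))
```

A falling cache hit rate alongside a nearly full connection pool is one pattern worth checking first when queries slow down.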

## Best Practices

- **Agent Selection**: Choose the appropriate agent (Prometheus, Datadog, CloudWatch) based on your existing infrastructure and monitoring tools.
- **Metric Granularity**: Balance the granularity of metrics collection against its storage and processing overhead. Collect only the metrics that are essential for your use case.
- **Alerting**: Configure alerts based on thresholds for key metrics to proactively identify and address performance issues.
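
As an illustration of threshold-based alerting, the sketch below fires only when a metric stays above its threshold for several consecutive samples, which avoids alerting on a single spike. The function name, the threshold, and the sample values are assumptions chosen for the example.

```python
def should_alert(samples, threshold, min_consecutive=3):
    """Return True when the last `min_consecutive` samples all exceed the threshold."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


# Hypothetical CPU utilization readings (fraction of capacity), oldest first.
sustained = [0.50, 0.95, 0.96, 0.97]
spike = [0.95, 0.50, 0.96, 0.97]

print(should_alert(sustained, threshold=0.90))
print(should_alert(spike, threshold=0.90))
```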

## Integration

This skill can be integrated with other Claude Code plugins for deployment, configuration management, and alerting to provide a comprehensive infrastructure management solution. For example, it can be used with a deployment plugin to automatically configure metrics collection after deploying new infrastructure.

Overview

This skill automates collection and aggregation of infrastructure performance metrics across compute, storage, network, containers, load balancers, and databases. It helps configure monitoring agents, centralizes metrics, and generates dashboards for health, performance analysis, and capacity planning. The skill supports Prometheus, Datadog, and CloudWatch for flexible deployment in cloud and on-prem environments.

How this skill works

The skill first identifies the infrastructure layers and the KPIs to monitor in your environment (CPU, memory, disk I/O, network, and container- and database-specific metrics). It then generates and applies configuration for the chosen agent (Prometheus, Datadog, or CloudWatch) to collect those metrics. Collected data is routed to a central aggregator, and the skill scaffolds dashboards and visualizations for real-time monitoring and trend analysis. It can also suggest alert thresholds and basic aggregation rules for capacity tracking.
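
For the Prometheus case, the generated configuration might resemble the structure built below. This is a sketch under stated assumptions: the helper name, job name, target address, and label values are hypothetical, and the skill's actual output is not shown in this document. The keys (`scrape_configs`, `job_name`, `scrape_interval`, `static_configs`, `targets`, `labels`) follow the Prometheus configuration format. The result is printed as JSON purely for inspection, since Prometheus itself reads YAML.

```python
import json


def prometheus_scrape_config(job_name, targets, interval_seconds=15, labels=None):
    """Build a dict shaped like a Prometheus scrape configuration."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    static_config = {"targets": list(targets)}
    if labels:
        static_config["labels"] = dict(labels)
    return {
        "scrape_configs": [
            {
                "job_name": job_name,
                "scrape_interval": f"{interval_seconds}s",
                "static_configs": [static_config],
            }
        ]
    }


# Hypothetical target: one host exposing metrics on port 9100.
config = prometheus_scrape_config(
    "node", ["web-01:9100"], labels={"environment": "staging"}
)
print(json.dumps(config, indent=2))
```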

When to use it

  • You need to monitor server, container, or database performance in real time.
  • You want to set up centralized dashboards after deploying new infrastructure.
  • You need to identify or troubleshoot performance bottlenecks across layers.
  • You want to implement capacity planning and long-term trend analysis.
  • You’re standardizing monitoring across multiple teams or environments.

Best practices

  • Choose the agent (Prometheus, Datadog, CloudWatch) that aligns with your existing tooling and permissions model.
  • Limit collected metrics to essential KPIs to reduce storage and processing costs.
  • Set sensible scrape or collection intervals to balance granularity with overhead.
  • Define alerting thresholds tied to service-level objectives, not arbitrary values.
  • Tag metrics consistently (service, environment, region) to enable useful aggregation and filtering.
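
Two of the practices above lend themselves to quick checks: estimating how the collection interval drives sample volume, and verifying that every metric carries the agreed tags. The sketch below is illustrative only. The helper names, the series count, and the tag values are assumptions.

```python
SECONDS_PER_DAY = 86_400
REQUIRED_TAGS = ("service", "environment", "region")


def samples_per_day(series_count, interval_seconds):
    """Number of samples stored per day for a given collection interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return series_count * SECONDS_PER_DAY // interval_seconds


def missing_tags(tags):
    """Return the required tag names that are absent or empty."""
    return [name for name in REQUIRED_TAGS if not tags.get(name)]


# 1,000 time series collected every 15 s versus every 60 s.
print(samples_per_day(1_000, 15))  # 5760000
print(samples_per_day(1_000, 60))  # 1440000

# A metric tagged without a region fails the consistency check.
print(missing_tags({"service": "checkout", "environment": "prod"}))  # ['region']
```

Moving from a 15-second to a 60-second interval cuts stored samples by a factor of four, which is the kind of trade-off to weigh against the granularity you need.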

Example use cases

  • Set up Prometheus scraping for a Kubernetes cluster to gather node, pod, and container metrics and build Grafana dashboards.
  • Configure Datadog to monitor database connection pools, query latency, and cache hit rates, and create an incident dashboard.
  • Use CloudWatch templates to collect EC2, EBS, and ELB metrics and generate capacity planning reports.
  • Identify a network bottleneck by correlating bandwidth metrics with application latency across load balancers and backend servers.
  • Onboard a new service by automatically applying agent configuration and adding it to central dashboards and alert policies.
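
The network-bottleneck use case comes down to checking whether two time series move together. One simple way to do that, sketched below, is a Pearson correlation over samples taken at the same timestamps. The function name and the sample values are invented for illustration, and a high correlation only suggests a relationship worth investigating, not a confirmed cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(var_x * var_y)


# Hypothetical samples: load balancer bandwidth (Mbit/s) and app latency (ms).
bandwidth_mbps = [200, 400, 600, 800, 950]
latency_ms = [20, 22, 30, 55, 120]

print(round(pearson(bandwidth_mbps, latency_ms), 3))
```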

FAQ

Which monitoring agents are supported?

Prometheus, Datadog, and CloudWatch are supported; agent selection depends on your environment and existing tooling.

Can the skill create alerting rules?

Yes. It suggests and can scaffold basic alert thresholds tied to key metrics, but operational owners should fine-tune them.