home / skills / giuseppe-trisciuoglio / developer-kit / aws-cloudformation-cloudwatch
/plugins/developer-kit-aws/skills/aws-cloudformation/aws-cloudformation-cloudwatch
This skill helps you implement production-grade AWS CloudWatch monitoring with CloudFormation templates, covering metrics, alarms, dashboards, logs, and
npx playbooks add skill giuseppe-trisciuoglio/developer-kit --skill aws-cloudformation-cloudwatchReview the files below or copy the command above to add this skill to your agents.
---
name: aws-cloudformation-cloudwatch
description: Provides AWS CloudFormation patterns for CloudWatch monitoring, metrics, alarms, dashboards, logs, and observability. Use when creating CloudWatch metrics, alarms, dashboards, log groups, log subscriptions, anomaly detection, synthesized canaries, Application Signals, and implementing template structure with Parameters, Outputs, Mappings, Conditions, cross-stack references, and CloudWatch best practices for monitoring production infrastructure.
category: aws
tags: [aws, cloudformation, cloudwatch, monitoring, observability, metrics, alarms, logs, dashboards, infrastructure, iaac]
version: 1.0.0
allowed-tools: Read, Write, Bash
---
# AWS CloudFormation CloudWatch Monitoring
## Overview
Create production-ready monitoring and observability infrastructure using AWS CloudFormation templates. This skill covers CloudWatch metrics, alarms, dashboards, log groups, log insights, anomaly detection, synthesized canaries, Application Signals, and best practices for parameters, outputs, and cross-stack references.
## When to Use
Use this skill when:
- Creating custom CloudWatch metrics
- Configuring CloudWatch alarms for thresholds and anomaly detection
- Creating CloudWatch dashboards for multi-region visualization
- Implementing log groups with retention and encryption
- Configuring log subscriptions and cross-account log aggregation
- Implementing synthesized canaries for synthetic monitoring
- Enabling Application Signals for application performance monitoring
- Organizing templates with Parameters, Outputs, Mappings, Conditions
- Implementing cross-stack references with export/import
- Using Transform for macros and reuse
## Instructions
Follow these steps to create CloudWatch monitoring infrastructure with CloudFormation:
1. **Define Alarm Parameters**: Specify metric namespaces, dimensions, and threshold values
2. **Create CloudWatch Alarms**: Set up alarms for CPU, memory, disk, and custom metrics
3. **Configure Alarm Actions**: Define SNS topics for notification delivery
4. **Create Dashboards**: Build visualization widgets for metrics across resources
5. **Set Up Log Groups**: Configure retention policies and encryption settings
6. **Implement Anomaly Detection**: Enable ML-based detection for unusual patterns
7. **Add Synthesized Canaries**: Create CI/CD checks for critical endpoints
8. **Configure Composite Alarms**: Build multi-condition alarm logic
For complete examples, see the [EXAMPLES.md](references/examples.md) file.
## Examples
The following examples demonstrate common CloudWatch patterns:
### Example 1: CPU Utilization Alarm
```yaml
HighCpuAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-high-cpu"
AlarmDescription: Trigger when CPU utilization exceeds threshold
MetricName: CPUUtilization
Namespace: AWS/EC2
Dimensions:
- Name: InstanceId
Value: !Ref InstanceId
Statistic: Average
Period: 60
EvaluationPeriods: 3
Threshold: 80
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- !Ref SnsTopicArn
```
### Example 2: Dashboard with Multiple Widgets
```yaml
Dashboard:
Type: AWS::CloudWatch::Dashboard
Properties:
DashboardName: !Sub "${AWS::StackName}-dashboard"
DashboardBody: !Sub |
{
"widgets": [
{
"type": "metric",
"x": 0, "y": 0,
"width": 12, "height": 6,
"properties": {
"title": "CPU Utilization",
"metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", "!Ref InstanceId"]],
"period": 300,
"stat": "Average",
"region": "!Ref AWS::Region"
}
}
]
}
```
### Example 3: Log Group with Retention
```yaml
ApplicationLogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: !Sub "/ecs/${AWS::StackName}"
RetentionInDays: 30
KmsKeyId: !Ref KmsKeyArn
```
For complete production-ready examples, see [EXAMPLES.md](references/examples.md).
## CloudFormation Template Structure
### Base Template with Standard Format
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch monitoring and observability stack
Metadata:
AWS::CloudFormation::Interface:
ParameterGroups:
- Label:
default: Monitoring Configuration
Parameters:
- Environment
- LogRetentionDays
- EnableAnomalyDetection
- Label:
default: Alarm Thresholds
Parameters:
- ErrorRateThreshold
- LatencyThreshold
- CpuUtilizationThreshold
Parameters:
Environment:
Type: String
Default: dev
AllowedValues:
- dev
- staging
- production
Description: Deployment environment
LogRetentionDays:
Type: Number
Default: 30
AllowedValues:
- 1
- 3
- 5
- 7
- 14
- 30
- 60
- 90
- 120
- 150
- 180
- 365
- 400
- 545
- 731
- 1095
- 1827
- 2190
- 2555
- 2922
- 3285
- 3650
Description: Number of days to retain log events
EnableAnomalyDetection:
Type: String
Default: false
AllowedValues:
- true
- false
Description: Enable CloudWatch anomaly detection
ErrorRateThreshold:
Type: Number
Default: 5
Description: Error rate threshold for alarms (percentage)
LatencyThreshold:
Type: Number
Default: 1000
Description: Latency threshold in milliseconds
CpuUtilizationThreshold:
Type: Number
Default: 80
Description: CPU utilization threshold (percentage)
Mappings:
EnvironmentConfig:
dev:
LogRetentionDays: 7
ErrorRateThreshold: 10
LatencyThreshold: 2000
CpuUtilizationThreshold: 90
staging:
LogRetentionDays: 14
ErrorRateThreshold: 5
LatencyThreshold: 1500
CpuUtilizationThreshold: 85
production:
LogRetentionDays: 30
ErrorRateThreshold: 1
LatencyThreshold: 500
CpuUtilizationThreshold: 80
Conditions:
IsProduction: !Equals [!Ref Environment, production]
IsStaging: !Equals [!Ref Environment, staging]
EnableAnomaly: !Equals [!Ref EnableAnomalyDetection, true]
Transform:
- AWS::Serverless-2016-10-31
Resources:
# Log Group per applicazione
ApplicationLogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: !Sub "/aws/applications/${Environment}/application"
RetentionInDays: !Ref LogRetentionDays
KmsKeyId: !Ref LogKmsKey
Tags:
- Key: Environment
Value: !Ref Environment
- Key: Application
Value: !Ref ApplicationName
Outputs:
LogGroupName:
Description: Name of the application log group
Value: !Ref ApplicationLogGroup
Export:
Name: !Sub "${AWS::StackName}-LogGroupName"
```
## Parameters Best Practices
### AWS-Specific Parameter Types
```yaml
Parameters:
# AWS-specific types for validation
CloudWatchNamespace:
Type: AWS::CloudWatch::Namespace
Description: CloudWatch metric namespace
AlarmActionArn:
Type: AWS::SNS::Topic::Arn
Description: SNS topic ARN for alarm actions
LogKmsKeyArn:
Type: AWS::KMS::Key::Arn
Description: KMS key ARN for log encryption
DashboardArn:
Type: AWS::CloudWatch::Dashboard::Arn
Description: Existing dashboard ARN to import
AnomalyDetectorArn:
Type: AWS::CloudWatch::AnomalyDetector::Arn
Description: Existing anomaly detector ARN
```
### Parameter Constraints
```yaml
Parameters:
MetricName:
Type: String
Description: CloudWatch metric name
ConstraintDescription: Must be 1-256 characters, alphanumeric, underscore, period, dash
MinLength: 1
MaxLength: 256
AllowedPattern: "[a-zA-Z0-9._-]+"
ThresholdValue:
Type: Number
Description: Alarm threshold value
MinValue: 0
MaxValue: 1000000000
EvaluationPeriods:
Type: Number
Description: Number of evaluation periods
Default: 5
MinValue: 1
MaxValue: 100
ConstraintDescription: Must be between 1 and 100
DatapointsToAlarm:
Type: Number
Description: Datapoints that must breach to trigger alarm
Default: 5
MinValue: 1
MaxValue: 10
Period:
Type: Number
Description: Metric period in seconds
Default: 300
AllowedValues:
- 10
- 30
- 60
- 300
- 900
- 3600
- 21600
- 86400
ComparisonOperator:
Type: String
Description: Alarm comparison operator
Default: GreaterThanThreshold
AllowedValues:
- GreaterThanThreshold
- GreaterThanOrEqualToThreshold
- LessThanThreshold
- LessThanOrEqualToThreshold
- GreaterThanUpperBound
- LessThanLowerBound
```
### SSM Parameter References
```yaml
Parameters:
AlarmTopicArn:
Type: AWS::SSM::Parameter::Value<AWS::SNS::Topic::Arn>
Default: /monitoring/alarms/topic-arn
Description: SNS topic ARN from SSM Parameter Store
DashboardConfig:
Type: AWS::SSM::Parameter::Value<String>
Default: /monitoring/dashboards/config
Description: Dashboard configuration from SSM
```
## Outputs and Cross-Stack References
### Export/Import Patterns
```yaml
# Stack A - Monitoring Stack
AWSTemplateFormatVersion: 2010-09-09
Description: Central monitoring infrastructure stack
Resources:
AlarmTopic:
Type: AWS::SNS::Topic
Properties:
TopicName: !Sub "${AWS::StackName}-alarms"
DisplayName: !Sub "${AWS::StackName} Alarm Notifications"
LogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: !Sub "/aws/monitoring/${AWS::StackName}"
RetentionInDays: 30
Outputs:
AlarmTopicArn:
Description: ARN of the alarm SNS topic
Value: !Ref AlarmTopic
Export:
Name: !Sub "${AWS::StackName}-AlarmTopicArn"
LogGroupName:
Description: Name of the log group
Value: !Ref LogGroup
Export:
Name: !Sub "${AWS::StackName}-LogGroupName"
LogGroupArn:
Description: ARN of the log group
Value: !GetAtt LogGroup.Arn
Export:
Name: !Sub "${AWS::StackName}-LogGroupArn"
```
```yaml
# Stack B - Application Stack (imports from Monitoring Stack)
AWSTemplateFormatVersion: 2010-09-09
Description: Application stack with monitoring integration
Parameters:
MonitoringStackName:
Type: String
Description: Name of the monitoring stack
Resources:
LambdaFunction:
Type: AWS::Lambda::Function
Properties:
FunctionName: !Sub "${AWS::StackName}-processor"
Runtime: python3.11
Handler: app.handler
Code:
S3Bucket: !Ref CodeBucket
S3Key: lambda/function.zip
Role: !GetAtt LambdaExecutionRole.Arn
ErrorAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-errors"
AlarmDescription: Alert on Lambda errors
MetricName: Errors
Namespace: AWS/Lambda
Dimensions:
- Name: FunctionName
Value: !Ref LambdaFunction
Statistic: Sum
Period: 60
EvaluationPeriods: 5
Threshold: 1
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- !ImportValue
!Sub "${MonitoringStackName}-AlarmTopicArn"
HighLatencyAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-latency"
AlarmDescription: Alert on high latency
MetricName: Duration
Namespace: AWS/Lambda
Dimensions:
- Name: FunctionName
Value: !Ref LambdaFunction
Statistic: P99
Period: 60
EvaluationPeriods: 3
Threshold: 5000
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- !ImportValue
!Sub "${MonitoringStackName}-AlarmTopicArn"
```
### Nested Stacks for Modularity
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: Main stack with nested monitoring stacks
Resources:
# Nested stack for alarms
AlarmsStack:
Type: AWS::CloudFormation::Stack
Properties:
TemplateURL: https://s3.amazonaws.com/bucket/monitoring/alarms.yaml
TimeoutInMinutes: 15
Parameters:
Environment: !Ref Environment
AlarmTopicArn: !Ref AlarmTopicArn
# Nested stack for dashboards
DashboardsStack:
Type: AWS::CloudFormation::Stack
Properties:
TemplateURL: https://s3.amazonaws.com/bucket/monitoring/dashboards.yaml
TimeoutInMinutes: 15
Parameters:
Environment: !Ref Environment
LogGroupNames: !Join [",", [!GetAtt AlarmsStack.Outputs.LogGroupName]]
# Nested stack for log insights
LogInsightsStack:
Type: AWS::CloudFormation::Stack
Properties:
TemplateURL: https://s3.amazonaws.com/bucket/monitoring/log-insights.yaml
TimeoutInMinutes: 15
Parameters:
Environment: !Ref Environment
```
## CloudWatch Metrics and Alarms
### Base Metric Alarm
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch metric alarms
Resources:
# Error rate alarm
ErrorRateAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-error-rate"
AlarmDescription: Alert when error rate exceeds threshold
MetricName: ErrorRate
Namespace: !Ref CustomNamespace
Dimensions:
- Name: Service
Value: !Ref ServiceName
- Name: Environment
Value: !Ref Environment
Statistic: Average
Period: 60
EvaluationPeriods: 5
DatapointsToAlarm: 3
Threshold: !Ref ErrorRateThreshold
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- !Ref AlarmTopic
InsufficientDataActions:
- !Ref AlarmTopic
OKActions:
- !Ref AlarmTopic
Tags:
- Key: Environment
Value: !Ref Environment
- Key: Severity
Value: high
# P99 latency alarm
LatencyAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-p99-latency"
AlarmDescription: Alert when P99 latency exceeds threshold
MetricName: Latency
Namespace: !Ref CustomNamespace
Dimensions:
- Name: Service
Value: !Ref ServiceName
Statistic: p99
ExtendedStatistic: "p99"
Period: 60
EvaluationPeriods: 3
Threshold: !Ref LatencyThreshold
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions:
- !Ref AlarmTopic
# 4xx errors alarm
ClientErrorAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-4xx-errors"
AlarmDescription: Alert on high 4xx error rate
MetricName: 4XXError
Namespace: AWS/ApiGateway
Dimensions:
- Name: ApiName
Value: !Ref ApiName
- Name: Stage
Value: !Ref StageName
Statistic: Sum
Period: 300
EvaluationPeriods: 2
Threshold: 100
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- !Ref AlarmTopic
# 5xx errors alarm
ServerErrorAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-5xx-errors"
AlarmDescription: Alert on high 5xx error rate
MetricName: 5XXError
Namespace: AWS/ApiGateway
Dimensions:
- Name: ApiName
Value: !Ref ApiName
- Name: Stage
Value: !Ref StageName
Statistic: Sum
Period: 60
EvaluationPeriods: 2
Threshold: 10
ComparisonOperator: GreaterThanThreshold
ComparisonOperator: GreaterThanOrEqualToThreshold
AlarmActions:
- !Ref AlarmTopic
```
### Composite Alarm
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch composite alarms
Resources:
# Base alarm for Lambda errors
LambdaErrorAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-lambda-errors"
MetricName: Errors
Namespace: AWS/Lambda
Dimensions:
- Name: FunctionName
Value: !Ref LambdaFunction
Statistic: Sum
Period: 60
EvaluationPeriods: 5
Threshold: 5
ComparisonOperator: GreaterThanThreshold
# Base alarm for Lambda throttles
LambdaThrottleAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-lambda-throttles"
MetricName: Throttles
Namespace: AWS/Lambda
Dimensions:
- Name: FunctionName
Value: !Ref LambdaFunction
Statistic: Sum
Period: 60
EvaluationPeriods: 5
Threshold: 3
ComparisonOperator: GreaterThanThreshold
# Composite alarm combining both
LambdaHealthCompositeAlarm:
Type: AWS::CloudWatch::CompositeAlarm
Properties:
AlarmName: !Sub "${AWS::StackName}-lambda-health"
AlarmDescription: Composite alarm for Lambda function health
AlarmRule: !Or
- !Ref LambdaErrorAlarm
- !Ref LambdaThrottleAlarm
ActionsEnabled: true
AlarmActions:
- !Ref AlarmTopic
Tags:
- Key: Service
Value: lambda
- Key: Tier
Value: application
```
### Anomaly Detection Alarm
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch anomaly detection
Resources:
# Anomaly detector for metric
RequestRateAnomalyDetector:
Type: AWS::CloudWatch::AnomalyDetector
Properties:
MetricName: RequestCount
Namespace: !Ref CustomNamespace
Dimensions:
- Name: Service
Value: !Ref ServiceName
- Name: Environment
Value: !Ref Environment
Statistic: Sum
Configuration:
ExcludedTimeRanges:
- StartTime: "2023-12-25T00:00:00"
EndTime: "2023-12-26T00:00:00"
MetricTimeZone: UTC
# Alarm based on anomaly detection
AnomalyAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-anomaly-detection"
AlarmDescription: Alert on anomalous metric behavior
MetricName: RequestCount
Namespace: !Ref CustomNamespace
Dimensions:
- Name: Service
Value: !Ref ServiceName
AnomalyDetectorConfiguration:
ExcludeTimeRange:
StartTime: "2023-12-25T00:00:00"
EndTime: "2023-12-26T00:00:00"
Statistic: Sum
Period: 300
EvaluationPeriods: 2
Threshold: 2
ComparisonOperator: GreaterThanUpperThreshold
AlarmActions:
- !Ref AlarmTopic
# Alarm for low anomalous value
LowTrafficAnomalyAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-low-traffic"
AlarmDescription: Alert on unusually low traffic
MetricName: RequestCount
Namespace: !Ref CustomNamespace
Dimensions:
- Name: Service
Value: !Ref ServiceName
AnomalyDetectorConfiguration:
Bound: Lower
Statistic: Sum
Period: 300
EvaluationPeriods: 3
Threshold: 0.5
ComparisonOperator: LessThanLowerThreshold
AlarmActions:
- !Ref AlarmTopic
```
## CloudWatch Dashboards
### Dashboard Base
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch dashboard
Resources:
# Main dashboard
MainDashboard:
Type: AWS::CloudWatch::Dashboard
Properties:
DashboardName: !Sub "${AWS::StackName}-main"
DashboardBody: !Sub |
{
"widgets": [
{
"type": "metric",
"x": 0,
"y": 0,
"width": 12,
"height": 6,
"properties": {
"title": "API Gateway Requests",
"view": "timeSeries",
"stacked": false,
"region": "${AWS::Region}",
"metrics": [
["AWS/ApiGateway", "Count", "ApiName", "${ApiName}", "Stage", "${StageName}"],
[".", "4XXError", ".", ".", ".", "."],
[".", "5XXError", ".", ".", ".", "."]
],
"period": 300,
"stat": "Sum"
}
},
{
"type": "metric",
"x": 12,
"y": 0,
"width": 12,
"height": 6,
"properties": {
"title": "API Gateway Latency",
"view": "timeSeries",
"region": "${AWS::Region}",
"metrics": [
["AWS/ApiGateway", "Latency", "ApiName", "${ApiName}", "Stage", "${StageName}", {"stat": "p99"}],
[".", ".", ".", ".", ".", ".", {"stat": "Average"}]
],
"period": 300
}
},
{
"type": "metric",
"x": 0,
"y": 6,
"width": 12,
"height": 6,
"properties": {
"title": "Lambda Invocations",
"view": "timeSeries",
"region": "${AWS::Region}",
"metrics": [
["AWS/Lambda", "Invocations", "FunctionName", "${LambdaFunction}"],
[".", "Errors", ".", "."],
[".", "Throttles", ".", "."]
],
"period": 60,
"stat": "Sum"
}
},
{
"type": "metric",
"x": 12,
"y": 6,
"width": 12,
"height": 6,
"properties": {
"title": "Lambda Duration",
"view": "timeSeries",
"region": "${AWS::Region}",
"metrics": [
["AWS/Lambda", "Duration", "FunctionName", "${LambdaFunction}", {"stat": "p99"}],
[".", ".", ".", ".", {"stat": "Average"}],
[".", ".", ".", ".", {"stat": "Maximum"}]
],
"period": 60
}
},
{
"type": "log",
"x": 0,
"y": 12,
"width": 24,
"height": 6,
"properties": {
"title": "Application Logs",
"view": "table",
"region": "${AWS::Region}",
"logGroupName": "${ApplicationLogGroup}",
"timeRange": {
"type": "relative",
"from": 3600
},
"filterPattern": "ERROR | WARN"
}
}
]
}
# Dashboard for specific service
ServiceDashboard:
Type: AWS::CloudWatch::Dashboard
Properties:
DashboardName: !Sub "${AWS::StackName}-${ServiceName}"
DashboardBody: !Sub |
{
"start": "-PT6H",
"widgets": [
{
"type": "text",
"x": 0,
"y": 0,
"width": 24,
"height": 1,
"properties": {
"markdown": "# ${ServiceName} - ${Environment} Dashboard"
}
},
{
"type": "metric",
"x": 0,
"y": 1,
"width": 8,
"height": 6,
"properties": {
"title": "Request Rate",
"view": "timeSeries",
"stacked": false,
"region": "${AWS::Region}",
"metrics": [
["${CustomNamespace}", "RequestCount", "Service", "${ServiceName}", "Environment", "${Environment}"]
],
"period": 60,
"stat": "Sum"
}
},
{
"type": "metric",
"x": 8,
"y": 1,
"width": 8,
"height": 6,
"properties": {
"title": "Error Rate %",
"view": "timeSeries",
"region": "${AWS::Region}",
"metrics": [
["${CustomNamespace}", "ErrorCount", "Service", "${ServiceName}"],
[".", "RequestCount", ".", "."],
[".", "SuccessCount", ".", "."]
],
"period": 60,
"stat": "Average"
}
},
{
"type": "metric",
"x": 16,
"y": 1,
"width": 8,
"height": 6,
"properties": {
"title": "P99 Latency",
"view": "timeSeries",
"region": "${AWS::Region}",
"metrics": [
["${CustomNamespace}", "Latency", "Service", "${ServiceName}"]
],
"period": 60,
"stat": "p99"
}
}
]
}
```
## CloudWatch Logs
### Log Group Configurations
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch log groups configuration
Parameters:
LogRetentionDays:
Type: Number
Default: 30
AllowedValues:
- 1
- 3
- 5
- 7
- 14
- 30
- 60
- 90
- 120
- 150
- 180
- 365
- 400
- 545
- 731
- 1095
- 1827
- 2190
- 2555
- 2922
- 3285
- 3650
Resources:
# Application log group
ApplicationLogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: !Sub "/aws/applications/${Environment}/${ApplicationName}"
RetentionInDays: !Ref LogRetentionDays
KmsKeyId: !Ref LogKmsKeyArn
Tags:
- Key: Environment
Value: !Ref Environment
- Key: Application
Value: !Ref ApplicationName
- Key: Service
Value: !Ref ServiceName
# Lambda log group
LambdaLogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: !Sub "/aws/lambda/${LambdaFunctionName}"
RetentionInDays: !Ref LogRetentionDays
KmsKeyId: !Ref LogKmsKeyArn
# Subscription filter for Log Insights
LogSubscriptionFilter:
Type: AWS::Logs::SubscriptionFilter
Properties:
DestinationArn: !GetAtt LogDestination.Arn
FilterPattern: '[timestamp=*Z, request_id, level, message]'
LogGroupName: !Ref ApplicationLogGroup
RoleArn: !GetAtt LogSubscriptionRole.Arn
# Metric filter for errors
ErrorMetricFilter:
Type: AWS::Logs::MetricFilter
Properties:
FilterPattern: '[level="ERROR", msg]'
LogGroupName: !Ref ApplicationLogGroup
MetricTransformations:
- MetricValue: "1"
MetricNamespace: !Sub "${AWS::StackName}/Application"
MetricName: ErrorCount
- MetricValue: "$level"
MetricNamespace: !Sub "${AWS::StackName}/Application"
MetricName: LogLevel
# Metric filter for warnings
WarningMetricFilter:
Type: AWS::Logs::MetricFilter
Properties:
FilterPattern: '[level="WARN", msg]'
LogGroupName: !Ref ApplicationLogGroup
MetricTransformations:
- MetricValue: "1"
MetricNamespace: !Sub "${AWS::StackName}/Application"
MetricName: WarningCount
# Log group with custom retention
AuditLogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: !Sub "/aws/audit/${Environment}/${ApplicationName}"
RetentionInDays: 365
KmsKeyId: !Ref LogKmsKeyArn
```
### Log Insights Query
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch Logs Insights queries
Resources:
# Query definition for recent errors
RecentErrorsQuery:
Type: AWS::Logs::QueryDefinition
Properties:
Name: !Sub "${AWS::StackName}-recent-errors"
QueryString: |
fields @timestamp, @message
| sort @timestamp desc
| limit 100
| filter @message like /ERROR/
| display @timestamp, @message, @logStream
# Query for performance analysis
PerformanceQuery:
Type: AWS::Logs::QueryDefinition
Properties:
Name: !Sub "${AWS::StackName}-performance"
QueryString: |
fields @timestamp, @message, @duration
| filter @duration > 1000
| sort @duration desc
| limit 50
| display @timestamp, @duration, @message
```
## Synthesized Canaries
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch Synthesized Canaries
Parameters:
CanarySchedule:
Type: String
Default: rate(5 minutes)
Description: Schedule expression for canary
Resources:
# Canary for API endpoint
ApiCanary:
Type: AWS::Synthetics::Canary
Properties:
Name: !Sub "${AWS::StackName}-api-check"
ArtifactS3Location: !Sub "s3://${ArtifactBucket}/canary/${AWS::StackName}"
Code:
S3Bucket: !Ref CanariesCodeBucket
S3Key: canary/api-check.zip
Handler: apiCheck.handler
ExecutionRoleArn: !GetAtt CanaryRole.Arn
RuntimeVersion: syn-python-selenium-1.1
Schedule:
Expression: !Ref CanarySchedule
DurationInSeconds: 120
SuccessRetentionPeriodInDays: 31
FailureRetentionPeriodInDays: 31
Tags:
- Key: Environment
Value: !Ref Environment
- Key: Service
Value: api
# Alarm for canary failure
CanaryFailureAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-canary-failed"
AlarmDescription: Alert when synthesized canary fails
MetricName: Failed
Namespace: AWS/Synthetics
Dimensions:
- Name: CanaryName
Value: !Ref ApiCanary
Statistic: Sum
Period: 60
EvaluationPeriods: 2
Threshold: 1
ComparisonOperator: GreaterThanOrEqualToThreshold
AlarmActions:
- !Ref AlarmTopic
# Alarm for canary latency
CanaryLatencyAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-canary-slow"
AlarmDescription: Alert when canary latency is high
MetricName: Duration
Namespace: AWS/Synthetics
Dimensions:
- Name: CanaryName
Value: !Ref ApiCanary
Statistic: p99
Period: 300
EvaluationPeriods: 3
Threshold: 5000
ComparisonOperator: GreaterThanThreshold
CanaryRole:
Type: AWS::IAM::Role
Properties:
RoleName: !Sub "${AWS::StackName}-canary-role"
AssumeRolePolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Principal:
Service: synthetics.amazonaws.com
Action: sts:AssumeRole
Policies:
- PolicyName: SyntheticsLeastPrivilege
PolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Action:
- synthetics:DescribeCanaries
- synthetics:DescribeCanaryRuns
- synthetics:GetCanary
- synthetics:ListTagsForResource
Resource: "*"
- Effect: Allow
Action:
- synthetics:StartCanary
- synthetics:StopCanary
Resource: !Ref ApiCanary
- Effect: Allow
Action:
- logs:CreateLogGroup
- logs:CreateLogStream
- logs:PutLogEvents
- logs:DescribeLogStreams
Resource: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/cw-syn-canary-*"
- Effect: Allow
Action:
- s3:PutObject
- s3:GetObject
Resource: !Sub "s3://${ArtifactBucket}/canary/${AWS::StackName}/*"
- Effect: Allow
Action:
- kms:Decrypt
Resource: !Ref KmsKeyArn
Condition:
StringEquals:
kms:ViaService: !Sub "s3.${AWS::Region}.amazonaws.com"
```
## CloudWatch Application Signals
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch Application Signals for APM
Resources:
# Service level indicator for availability
AvailabilitySLI:
Type: AWS::CloudWatch::ServiceLevelObjective
Properties:
Name: !Sub "${AWS::StackName}-availability"
Description: Service level objective for availability
Monitor:
MonitorName: !Sub "${AWS::StackName}-monitor"
MonitorType: AWS_SERVICE_LEVEL_INDICATOR
ResourceGroup: !Ref ResourceGroup
SliMetric:
MetricName: Availability
Namespace: !Sub "${AWS::StackName}/Application"
Dimensions:
- Name: Service
Value: !Ref ServiceName
Target:
ComparisonOperator: GREATER_THAN_OR_EQUAL
Threshold: 99.9
Period:
RollingInterval:
Count: 1
TimeUnit: HOUR
Goal:
TargetLevel: 99.9
# Service level indicator for latency
LatencySLI:
Type: AWS::CloudWatch::ServiceLevelIndicator
Properties:
Name: !Sub "${AWS::StackName}-latency-sli"
Monitor:
MonitorName: !Sub "${AWS::StackName}-monitor"
Metric:
MetricName: Latency
Namespace: !Sub "${AWS::StackName}/Application"
Dimensions:
- Name: Service
Value: !Ref ServiceName
OperationName: GetItem
AccountId: !Ref AWS::AccountId
# Monitor for application performance
ApplicationMonitor:
Type: AWS::CloudWatch::ApplicationMonitor
Properties:
MonitorName: !Sub "${AWS::StackName}-app-monitor"
MonitorType: CW_MONITOR
Telemetry:
- Type: APM
Config:
Eps: 100
```
## Conditions and Transform
### Conditions for Environment-Specific Resources
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch with conditional resources
Parameters:
Environment:
Type: String
Default: dev
AllowedValues:
- dev
- staging
- production
Description: Deployment environment
Conditions:
IsProduction: !Equals [!Ref Environment, production]
IsStaging: !Equals [!Ref Environment, staging]
CreateAnomalyDetection: !Or [!Equals [!Ref Environment, staging], !Equals [!Ref Environment, production]]
CreateSLI: !Equals [!Ref Environment, production]
Resources:
# Base alarm for all environments
BaseAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-errors"
MetricName: Errors
Namespace: !Ref CustomNamespace
Dimensions:
- Name: Service
Value: !Ref ServiceName
Statistic: Sum
Period: 60
EvaluationPeriods: 5
Threshold: 10
ComparisonOperator: GreaterThanThreshold
# Alarm with different thresholds for production
ProductionAlarm:
Type: AWS::CloudWatch::Alarm
Condition: IsProduction
Properties:
AlarmName: !Sub "${AWS::StackName}-errors-production"
MetricName: Errors
Namespace: !Ref CustomNamespace
Dimensions:
- Name: Service
Value: !Ref ServiceName
Statistic: Sum
Period: 60
EvaluationPeriods: 3
Threshold: 1
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- !Ref ProductionAlarmTopic
# Anomaly detector only for staging and production
AnomalyDetector:
Type: AWS::CloudWatch::AnomalyDetector
Condition: CreateAnomalyDetection
Properties:
MetricName: RequestCount
Namespace: !Ref CustomNamespace
Dimensions:
- Name: Service
Value: !Ref ServiceName
Statistic: Sum
# SLI only for production
ServiceLevelIndicator:
Type: AWS::CloudWatch::ServiceLevelIndicator
Condition: CreateSLI
Properties:
Name: !Sub "${AWS::StackName}-sli"
Monitor:
MonitorName: !Sub "${AWS::StackName}-monitor"
Metric:
MetricName: Availability
Namespace: !Sub "${AWS::StackName}/Application"
```
### Transform for Code Reuse
```yaml
AWSTemplateFormatVersion: 2010-09-09
Transform: AWS::Serverless-2016-10-31
Description: Using SAM Transform for CloudWatch resources
Globals:
Function:
Timeout: 30
Runtime: python3.11
Environment:
Variables:
LOG_LEVEL: INFO
LoggingConfiguration:
LogGroup:
Name: !Sub "/aws/lambda/${FunctionName}"
RetentionInDays: 30
Resources:
# Lambda function with automatic logging
MonitoredFunction:
Type: AWS::Serverless::Function
Properties:
FunctionName: !Sub "${AWS::StackName}-monitored"
Handler: app.handler
CodeUri: functions/monitored/
Policies:
- PolicyName: LogsLeastPrivilege
PolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Action:
- logs:CreateLogGroup
- logs:CreateLogStream
- logs:PutLogEvents
- logs:DescribeLogStreams
- logs:GetLogEvents
- logs:FilterLogEvents
Resource: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/${AWS::StackName}-*"
- Effect: Allow
Action:
- logs:DescribeLogGroups
Resource: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*"
Events:
Api:
Type: Api
Properties:
Path: /health
Method: get
```
## Best Practices
### Security
- Encrypt log groups with KMS keys
- Use resource-based policies for log access
- Implement cross-account log aggregation with proper IAM
- Configure log retention appropriate for compliance
- Use VPC endpoints for CloudWatch to isolate traffic
- Implement least privilege for IAM roles
### Performance
- Use appropriate metric periods (60s for alarms, 300s for dashboards)
- Implement composite alarms to reduce alarm fatigue
- Use anomaly detection for non-linear patterns
- Configure dashboards with efficient widgets
- Limit retention period for log groups
### Monitoring
- Implement SLI/SLO for service health
- Use multi-region dashboards for global applications
- Configure alarms with proper evaluation periods
- Implement canaries for synthetic monitoring
- Use Application Signals for APM
### Deployment
- Use change sets before deployment
- Test templates with cfn-lint
- Organize stacks by ownership (network, app, data)
- Use nested stacks for modularity
- Implement stack policies for protection
## CloudFormation Best Practices
### Stack Policies
Stack policies protect critical resources from unintentional updates. Use them to prevent modifications to production resources.
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch stack with protection policies
Resources:
CriticalAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${AWS::StackName}-critical"
MetricName: Errors
Namespace: AWS/Lambda
Statistic: Sum
Period: 60
EvaluationPeriods: 5
Threshold: 1
ComparisonOperator: GreaterThanThreshold
Metadata:
AWS::CloudFormation::StackPolicy:
Statement:
- Effect: Deny
Principal: "*"
Action:
- Update:Delete
- Update:Modify
Resource: "*"
- Effect: Allow
Principal: "*"
Action:
- Update:Modify
Resource: "*"
Condition:
StringEquals:
aws:RequestedOperation:
- Describe*
- List*
```
### Termination Protection
Enable termination protection to prevent accidental stack deletion, especially for production monitoring stacks.
**Via Console:**
1. Select the stack
2. Go to Stack actions > Change termination protection
3. Enable termination protection
**Via CLI:**
```bash
aws cloudformation update-termination-protection \
--stack-name my-monitoring-stack \
--enable-termination-protection
```
**Via CloudFormation (Stack Set):**
```yaml
Resources:
MonitoringStack:
Type: AWS::CloudFormation::Stack
Properties:
TemplateURL: !Sub "https://${BucketName}.s3.amazonaws.com/monitoring.yaml"
TerminationProtection: true
```
### Drift Detection
Detect when actual infrastructure differs from the CloudFormation template.
**Detect drift on a single stack:**
```bash
aws cloudformation detect-drift \
--stack-name my-monitoring-stack
```
**Get drift detection status:**
```bash
aws cloudFormation describe-stack-drift-detection-process-status \
--stack-drift-detection-id <detection-id>
```
**Get resources that have drifted:**
```bash
aws cloudformation list-stack-resources \
--stack-name my-monitoring-stack \
--query "StackResourceSummaries[?StackResourceDriftStatus!='IN_SYNC']"
```
**Automation with Lambda:**
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: Automated drift detection scheduler
Resources:
DriftDetectionRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Principal:
Service: lambda.amazonaws.com
Action: sts:AssumeRole
ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
Policies:
- PolicyName: CloudWatchDrift
PolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Action:
- cloudformation:DetectStackDrift
- cloudformation:DescribeStacks
- cloudformation:ListStackResources
Resource: "*"
- Effect: Allow
Action:
- sns:Publish
Resource: !Ref AlertTopic
DriftDetectionFunction:
Type: AWS::Lambda::Function
Properties:
Runtime: python3.11
Handler: drift_detector.handler
Code:
S3Bucket: !Ref CodeBucket
S3Key: functions/drift-detector.zip
Role: !GetAtt DriftDetectionRole.Arn
Environment:
Variables:
SNS_TOPIC_ARN: !Ref AlertTopic
DriftDetectionRule:
Type: AWS::Events::Rule
Properties:
ScheduleExpression: "rate(1 day)"
Targets:
- Id: DriftDetection
Arn: !GetAtt DriftDetectionFunction.Arn
AlertTopic:
Type: AWS::SNS::Topic
Properties:
TopicName: !Sub "${AWS::StackName}-drift-alerts"
```
### Change Sets
Use change sets to preview and review changes before applying them.
**Create change set:**
```bash
aws cloudformation create-change-set \
--stack-name my-monitoring-stack \
--template-body file://updated-template.yaml \
--change-set-name my-changeset \
--capabilities CAPABILITY_IAM
```
**List change sets:**
```bash
aws cloudformation list-change-sets \
--stack-name my-monitoring-stack
```
**Describe change set:**
```bash
aws cloudformation describe-change-set \
--stack-name my-monitoring-stack \
--change-set-name my-changeset
```
**Execute change set:**
```bash
aws cloudformation execute-change-set \
--stack-name my-monitoring-stack \
--change-set-name my-changeset
```
**Pipeline integration:**
```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CI/CD pipeline for CloudWatch stacks
Resources:
Pipeline:
Type: AWS::CodePipeline::Pipeline
Properties:
Name: !Sub "${AWS::StackName}-pipeline"
RoleArn: !GetAtt PipelineRole.Arn
Stages:
- Name: Source
Actions:
- Name: SourceAction
ActionTypeId:
Category: Source
Owner: AWS
Provider: CodeCommit
Version: "1"
Configuration:
RepositoryName: !Ref RepositoryName
BranchName: main
OutputArtifacts:
- Name: SourceOutput
- Name: Validate
Actions:
- Name: ValidateTemplate
ActionTypeId:
Category: Test
Owner: AWS
Provider: CloudFormation
Version: "1"
Configuration:
ActionMode: VALIDATE_ONLY
TemplatePath: SourceOutput::template.yaml
InputArtifacts:
- Name: SourceOutput
- Name: Review
Actions:
- Name: CreateChangeSet
ActionTypeId:
Category: Deploy
Owner: AWS
Provider: CloudFormation
Version: "1"
Configuration:
ActionMode: CHANGE_SET_REPLACE
StackName: !Ref StackName
ChangeSetName: !Sub "${StackName}-changeset"
TemplatePath: SourceOutput::template.yaml
Capabilities: CAPABILITY_IAM,CAPABILITY_NAMED_IAM
InputArtifacts:
- Name: SourceOutput
- Name: Approval
ActionTypeId:
Category: Approval
Owner: AWS
Provider: Manual
Version: "1"
Configuration:
CustomData: Review changes before deployment
- Name: Deploy
Actions:
- Name: ExecuteChangeSet
ActionTypeId:
Category: Deploy
Owner: AWS
Provider: CloudFormation
Version: "1"
Configuration:
ActionMode: CHANGE_SET_EXECUTE
StackName: !Ref StackName
ChangeSetName: !Sub "${StackName}-changeset"
PipelineRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Principal:
Service: codepipeline.amazonaws.com
Action: sts:AssumeRole
Policies:
- PolicyName: PipelinePolicy
PolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Action:
- codecommit:Get*
- codecommit:List*
- codecommit:BatchGet*
Resource: "*"
- Effect: Allow
Action:
- s3:GetObject
- s3:PutObject
Resource: !Sub "arn:aws:s3:::${ArtifactBucket}/*"
- Effect: Allow
Action:
- cloudformation:*
- iam:PassRole
Resource: "*"
- Effect: Allow
Action:
- sns:Publish
Resource: !Ref ApprovalTopic
ApprovalTopic:
Type: AWS::SNS::Topic
Properties:
TopicName: !Sub "${AWS::StackName}-approval"
```
## Related Resources
- [CloudWatch Documentation](https://docs.aws.amazon.com/cloudwatch/)
- [AWS CloudFormation User Guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/)
- [CloudWatch Best Practices](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_best_practices.html)
- [Service Level Indicators](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ServiceLevelIndicators.html)
- [CloudFormation Drift Detection](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-stack-drift.html)
- [CloudFormation Change Sets](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-updating-stacks-changesets.html)
## Constraints and Warnings
### Resource Limits
- **Alarms Limits**: Maximum 5000 CloudWatch alarms per AWS account per region
- **Dashboards Limits**: Maximum 500 dashboards per account per region
- **Metrics Limits**: Each dashboard can contain up to 100 metrics
- **Log Groups Limits**: Maximum number of log groups per account is virtually unlimited but retention has cost implications
### Operational Constraints
- **Metric Resolution**: Some metrics have 1-minute minimum resolution; higher resolution costs more
- **Alarm Evaluation Periods**: Alarms with too few evaluation periods may trigger false positives
- **Dashboard Widgets**: Each dashboard is limited to 500 widgets
- **Cross-Account Metrics**: Cross-account sharing requires explicit resource policies
### Security Constraints
- **Log Data Access**: CloudWatch Logs may contain sensitive information
- **Encryption**: KMS keys required for encrypted log groups incur additional costs
- **Metric Filters**: Metric filters count as separate billable CloudWatch Logs operations
- **Alarm Actions**: SNS topics used for alarm actions must have appropriate permissions
### Cost Considerations
- **Detailed Monitoring**: Enabling detailed monitoring doubles the number of metrics for EC2
- **Metric Storage**: Custom metrics stored in CloudWatch incur costs based on resolution and retention
- **Log Retention**: Longer retention periods significantly increase storage costs
- **Dashboards**: Dashboard widgets don't directly cost but each metric queried does
### Data Constraints
- **Metric Age**: Metrics are retained for 15 months by default; high-resolution metrics have shorter retention
- **Log Ingestion**: Large log volumes can impact ingestion latency and query performance
- **Metric Filters**: Metric filters have limits on number of filters and matching patterns
## Additional Files
For complete details on resources and their properties, consult:
- [REFERENCE.md](references/reference.md) - Detailed reference guide for all CloudFormation resources
- [EXAMPLES.md](references/examples.md) - Complete production-ready examples
This skill provides production-ready AWS CloudFormation patterns for CloudWatch monitoring and observability. It packages reusable templates and conventions for metrics, alarms, dashboards, log groups, anomaly detection, synthesized canaries, and Application Signals. Use it to enforce consistent monitoring across environments and to simplify cross-stack integrations and exports.
The skill supplies CloudFormation constructs and example templates that create CloudWatch metrics, alarms, dashboards, log groups, anomaly detectors, and canaries. It defines parameterized building blocks (Parameters, Mappings, Conditions, Outputs) so templates can be reused across dev, staging, and production. The patterns include SNS actions, SSM-backed parameters, encryption and retention settings, composite alarms, and cross-stack export/import wiring for centralized monitoring.
Can I reuse the templates across multiple AWS accounts?
Yes — use cross-account log subscriptions and export/import patterns. Centralize exports in a monitoring stack and grant required permissions for imports.
How do I limit noise from alarms in non-production?
Parameterize thresholds and enable conditions to disable or relax alarms for dev and staging, and use mapping defaults per environment.