home / skills / itsmostafa / aws-agent-skills / ecs

ecs skill

/skills/ecs

This skill helps you deploy and manage containerized applications on AWS ECS, including clusters, task definitions, services, and troubleshooting.

npx playbooks add skill itsmostafa/aws-agent-skills --skill ecs

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
9.2 KB
---
name: ecs
description: AWS ECS container orchestration for running Docker containers. Use when deploying containerized applications, configuring task definitions, setting up services, managing clusters, or troubleshooting container issues.
last_updated: "2026-01-07"
doc_source: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/
---

# AWS ECS

Amazon Elastic Container Service (ECS) is a fully managed container orchestration service. Run containers on AWS Fargate (serverless) or EC2 instances.

## Table of Contents

- [Core Concepts](#core-concepts)
- [Common Patterns](#common-patterns)
- [CLI Reference](#cli-reference)
- [Best Practices](#best-practices)
- [Troubleshooting](#troubleshooting)
- [References](#references)

## Core Concepts

### Cluster

Logical grouping of tasks or services. Can contain Fargate tasks, EC2 instances, or both.

### Task Definition

Blueprint for your application. Defines containers, resources, networking, and IAM roles.

### Task

Running instance of a task definition. Can run standalone or as part of a service.

### Service

Maintains desired count of tasks. Handles deployments, load balancing, and auto scaling.

### Launch Types

| Type | Description | Use Case |
|------|-------------|----------|
| **Fargate** | Serverless, pay per task | Most workloads |
| **EC2** | Self-managed instances | GPU, Windows, specific requirements |

## Common Patterns

### Create a Fargate Cluster

**AWS CLI:**

```bash
# Create cluster
aws ecs create-cluster --cluster-name my-cluster

# With capacity providers
aws ecs create-cluster \
  --cluster-name my-cluster \
  --capacity-providers FARGATE FARGATE_SPOT \
  --default-capacity-provider-strategy \
    capacityProvider=FARGATE,weight=1 \
    capacityProvider=FARGATE_SPOT,weight=1
```

### Register Task Definition

```bash
cat > task-definition.json << 'EOF'
{
  "family": "web-app",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/ecsTaskRole",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {"name": "NODE_ENV", "value": "production"}
      ],
      "secrets": [
        {
          "name": "DB_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:db-password"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/web-app",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}
EOF

aws ecs register-task-definition --cli-input-json file://task-definition.json
```

### Create Service with Load Balancer

```bash
aws ecs create-service \
  --cluster my-cluster \
  --service-name web-service \
  --task-definition web-app:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={
    subnets=[subnet-12345678,subnet-87654321],
    securityGroups=[sg-12345678],
    assignPublicIp=DISABLED
  }" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web-tg/1234567890123456,containerName=web,containerPort=8080" \
  --health-check-grace-period-seconds 60
```

### Run Standalone Task

```bash
aws ecs run-task \
  --cluster my-cluster \
  --task-definition my-batch-job:1 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={
    subnets=[subnet-12345678],
    securityGroups=[sg-12345678],
    assignPublicIp=ENABLED
  }"
```

### Update Service (Deploy New Image)

```bash
# Register new task definition with updated image
aws ecs register-task-definition --cli-input-json file://task-definition.json

# Update service to use new version
aws ecs update-service \
  --cluster my-cluster \
  --service web-service \
  --task-definition web-app:2 \
  --force-new-deployment
```

### Auto Scaling

```bash
# Register scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/my-cluster/web-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 10

# Target tracking policy
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/my-cluster/web-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 120
  }'
```

## CLI Reference

### Cluster Management

| Command | Description |
|---------|-------------|
| `aws ecs create-cluster` | Create cluster |
| `aws ecs describe-clusters` | Get cluster details |
| `aws ecs list-clusters` | List clusters |
| `aws ecs delete-cluster` | Delete cluster |

### Task Definitions

| Command | Description |
|---------|-------------|
| `aws ecs register-task-definition` | Create task definition |
| `aws ecs describe-task-definition` | Get task definition |
| `aws ecs list-task-definitions` | List task definitions |
| `aws ecs deregister-task-definition` | Deregister version |

### Services

| Command | Description |
|---------|-------------|
| `aws ecs create-service` | Create service |
| `aws ecs update-service` | Update service |
| `aws ecs describe-services` | Get service details |
| `aws ecs delete-service` | Delete service |

### Tasks

| Command | Description |
|---------|-------------|
| `aws ecs run-task` | Run standalone task |
| `aws ecs stop-task` | Stop running task |
| `aws ecs describe-tasks` | Get task details |
| `aws ecs list-tasks` | List tasks |

## Best Practices

### Security

- **Use task roles** for AWS API access (not access keys)
- **Use execution roles** for ECR/Secrets access
- **Store secrets in Secrets Manager** or Parameter Store
- **Use private subnets** with NAT gateway
- **Enable CloudTrail** for API auditing

### Performance

- **Right-size CPU/memory** — monitor and adjust
- **Use Fargate Spot** for fault-tolerant workloads (70% savings)
- **Enable container insights** for monitoring
- **Use service discovery** for internal communication

### Reliability

- **Deploy across multiple AZs**
- **Configure health checks** properly
- **Set appropriate deregistration delay**
- **Use circuit breaker** for deployments

```bash
aws ecs update-service \
  --cluster my-cluster \
  --service web-service \
  --deployment-configuration '{
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    }
  }'
```

### Cost Optimization

- **Use Fargate Spot** for batch workloads
- **Right-size task resources**
- **Scale to zero** when not needed
- **Use capacity providers** for mixed Fargate/Spot

## Troubleshooting

### Task Fails to Start

**Check:**

```bash
# View stopped tasks
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks $(aws ecs list-tasks --cluster my-cluster --desired-status STOPPED --query 'taskArns[0]' --output text)
```

**Common causes:**
- Image not found (ECR permissions)
- Secrets access denied
- Network configuration (subnets, security groups)
- Resource limits exceeded

### Container Keeps Restarting

**Debug:**

```bash
# Check CloudWatch logs
aws logs get-log-events \
  --log-group-name /ecs/web-app \
  --log-stream-name "ecs/web/abc123"

# Check task details
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks task-arn \
  --query 'tasks[0].containers[0].{reason:reason,exitCode:exitCode}'
```

**Causes:**
- Health check failing
- Application crashing
- Out of memory

### Service Stuck Deploying

```bash
# Check deployment status
aws ecs describe-services \
  --cluster my-cluster \
  --services web-service \
  --query 'services[0].deployments'

# Check events
aws ecs describe-services \
  --cluster my-cluster \
  --services web-service \
  --query 'services[0].events[:5]'
```

**Causes:**
- Health check failing on new tasks
- Not enough capacity
- Target group health checks failing

### Cannot Pull Image from ECR

**Check execution role has:**

```json
{
  "Effect": "Allow",
  "Action": [
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage"
  ],
  "Resource": "*"
}
```

**Also check:**
- VPC endpoint for ECR (if private subnet)
- NAT gateway (if private subnet)
- Security group allows HTTPS outbound

## References

- [ECS Developer Guide](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/)
- [ECS API Reference](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/)
- [ECS CLI Reference](https://docs.aws.amazon.com/cli/latest/reference/ecs/)
- [boto3 ECS](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ecs.html)

Overview

This skill provides practical AWS ECS operations for deploying and managing Docker containers on Fargate or EC2. It guides you through creating clusters, registering task definitions, running tasks, and configuring services with load balancers and autoscaling. The content focuses on commands, patterns, and troubleshooting steps you can apply immediately.

How this skill works

The skill inspects and automates common ECS workflows: cluster creation, task definition registration, service creation and updates, standalone task runs, and autoscaling configuration. It includes concrete AWS CLI examples and diagnostic commands to inspect stopped tasks, logs, deployments, and service events. Use the patterns to deploy applications, update images, and resolve common failures quickly.

When to use it

  • Deploy containerized applications to AWS using ECS (Fargate or EC2).
  • Define and register task definitions with containers, environment variables, secrets, and logging.
  • Create or update services with load balancers and health checks.
  • Run batch or one-off jobs as standalone tasks.
  • Troubleshoot tasks that fail to start, containers that keep restarting, or services stuck deploying.

Best practices

  • Use IAM task roles for application permissions and execution roles for ECR and secrets access.
  • Store sensitive values in Secrets Manager or Parameter Store; avoid embedding secrets in images.
  • Right-size CPU and memory per task and enable container insights for monitoring.
  • Deploy across multiple Availability Zones and configure health checks and deregistration delays.
  • Use capacity providers and Fargate Spot for cost optimization and scalable capacity.

Example use cases

  • Create a Fargate cluster and register a web-app task definition with awslogs logging and health checks.
  • Deploy a two-task web service behind an ALB target group and enable a deployment circuit breaker for safer rollbacks.
  • Run a batch job as a standalone Fargate task with a private subnet and assignPublicIp as needed.
  • Update a service by registering a new task definition with an updated image and force a new deployment.
  • Configure target-tracking autoscaling for an ECS service based on average CPU utilization.

FAQ

What causes a task to stop immediately after starting?

Common causes include missing image permissions, execution role not allowing ECR access, secrets access denied, wrong network configuration, or resource limits exceeded. Check stopped task details and CloudWatch logs for exact failure reasons.

How do I update a service to use a new container image?

Register a new task definition revision with the updated image, then call update-service with the new taskDefinition and --force-new-deployment to trigger a rolling deployment.