home / skills / basher83 / lunar-claude / ansible-error-handling

ansible-error-handling skill

/plugins/infrastructure/ansible-workflows/skills/ansible-error-handling

This skill helps you implement robust Ansible error handling with block/rescue/always, retry logic, and clear failure messages.

npx playbooks add skill basher83/lunar-claude --skill ansible-error-handling

Review the files below or copy the command above to add this skill to your agents.

Files (2)

SKILL.md

9.1 KB

---
name: ansible-error-handling
description: >
  This skill should be used when implementing error handling in Ansible, using
  block/rescue/always patterns, creating retry logic with until/retries, handling
  expected failures gracefully, or providing clear error messages with assert and fail.
---

# Ansible Error Handling

Patterns for robust error handling in Ansible playbooks and roles.

## Block/Rescue/Always Pattern

Handle errors and perform cleanup:

```yaml
- name: Deploy application
  block:
    - name: Stop application
      ansible.builtin.systemd:
        name: myapp
        state: stopped

    - name: Deploy new version
      ansible.builtin.copy:
        src: myapp-v2.0
        dest: /usr/bin/myapp

    - name: Start application
      ansible.builtin.systemd:
        name: myapp
        state: started

  rescue:
    - name: Rollback to previous version
      ansible.builtin.copy:
        src: myapp-backup
        dest: /usr/bin/myapp

    - name: Start application (rollback)
      ansible.builtin.systemd:
        name: myapp
        state: started

    - name: Report failure
      ansible.builtin.fail:
        msg: "Deployment failed, rolled back to previous version"

  always:
    - name: Cleanup temp files
      ansible.builtin.file:
        path: /tmp/deploy-*
        state: absent
```

### Execution Flow

- **block**: Main tasks execute sequentially
- **rescue**: Runs if ANY task in block fails
- **always**: Runs regardless of success/failure

## Retry with Until

Handle transient failures with retries:

```yaml
- name: Wait for service to be ready
  ansible.builtin.uri:
    url: http://localhost:8080/health
    status_code: 200
  register: health_check
  until: health_check.status == 200
  retries: 30
  delay: 10
  # Total wait: up to 5 minutes (30 * 10s)
```

### With Command Module

```yaml
- name: Wait for cluster to stabilize
  ansible.builtin.command: pvecm status
  register: cluster_status
  until: "'Quorate: Yes' in cluster_status.stdout"
  retries: 12
  delay: 5
  changed_when: false
```

### Retry Parameters

| Parameter | Description |
|-----------|-------------|
| `until` | Condition that must be true to stop retrying |
| `retries` | Maximum number of attempts |
| `delay` | Seconds between attempts |

## Assert for Validation

Validate inputs with clear error messages:

```yaml
- name: Validate required variables
  ansible.builtin.assert:
    that:
      - vm_name is defined
      - vm_name | length > 0
      - vm_memory >= 1024
      - vm_cores >= 1
    fail_msg: |
      Invalid VM configuration:
      - vm_name: {{ vm_name | default('NOT SET') }}
      - vm_memory: {{ vm_memory | default('NOT SET') }} (min: 1024)
      - vm_cores: {{ vm_cores | default('NOT SET') }} (min: 1)
    success_msg: "VM configuration validated"
    quiet: true
```

### Common Assertions

```yaml
# Variable defined and non-empty
- vm_name is defined and vm_name | trim | length > 0

# Numeric range
- vm_memory >= 1024 and vm_memory <= 65536

# Regex match
- vm_name is match('^[a-z0-9-]+$')

# List has items
- vm_networks | length > 0

# Value in allowed list
- vm_ostype in ['l26', 'win10', 'win11']
```

## Fail with Context

Provide actionable error messages:

```yaml
- name: Check prerequisites
  ansible.builtin.command: which docker
  register: docker_check
  changed_when: false
  failed_when: false

- name: Fail if Docker not installed
  ansible.builtin.fail:
    msg: |
      Docker is not installed on {{ inventory_hostname }}.

      To install Docker:
        sudo apt update
        sudo apt install docker.io

      Or use the docker role:
        ansible-playbook playbooks/install-docker.yml
  when: docker_check.rc != 0
```

## Graceful Failure Handling

Allow expected "failures":

```yaml
- name: Try to stop service
  ansible.builtin.systemd:
    name: myservice
    state: stopped
  register: stop_result
  failed_when:
    - stop_result.failed
    - "'not found' not in stop_result.msg"
  # Only fail if error is NOT "service not found"
```

### Multiple Acceptable Conditions

```yaml
- name: Join cluster
  ansible.builtin.command: pvecm add {{ primary_node }}
  register: cluster_join
  failed_when:
    - cluster_join.rc != 0
    - "'already in a cluster' not in cluster_join.stderr"
    - "'cannot join' not in cluster_join.stderr"
  changed_when: cluster_join.rc == 0
```

## Check Before Fail

Separate checking from failing for better control:

```yaml
- name: Check if resource exists
  ansible.builtin.command: check-resource {{ resource_id }}
  register: resource_check
  changed_when: false
  failed_when: false  # Don't fail here

- name: Fail with context if missing
  ansible.builtin.fail:
    msg: |
      Resource {{ resource_id }} not found.
      Command output: {{ resource_check.stderr }}
      Hint: Ensure resource was created first.
  when: resource_check.rc != 0
```

## Error Recovery Pattern

Attempt operation, handle specific errors:

```yaml
- name: Attempt primary approach
  block:
    - name: Connect via primary endpoint
      ansible.builtin.uri:
        url: "https://{{ primary_host }}:8006/api2/json"
        validate_certs: true
      register: primary_result

  rescue:
    - name: Log primary failure
      ansible.builtin.debug:
        msg: "Primary endpoint failed: {{ primary_result.msg | default('unknown error') }}"

    - name: Try fallback endpoint
      ansible.builtin.uri:
        url: "https://{{ fallback_host }}:8006/api2/json"
        validate_certs: false
      register: fallback_result
```

## Delegate Error Handling

Run checks from controller for better error context:

```yaml
- name: Verify API endpoint from controller
  ansible.builtin.uri:
    url: "https://{{ inventory_hostname }}:8006/api2/json/version"
    validate_certs: false
  delegate_to: localhost
  register: api_check
  failed_when: false

- name: Report API status
  ansible.builtin.fail:
    msg: |
      Cannot reach Proxmox API on {{ inventory_hostname }}
      Status: {{ api_check.status | default('connection failed') }}
      Check: Network connectivity, firewall rules, pveproxy service
  when: api_check.status | default(0) != 200
```

## Ignore Errors (Use Sparingly)

```yaml
- name: Remove optional backup
  ansible.builtin.file:
    path: /backup/old-backup.tar.gz
    state: absent
  ignore_errors: true
  register: cleanup_result

- name: Report cleanup status
  ansible.builtin.debug:
    msg: "Cleanup {{ 'successful' if not cleanup_result.failed else 'skipped' }}"
```

### When ignore_errors is Acceptable

- Non-critical cleanup tasks
- Optional operations that shouldn't block playbook
- When the result is immediately checked anyway

### Prefer failed_when

```yaml
# BETTER than ignore_errors
- name: Remove backup
  ansible.builtin.file:
    path: /backup/old-backup.tar.gz
    state: absent
  register: cleanup_result
  failed_when:
    - cleanup_result.failed
    - "'does not exist' not in cleanup_result.msg | default('')"
```

## Complete Example

```yaml
---
- name: Deploy with comprehensive error handling
  hosts: app_servers
  become: true

  tasks:
    - name: Validate configuration
      ansible.builtin.assert:
        that:
          - app_version is defined
          - app_version is match('^\d+\.\d+\.\d+$')
        fail_msg: "Invalid app_version: {{ app_version | default('NOT SET') }}"

    - name: Deploy application
      block:
        - name: Download release
          ansible.builtin.get_url:
            url: "https://releases.example.com/{{ app_version }}.tar.gz"
            dest: /tmp/app.tar.gz
          register: download
          until: download is succeeded
          retries: 3
          delay: 5

        - name: Stop current version
          ansible.builtin.systemd:
            name: myapp
            state: stopped

        - name: Extract release
          ansible.builtin.unarchive:
            src: /tmp/app.tar.gz
            dest: /opt/myapp
            remote_src: true

        - name: Start new version
          ansible.builtin.systemd:
            name: myapp
            state: started

        - name: Verify health
          ansible.builtin.uri:
            url: http://localhost:8080/health
          register: health
          until: health.status == 200
          retries: 6
          delay: 10

      rescue:
        - name: Restore previous version
          ansible.builtin.copy:
            src: /opt/myapp-backup/
            dest: /opt/myapp/
            remote_src: true

        - name: Start previous version
          ansible.builtin.systemd:
            name: myapp
            state: started

        - name: Report deployment failure
          ansible.builtin.fail:
            msg: |
              Deployment of {{ app_version }} failed.
              Previous version restored.
              Check logs: journalctl -u myapp

      always:
        - name: Cleanup download
          ansible.builtin.file:
            path: /tmp/app.tar.gz
            state: absent
```

## Additional Resources

For detailed error handling patterns and techniques, consult:

- **`references/error-handling.md`** - Comprehensive error handling patterns, block/rescue/always examples, retry strategies

## Related Skills

- **ansible-idempotency** - changed_when/failed_when patterns
- **ansible-fundamentals** - Core Ansible concepts

Overview

This skill provides practical patterns for robust error handling in Ansible playbooks and roles. It covers block/rescue/always flows, retry logic with until/retries, validation with assert, contextual fail messages, and techniques for graceful or delegated failures. Use these patterns to make deployments predictable, debuggable, and resilient to transient issues.

How this skill works

The skill inspects common failure scenarios and shows concrete task-level patterns to handle them. It demonstrates block/rescue/always for transactional steps, until/retries for transient checks, assert/fail for explicit validations, and failed_when/changed_when for fine-grained control. Examples include rollback on deploy failure, retrying health checks, delegating checks to the controller, and handling expected error messages.

When to use it

When a multi-step operation must roll back on error (use block/rescue/always).
When a service or API may be temporarily unavailable (use until/retries/delay).
When input variables must be validated before proceeding (use assert).
When you need actionable failure messages for operators (use fail with context).
When certain failures are acceptable and should not abort the play (use failed_when or ignore_errors sparingly).

Best practices

Prefer block/rescue/always for grouped operations and guaranteed cleanup steps.
Use until/retries with reasonable retries and delay to handle transient conditions instead of brittle waits.
Validate required variables early with assert and provide clear fail_msg with hints.
Avoid ignore_errors unless the task is truly non-critical; prefer failed_when to express acceptable errors.
Delegate checks to the controller when it yields clearer context or faster failure detection.

Example use cases

Deploying an application with download, stop, extract, start steps and automatic rollback on error.
Waiting for a web service or cluster to become healthy using uri/command with until and retries.
Validating VM or resource input parameters before provisioning using assert with detailed failure messages.
Checking prerequisites (like Docker) from the controller and failing with installation instructions.
Attempting a primary API endpoint and falling back to a secondary host in a rescue block.

FAQ

When should I use failed_when vs ignore_errors?

Use failed_when to explicitly define acceptable failure conditions and keep control of task outcomes. Use ignore_errors only for truly optional cleanup tasks that must not block the playbook.

How many retries and delay should I pick for until loops?

Choose values based on expected recovery time: short delays with more retries for intermittent issues, longer delays with fewer retries for slower systems. Start with retries: 6 and delay: 10s for service readiness, and tune from there.