cron-retry skill

This skill detects cron jobs that failed because of network errors and retries them on the next heartbeat after connectivity returns.

npx playbooks add skill openclaw/skills --skill cron-retry

---
name: cron-retry
description: Auto-retry failed cron jobs on connection recovery. Use when cron jobs fail due to network errors and should be retried when connectivity is restored. Integrates with heartbeat to detect failed jobs and re-run them automatically.
---

# Cron Retry Skill

Automatically detect and retry cron jobs that failed due to network/connection errors.

## Quick Start (Heartbeat Integration)

Add this to your `HEARTBEAT.md`:

```markdown
## Cron Recovery Check
Check for cron jobs with lastStatus: "error". If the error matches network patterns (connection error, sendMessage failed, fetch failed, ETIMEDOUT, ECONNREFUSED), retry the job using cron tool with action: "run" and the job ID. Report what was recovered.
```

That's it. On each heartbeat, failed network jobs get retried automatically.

## How It Works

1. On heartbeat, check all cron jobs via `cron list`
2. Filter for jobs where `lastStatus = "error"` and `enabled = true`
3. Check if `lastError` matches network-related patterns
4. Re-run eligible jobs via `cron run`
5. Report results

## Network Error Patterns (Retryable)

These errors indicate transient network issues worth retrying:

- `Network request.*failed`
- `Connection error`
- `ECONNREFUSED`
- `ETIMEDOUT`
- `ENOTFOUND`
- `sendMessage.*failed`
- `fetch failed`
- `socket hang up`
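
The patterns above read as regular expressions, so one way to apply them is to combine them into a single compiled pattern. The following is a minimal sketch in Python, not the skill's actual matcher; the function name `is_network_error` and the case-insensitive matching are assumptions made for illustration.

```python
import re

# Patterns copied from the list above.
NETWORK_ERROR_PATTERNS = [
    r"Network request.*failed",
    r"Connection error",
    r"ECONNREFUSED",
    r"ETIMEDOUT",
    r"ENOTFOUND",
    r"sendMessage.*failed",
    r"fetch failed",
    r"socket hang up",
]

# Assumption: match case-insensitively and anywhere in the error text.
_NETWORK_ERROR_RE = re.compile(
    "|".join(f"(?:{p})" for p in NETWORK_ERROR_PATTERNS),
    re.IGNORECASE,
)


def is_network_error(last_error):
    """Return True if the error text matches any retryable network pattern."""
    if not last_error:
        return False
    return _NETWORK_ERROR_RE.search(last_error) is not None
```

Under these assumptions, an error such as `Network request for 'sendMessage' failed!` matches, while an auth error such as `401 Unauthorized` does not.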

## What Gets Retried vs Skipped

**Retried:**
- Network timeouts
- Connection refused
- Message send failures
- DNS lookup failures

**Skipped (not retried):**
- Logic errors (bad config, missing data)
- Auth failures
- Disabled jobs
- Jobs that just ran successfully

## Manual Recovery Check

To check and retry failed jobs manually:

```bash
# List all jobs and their status
clawdbot cron list

# Find failed jobs
clawdbot cron list | jq '.jobs[] | select(.state.lastStatus == "error") | {name, error: .state.lastError}'

# Retry a specific job
clawdbot cron run --id <JOB_ID>
```

## Agent Implementation

When implementing the heartbeat check:

```
1. Call cron tool with action: "list"
2. For each job in response.jobs:
   - Skip if job.enabled !== true
   - Skip if job.state.lastStatus !== "error"
   - Check if job.state.lastError matches network patterns
   - If retryable: call cron tool with action: "run", jobId: job.id
3. Report: "Recovered X jobs" or "No failed jobs to recover"
```
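
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the agent's real tool interface: `list_jobs` and `run_job` are hypothetical callables standing in for the cron tool's `list` and `run` actions, and the job shape (`id`, `name`, `enabled`, `state.lastStatus`, `state.lastError`) is inferred from the examples in this document. The sketch counts a job as recovered once its re-run has been requested; it does not inspect the result of that run.

```python
import re

# Same patterns as the "Network Error Patterns" list; case-insensitive
# matching is an assumption.
NETWORK_ERROR_RE = re.compile(
    r"Network request.*failed|Connection error|ECONNREFUSED|ETIMEDOUT|"
    r"ENOTFOUND|sendMessage.*failed|fetch failed|socket hang up",
    re.IGNORECASE,
)


def recover_failed_jobs(list_jobs, run_job):
    """Re-run enabled jobs whose last error looks network-related.

    list_jobs: hypothetical callable returning an iterable of job dicts.
    run_job:   hypothetical callable taking a job ID and triggering a run.
    Returns the report string described in step 3.
    """
    recovered = []
    for job in list_jobs():
        state = job.get("state") or {}
        if job.get("enabled") is not True:
            continue  # respect disabled jobs
        if state.get("lastStatus") != "error":
            continue  # nothing to recover
        if not NETWORK_ERROR_RE.search(state.get("lastError") or ""):
            continue  # logic or auth error: leave for an operator
        run_job(job["id"])
        recovered.append(job.get("name") or job["id"])

    if not recovered:
        return "No failed jobs to recover"
    noun = "job" if len(recovered) == 1 else "jobs"
    return f"Recovered {len(recovered)} {noun}: {', '.join(recovered)}"
```

Passing the two actions in as callables keeps the filtering logic testable without a live cron tool.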

## Example Scenario

1. **7:00 PM** — Evening briefing cron fires
2. **Network hiccup** — Telegram send fails
3. **Job marked** `lastStatus: "error"`, `lastError: "Network request for 'sendMessage' failed!"`
4. **7:15 PM** — Connection restored, heartbeat runs
5. **Skill detects** the failed job, sees it's a network error
6. **Retries** the job → briefing delivered
7. **Reports**: "Recovered 1 job: evening-wrap-briefing"

## Safety

- Only retries transient network errors
- Respects job enabled state
- Won't create retry loops (checks lastRunAtMs)
- Reports all recovery attempts
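
The document does not spell out how `lastRunAtMs` is checked. One plausible reading is a cooldown: skip the retry if the job ran too recently. The sketch below assumes `lastRunAtMs` is an epoch timestamp in milliseconds stored under `job.state`; the function name `outside_cooldown` and the 10-minute default are hypothetical.

```python
import time

# Hypothetical default: do not retry a job that ran in the last 10 minutes.
DEFAULT_COOLDOWN_MS = 10 * 60 * 1000


def outside_cooldown(job, now_ms=None, cooldown_ms=DEFAULT_COOLDOWN_MS):
    """Return True if enough time has passed since the job's last run.

    Assumes job["state"]["lastRunAtMs"] holds epoch milliseconds.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    last_run_ms = (job.get("state") or {}).get("lastRunAtMs")
    if last_run_ms is None:
        return True  # no recorded run, so there is nothing to throttle against
    return now_ms - last_run_ms >= cooldown_ms
```

With this guard, a job that fails again immediately after a retry is not re-run until the cooldown has elapsed.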

Overview

This skill automatically detects cron jobs that failed due to transient network or connection errors and retries them when connectivity is restored. It integrates with a heartbeat process and the cron tool to re-run eligible jobs and report recovery outcomes. The goal is to reduce missed tasks caused by temporary network issues while avoiding unsafe retries.

How this skill works

On each heartbeat the skill lists all cron jobs, filters for enabled jobs whose lastStatus is "error", and checks the lastError text against known network-related patterns. If an error matches retryable patterns (timeouts, connection refused, DNS failures, sendMessage/fetch failures), the skill calls the cron run action for that job ID. It logs and reports how many jobs were retried and which ones were recovered.

When to use it

- Your cron jobs intermittently fail due to network outages or API connectivity issues.
- You want automatic recovery after transient errors without manual intervention.
- You run a heartbeat service that can run periodic checks and trigger cron tool actions.
- You need to ensure deliveries (notifications, reports) are not permanently lost due to short network blips.

Best practices

- Only match and retry errors that clearly indicate transient network problems (`ETIMEDOUT`, `ECONNREFUSED`, connection errors, fetch/sendMessage failures).
- Respect the job's enabled state and its `lastRunAtMs` timestamp to avoid creating retry loops or re-running intentionally disabled jobs.
- Limit retry frequency and add deduplication or backoff if a job repeatedly fails for the same reason.
- Log each recovery attempt and the resulting status so the heartbeat report provides clear auditability.
- Skip retries for auth failures, config issues, or application logic errors; surface these to an operator instead.
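
The backoff suggestion above can be sketched as a small in-memory tracker. This is one possible approach, not part of the skill itself: the class name `RetryTracker`, its defaults, and the choice of exponential backoff with an attempt cap are all assumptions for illustration.

```python
class RetryTracker:
    """Track retry attempts per job and apply exponential backoff.

    Hypothetical helper: state lives in memory only, so it resets
    whenever the process restarts.
    """

    def __init__(self, base_delay_ms=60_000, max_delay_ms=3_600_000, max_attempts=5):
        self.base_delay_ms = base_delay_ms
        self.max_delay_ms = max_delay_ms
        self.max_attempts = max_attempts
        self._attempts = {}  # job_id -> (attempt_count, last_attempt_ms)

    def should_retry(self, job_id, now_ms):
        count, last_ms = self._attempts.get(job_id, (0, None))
        if count >= self.max_attempts:
            return False  # give up and leave the job for an operator
        if last_ms is None:
            return True  # first retry for this job
        # The delay doubles after each attempt, capped at max_delay_ms.
        delay_ms = min(self.base_delay_ms * 2 ** (count - 1), self.max_delay_ms)
        return now_ms - last_ms >= delay_ms

    def record_attempt(self, job_id, now_ms):
        count, _ = self._attempts.get(job_id, (0, None))
        self._attempts[job_id] = (count + 1, now_ms)

    def reset(self, job_id):
        """Forget a job's history, for example after it succeeds."""
        self._attempts.pop(job_id, None)
```

A heartbeat check would call `should_retry` before re-running a job, `record_attempt` after triggering the run, and `reset` once the job reports success.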

Example use cases

- A messaging cron that posts daily briefings fails during a short Telegram outage and should be retried once connectivity returns.
- A nightly backup job that failed to upload due to a temporary S3 network error is automatically rerun on the next heartbeat.
- A monitoring alert sender that experienced DNS resolution errors gets its failed alert retried after the DNS issue clears.
- A data fetch cron that reported `fetch failed` because of upstream API timeouts is retried when the upstream is reachable again.

FAQ

How does the skill avoid retry loops?

It skips disabled jobs and checks each job's last-run timestamp (`lastRunAtMs`) before retrying. For jobs that keep failing for the same reason, add backoff or deduplication so they are not retried on every heartbeat.

Which errors will not be retried?

Auth failures, bad configuration, missing data, and other application logic errors are skipped and should be handled manually.