robust-first skill

This skill helps you design resilient systems that survive errors through local repair, graceful degradation, and continuous operation.

npx playbooks add skill simhacker/moollm --skill robust-first

---
name: robust-first
description: "Survive first. Be correct later."
license: MIT
tier: 1
allowed-tools:
  - read_file
  - write_file
related: [moollm, self-repair, postel, coherence-engine, honest-forget]
tags: [moollm, survival, resilience, graceful, degradation]
---

# Robust-First

> *"Survive first. Be correct later."*

---

## What Is It?

**Robust-First** is Dave Ackley's principle: systems should prioritize **survivability** over **correctness**.

A system that crashes when confused is useless. A system that limps along incorrectly but keeps running can be repaired.

---

## The Philosophy

Traditional computing:
```
IF error THEN crash
"Fail fast and loud"
```

Robust-first computing:
```
IF error THEN repair_locally AND continue
"Stay alive and heal"
```

---

## Core Principles

### 1. Never Crash

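```python
import logging

DEFAULT_FIELD = "unknown"  # hypothetical fallback value

# WRONG: a missing field stops everything
def read_field_fragile(record: dict) -> str:
    if "field" not in record:
        raise KeyError("Field required!")
    return record["field"]

# RIGHT: warn, fall back to a reasonable default, keep running
def read_field_robust(record: dict) -> str:
    if "field" not in record:
        logging.warning("Field missing, using default")
        return DEFAULT_FIELD
    return record["field"]
```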

### 2. Local Repair

Don't wait for global consistency. Fix what you can, where you are:

```yaml
# Found inconsistency
character.location: room-A
room-A.occupants: [not including character]

# Local repair
room-A.occupants.append(character)
log: "Repaired: added character to room-A occupants"
```
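
The repair above could be sketched in Python roughly as follows. This is only an illustration: it assumes world state lives in plain dictionaries, and the `characters` and `rooms` structures and the `repair_occupancy` helper are hypothetical names, not MOOLLM APIs. It also assumes the character's own location field is the source of truth.

```python
import logging

logger = logging.getLogger("robust_first")


def repair_occupancy(characters: dict, rooms: dict) -> list[str]:
    """Make each room's occupant list agree with character locations.

    Treats the character's location field as the source of truth
    (an assumption of this sketch) and returns the repairs made.
    """
    repairs = []
    for name, character in characters.items():
        room_id = character.get("location")
        room = rooms.get(room_id)
        if room is None:
            # Cannot be fixed locally; leave it for escalation.
            continue
        occupants = room.setdefault("occupants", [])
        if name not in occupants:
            occupants.append(name)
            message = f"Repaired: added {name} to {room_id} occupants"
            logger.warning(message)
            repairs.append(message)
    return repairs


characters = {"alice": {"location": "room-A"}}
rooms = {"room-A": {"occupants": []}}
repairs = repair_occupancy(characters, rooms)
```

Running the repair a second time finds nothing to fix, so it is safe to call repeatedly.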

### 3. Degrade Gracefully

When resources are limited, do less but keep working:

```yaml
# Full capability
- semantic search
- syntax highlighting  
- auto-complete
- error detection

# Degraded (low memory)
- basic search only
- plain text
- manual completion
- errors on save only
```
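
One way to switch between the two modes is a simple threshold check. The sketch below is an assumption-laden illustration: the 256 MB cutoff and the `select_features` helper are hypothetical, and a real system would probably degrade feature by feature rather than all at once.

```python
FULL_FEATURES = ["semantic search", "syntax highlighting", "auto-complete", "error detection"]
DEGRADED_FEATURES = ["basic search", "plain text", "manual completion", "errors on save"]

LOW_MEMORY_THRESHOLD_MB = 256  # hypothetical cutoff, chosen only for illustration


def select_features(available_memory_mb: float) -> list[str]:
    """Return the feature set to run with the memory currently available."""
    if available_memory_mb < LOW_MEMORY_THRESHOLD_MB:
        return DEGRADED_FEATURES
    return FULL_FEATURES


low_memory_mode = select_features(128)
normal_mode = select_features(1024)
```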

### 4. Redundancy

Important state exists in multiple places:

```yaml
# Character location recorded in:
- character.yml → location field
- room/ROOM.yml → occupants list
- session-log.md → movement events

# If one is corrupted, recover from others
```
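
A minimal sketch of recovering from the other copies, assuming each source has already been read into a location string, or `None` when it is missing or corrupted. The `recover_location` helper and the majority-vote rule are assumptions of this example, not something the skill prescribes.

```python
from collections import Counter
from typing import Optional


def recover_location(readings: dict[str, Optional[str]]) -> Optional[str]:
    """Pick the location most sources agree on, ignoring unreadable ones.

    `readings` maps a source name to the location it reports, or None
    when that source is missing or corrupted.
    """
    votes = Counter(value for value in readings.values() if value is not None)
    if not votes:
        return None  # nothing to recover from; escalate instead
    location, _count = votes.most_common(1)[0]
    return location


readings = {
    "character.yml": None,  # corrupted
    "room/ROOM.yml": "room-A",
    "session-log.md": "room-A",
}
recovered = recover_location(readings)
```

When sources disagree evenly, `Counter.most_common` returns the value encountered first, so the order of `readings` acts as a tie-breaker in this sketch.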

---

## The Movable Feast Machine

Dave Ackley's [Movable Feast Machine](https://www.cs.unm.edu/~ackley/mfm/) (MFM):

- Computation spread across unreliable substrate
- No global clock, no central control
- Elements repair themselves and neighbors
- Errors are **normal**, not exceptional

MOOLLM inherits this:
- Files can be corrupted — repair from redundancy
- Schemas can drift — reconcile gracefully
- Context can overflow — summarize and continue
- Tools can fail — retry or work around
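
The last point, retry or work around, might look like the following sketch. The `call_with_fallback` helper, its retry count, and its delay are hypothetical choices for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Retry `primary` a few times, then work around it with `fallback`."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            # Wait between attempts, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    return fallback()
```

The fallback is allowed to raise; this sketch only guarantees that a failing primary tool does not end the operation by itself.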

---

## Anti-Fragility

Beyond robust — **anti-fragile**:

| Fragile | Robust | Anti-Fragile |
|---------|--------|--------------|
| Breaks under stress | Survives stress | Gets stronger from stress |
| Crash on error | Handle error | Learn from error |
| Rigid schema | Flexible schema | Schema evolves from errors |

When something goes wrong, **capture the lesson**:

```yaml
# Error occurred
repair_log:
  - issue: "Character teleported without movement event"
    repair: "Added movement event retroactively"
    lesson: "Always log movements before updating location"
    
# Next time: system knows to check this
```
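
In code, a repair log of this shape could be kept with a pair of small dataclasses. This is a sketch under assumptions: `RepairEntry` and `RepairLog` are hypothetical names, and how lessons feed back into later checks is left open.

```python
from dataclasses import dataclass, field


@dataclass
class RepairEntry:
    issue: str
    repair: str
    lesson: str


@dataclass
class RepairLog:
    entries: list[RepairEntry] = field(default_factory=list)

    def record(self, issue: str, repair: str, lesson: str) -> None:
        """Append one repair, keeping the lesson alongside the fix."""
        self.entries.append(RepairEntry(issue, repair, lesson))

    def lessons(self) -> list[str]:
        """Distinct lessons learned so far, in the order first recorded."""
        return list(dict.fromkeys(entry.lesson for entry in self.entries))


log = RepairLog()
log.record(
    issue="Character teleported without movement event",
    repair="Added movement event retroactively",
    lesson="Always log movements before updating location",
)
```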

---

## MOOLLM Application

### Self-Repair Demon

A background process that:
1. Scans for inconsistencies
2. Attempts local repairs
3. Logs what it fixed
4. Escalates what it couldn't

See: [self-repair/](../self-repair/)
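
The four steps can be sketched as a single scan pass. The `Check` tuple and `run_repair_pass` function are hypothetical names for this illustration; the actual self-repair skill may be organized differently.

```python
from typing import Callable, Iterable, NamedTuple


class Check(NamedTuple):
    name: str
    is_consistent: Callable[[], bool]
    repair: Callable[[], None]


def run_repair_pass(checks: Iterable[Check]) -> tuple[list[str], list[str]]:
    """Run one scan and return (fixed, escalated) check names."""
    fixed, escalated = [], []
    for check in checks:
        try:
            if check.is_consistent():
                continue
            check.repair()
            # Re-check so a repair that did not work gets escalated.
            if check.is_consistent():
                fixed.append(check.name)
            else:
                escalated.append(check.name)
        except Exception:
            # A failing check must not stop the pass itself.
            escalated.append(check.name)
    return fixed, escalated


state = {"occupants": []}
checks = [
    Check(
        "alice in room-A",
        lambda: "alice" in state["occupants"],
        lambda: state["occupants"].append("alice"),
    ),
    Check("unfixable", lambda: False, lambda: None),
]
fixed, escalated = run_repair_pass(checks)
```

A background process would call `run_repair_pass` on a schedule, log the `fixed` names, and hand the `escalated` names to a human or a higher-level process.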

### POSTEL for Errors

When encountering malformed input:
1. Try to parse anyway
2. Infer missing parts
3. Flag assumptions
4. Continue working

See: [postel/](../postel/)
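
A lenient parser following these steps might look like the sketch below. The record fields (`name`, `level`), their defaults, and the `parse_character` helper are hypothetical and exist only to show the pattern of returning the parsed result together with the assumptions made.

```python
def parse_character(raw: dict) -> tuple[dict, list[str]]:
    """Parse a character record leniently, returning (record, assumptions)."""
    assumptions = []
    record = {}

    # Accept a common capitalization variant before giving up on the name.
    name = raw.get("name") or raw.get("Name")
    if not name:
        name = "unnamed"
        assumptions.append("name missing; used 'unnamed'")
    record["name"] = str(name).strip()

    level = raw.get("level", 1)
    try:
        record["level"] = int(level)
    except (TypeError, ValueError):
        record["level"] = 1
        assumptions.append(f"level {level!r} not a number; used 1")
    if "level" not in raw:
        assumptions.append("level missing; used 1")

    return record, assumptions


record, assumptions = parse_character({"Name": " Alice ", "level": "3"})
```

Returning the assumptions rather than only logging them lets the caller decide which ones need review.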

### Never Delete

Instead of deleting, archive:
- Corrupted files → `.archive/corrupted/`
- Old versions → `.archive/versions/`
- Failed attempts → `.archive/failed/`

Because nothing is destroyed, recovery remains possible.
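
A minimal sketch of archiving in place of deletion, using the `.archive/` layout listed above. The `archive_instead_of_delete` helper and the timestamp prefix on archived names are assumptions of this example.

```python
import shutil
import time
from pathlib import Path


def archive_instead_of_delete(path: Path, reason: str, root: Path) -> Path:
    """Move `path` into `root/.archive/<reason>/` and return its new location."""
    archive_dir = root / ".archive" / reason
    archive_dir.mkdir(parents=True, exist_ok=True)
    # The timestamp prefix keeps earlier archived copies from being overwritten.
    target = archive_dir / f"{time.time_ns()}-{path.name}"
    shutil.move(str(path), str(target))
    return target
```

For example, `archive_instead_of_delete(Path("room/ROOM.yml"), "corrupted", Path("."))` would move the file under `.archive/corrupted/` rather than removing it.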

---

## Example: Corrupted Room

```
# room/ROOM.yml has parse error

Traditional response:
  "Error: Invalid YAML at line 42"
  [System halts]

Robust-first response:
  "Warning: ROOM.yml has syntax error"
  "Attempting recovery..."
  "- Loaded last known good state from git"
  "- Merged recent changes from session-log.md"
  "- Flagged line 42 for manual review"
  [System continues with recovered state]
```
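
The fallback part of this response can be sketched as follows. Two substitutions keep the example self-contained and are assumptions, not the skill's method: JSON stands in for YAML because the Python standard library has no YAML parser, and a plain backup file stands in for the last known good state from git. The `load_room` helper is hypothetical.

```python
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger("robust_first")


def load_room(current: Path, last_known_good: Path) -> Optional[dict]:
    """Load room state, falling back to a last-known-good copy on errors."""
    for source in (current, last_known_good):
        try:
            return json.loads(source.read_text(encoding="utf-8"))
        except (OSError, ValueError) as error:
            # json.JSONDecodeError is a ValueError, so parse errors land here.
            logger.warning("Could not load %s (%s); trying next source", source, error)
    return None  # nothing recoverable; the caller should escalate
```

Merging recent changes from the session log and flagging the bad line for review would follow the same pattern: attempt the step, log what happened, and continue with the best state available.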

---

## Dovetails With

### Sister Skills
- [self-repair/](../self-repair/) — Checklist-based healing
- [postel/](../postel/) — Charitable interpretation

### Kernel
- [kernel/self-healing-protocol.md](../../kernel/self-healing-protocol.md) — Full specification

---

## Protocol Symbols

```
ROBUST-FIRST   — Survive over correct
NEVER-CRASH    — Always keep running
REPAIR-DEMON   — Background fixer
BEST-EFFORT    — Do what you can
NEVER-DELETE   — Archive, don't destroy
```

See: [PROTOCOLS.yml](../../PROTOCOLS.yml#ROBUST-FIRST)

Overview

This skill codifies the Robust-First philosophy: prioritize survivability and continuity over immediate correctness. It teaches systems to avoid crashing, perform local repairs, degrade gracefully, and preserve redundant state so operations continue under fault. The goal is a system that stays alive, logs repairs, and learns from errors.

How this skill works

The skill inspects runtime state, input parsing, and storage integrity for inconsistencies and failures. It applies local repair strategies: infer missing values, merge redundant sources, recover from archives, and fall back to degraded modes when resources are constrained. All repair actions are logged and significant failures are escalated for later analysis or human review.

When to use it

  • Distributed systems running on unreliable substrates where crashes are costly
  • Applications that must remain available despite corrupted files or malformed input
  • Data pipelines where schema drift and partial failures are common
  • Systems that should learn from failures and improve over time
  • Environments with intermittent resources (memory, CPU, network)

Best practices

  • Log every local repair with context and the corrective action taken
  • Keep redundant representations of critical state to enable recovery
  • Prefer graceful degradation over hard failure when resources are limited
  • Archive corrupted or old artifacts instead of deleting them
  • Flag assumptions made during parsing or repair so they can be reviewed

Example use cases

  • A game server that repairs inconsistent world state (e.g., missing occupant entries) and continues serving players
  • A document processor that parses malformed input, infers missing fields, flags assumptions, and continues to index content
  • A background self-repair daemon that scans file trees, restores from last-known-good snapshots, and records lessons for future prevention
  • A microservice that drops nonessential features under memory pressure but keeps core request handling alive
  • A logging system that archives corrupted logs to a recoverable location instead of discarding them

FAQ

Will repairing locally cause inconsistent global state?

It can. A local repair restores usable state and records what was changed; any global inconsistency it leaves unresolved is flagged and escalated for reconciliation rather than causing a crash.

How do you avoid masking bugs by always repairing?

Every repair is logged with the root cause and assumptions. Escalation paths and periodic audits ensure systematic issues are surfaced and fixed rather than silently hidden.