
prove-it skill

/codex/skills/prove-it

This skill evaluates absolute certainty claims by running them through a gauntlet of tests, then refining them with explicit boundaries that reflect realistic limits.

npx playbooks add skill tkersey/dotfiles --skill prove-it

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (5.8 KB)
---
name: prove-it
description: Gauntlet for absolute claims (always/never/guaranteed/optimal); pressure-test, then refine with explicit boundaries. Use when users ask to prove or disprove strong certainty claims, request devil's-advocate challenge rounds, or want the $prove-it gauntlet to run in default autoloop/full-auto style.
---

# Prove It

## When to use
- The user asserts certainty: "always", "never", "guaranteed", "optimal", "cannot fail", "no downside", "100%".
- The user asks for a devil's advocate or proof.
- The claim feels too clean for the domain.

## Round cadence (mandatory)
- Definition: one "turn" means one assistant reply.
- Default: autoloop (no approvals). Run exactly one gauntlet round per assistant turn, publish results, then continue on the next turn until Oracle synthesis.
- In default mode, after each round, publish:
  - Round Ledger
  - Knowledge Delta
- If confidence remains low after Oracle synthesis, continue with additional rounds (11+) and publish an updated Oracle synthesis.
- Do not ask for permission to continue. In default mode, do not wait for "next" between rounds. Pause only when you must ask the user a question or the user says "stop".
- Step mode (explicit): if the user asks to "pause" / "step" / "one round at a time", run one round then wait for "next".
- Full auto mode (explicit): if the user asks for "full auto" / "fast mode", run rounds 1-10 (ending with the round-10 Oracle synthesis) in one assistant turn while still reporting each round in order.

## Mode invocation
| Mode | Default? | How to invoke | Cadence |
|------|----------|---------------|---------|
| Autoloop | yes | (no phrase) | 1 round/turn; auto-continue until Oracle |
| Step mode | no | "step mode" / "pause each round" / "pause" / "step" / "one round at a time" | 1 round/turn; wait for "next" |
| Full auto | no | "full auto" / "fast mode" | rounds 1-10 (Oracle in round 10) in one turn; publish Round Ledger + Knowledge Delta after each round |

## Quick start
1. Restate the claim and its scope.
2. Default to autoloop. If the user explicitly requests "step mode" or "full auto", use that instead.
3. Run round 1 and publish the Round Ledger + Knowledge Delta.
4. Continue automatically with one round per turn until round 10 (Oracle synthesis).
5. If confidence remains low, run additional rounds (11+) and publish an updated Oracle synthesis.

## Ten-round gauntlet
1. Counterexamples: smallest concrete break.
2. Logic traps: missing quantifiers/premises.
3. Boundary cases: zero/one/max/empty/extreme scale.
4. Adversarial inputs: worst-case distributions/abuse.
5. Alternative paradigms: different model flips the conclusion.
6. Operational constraints: latency/cost/compliance/availability.
7. Probabilistic uncertainty: variance, tail risk, sampling bias.
8. Comparative baselines: "better than what?", on which metric?
9. Meta-test: fastest disproof experiment.
10. Oracle synthesis: tightest surviving claim with boundaries. If confidence is still low, repeat rounds 1-9 as needed, then re-run Oracle synthesis.
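
A compact illustration of how the rounds probe one claim. The claim, feature, and findings are hypothetical, chosen only to show the shape of each round:

```
Claim: "Enabling feature X always increases retention."
1. Counterexample: the power-user cohort showed flat retention after rollout.
2. Logic trap: "always" is unquantified over segments, time windows, and metrics.
3. Boundary: brand-new users have no baseline retention to increase.
4. Adversarial: bot signups inflate day-1 retention without real engagement.
5. Alternative paradigm: under a revenue objective, the retention gain may not matter.
6. Operational: the feature doubles notification volume, risking policy caps.
7. Uncertainty: effect measured in one quarter; seasonality unaccounted for.
8. Baseline: better than no feature, or better than the cheaper variant?
9. Meta-test: holdout experiment on the weakest segment for two cycles.
10. Oracle: "X improves 30-day retention for active users under current notification policy."
```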

## Round self-prompt bank (pick exactly 1)
Internal self-prompts for selecting round focus. Do not ask the user unless blocked.
- Counterexamples: What is the smallest input that breaks this?
- Logic traps: What unstated assumption must hold?
- Boundary cases: Which boundary is most likely in real use?
- Adversarial: What does worst-case input look like?
- Alternative paradigm: What objective makes the opposite true?
- Operational: Which dependency/policy is a hard stop?
- Uncertainty: What distribution shift flips the result?
- Baseline: Better than what, on which metric?
- Meta-test: What experiment would change your mind fastest?
- Oracle: What explicit boundaries keep this honest?

## Core artifacts

### Argument map
```
Claim:
Premises:
- P1:
- P2:
Hidden assumptions:
- A1:
Weak links:
- W1:
Disproof tests:
- T1:
Refined claim:
```
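
Here is the template filled in for the same hypothetical retention claim (every entry is illustrative):

```
Claim: Enabling feature X always increases retention.
Premises:
- P1: An A/B test showed a retention lift.
- P2: The tested cohort represents all users.
Hidden assumptions:
- A1: The lift persists beyond the test window.
Weak links:
- W1: P2 is untested for new and dormant users.
Disproof tests:
- T1: Re-run the experiment on a dormant-user cohort.
Refined claim: X increases 30-day retention for active users in markets like the test cohort.
```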

### Round Ledger (update every round)
```
Round: <1-10 (or 11+)>
Focus:
Claim scope:
New evidence:
New counterexample:
Remaining gaps:
Next round:
```

### Knowledge Delta (publish every round)
```
- New:
- Updated:
- Invalidated:
```
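
Continuing the hypothetical example, a round-1 ledger and its matching delta might look like this:

```
Round: 1
Focus: Counterexamples
Claim scope: all users, all markets, indefinite horizon
New evidence: none
New counterexample: power-user cohort showed no retention change
Remaining gaps: dormant users untested; horizon undefined
Next round: logic traps (quantify "always")
```

```
- New: counterexample in the power-user cohort
- Updated: claim scope likely limited to casual users
- Invalidated: "all users" coverage
```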

### Claim boundary table
```
| Boundary type | Valid when | Invalid when | Assumptions | Stressors |
|---------------|-----------|--------------|-------------|-----------|
| Scale         |           |              |             |           |
| Data quality  |           |              |             |           |
| Environment   |           |              |             |           |
| Adversary     |           |              |             |           |
```
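
Two illustrative rows for the same hypothetical claim:

```
| Boundary type | Valid when        | Invalid when        | Assumptions       | Stressors         |
|---------------|-------------------|---------------------|-------------------|-------------------|
| Scale         | < 1M DAU          | viral growth spikes | stable infra      | 10x traffic       |
| Data quality  | verified accounts | bot-heavy cohorts   | spam filter works | signup fraud wave |
```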

### Next-tests plan
```
| Test | Data needed | Success threshold | Stop condition |
|------|-------------|-------------------|----------------|
```
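
An illustrative row (test, threshold, and stop condition are hypothetical):

```
| Test | Data needed | Success threshold | Stop condition |
|------|-------------|-------------------|----------------|
| Holdout on dormant-user cohort | 2 cycles of events | +2% 30-day retention | CI excludes 0 either way |
```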

## Domain packs

### Performance
Use when the claim is about speed, latency, throughput, or resources.
- Clarify: median vs tail latency vs throughput.
- Identify workload shape (spiky vs steady) and bottleneck resource.

### Product
Use when the claim is about user impact, adoption, or behavior.
- Clarify user segment and success metric.
- State the baseline/counterfactual.
- Name the likely unintended behavior/tradeoff.

## Oracle synthesis template (round 10 / as needed)
```
Original claim:
Refined claim:
Boundaries:
- Valid when:
- Invalid when:
Confidence trail:
- Evidence:
- Gaps:
Next tests:
- ...
```
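
A filled-in synthesis for the running hypothetical:

```
Original claim: Enabling feature X always increases retention.
Refined claim: X increases 30-day retention for active users in tested markets.
Boundaries:
- Valid when: active users, verified accounts, current notification policy
- Invalid when: dormant users, bot-heavy cohorts, notification caps imposed
Confidence trail:
- Evidence: A/B lift in two markets; no effect in power-user cohort
- Gaps: seasonality, long-horizon persistence
Next tests:
- Holdout on dormant users for two cycles
```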

## Deliverable format (per turn)
- Round number + focus.
- Round Ledger + Knowledge Delta.
- At most one question for the user (only when blocked).
- In default autoloop, run one round in that turn and continue to the next round in the next turn.
- In step mode, run one round and wait for "next".
- In full auto (or "fast mode"), run rounds 1-10 (Oracle synthesis in round 10) in one turn, repeating the items above for each round.

## Activation cues
- "always" / "never" / "guaranteed" / "optimal" / "cannot fail" / "no downside" / "100%"
- "prove it" / "devil's advocate" / "stress test" / "rigor"

Overview

This skill is a gauntlet that pressure-tests absolute claims (always, never, guaranteed, optimal) and refines them into honest, bounded statements. It runs iterative rounds that surface counterexamples, hidden assumptions, and operational limits, then synthesizes a tightened claim with explicit boundaries. Use it to avoid overconfidence and produce testable, defensible conclusions.

How this skill works

On each round the assistant picks a focused self-prompt (e.g., counterexample, boundary case, adversarial input) and updates an Argument Map, a Round Ledger, and a Knowledge Delta. Rounds proceed in a defined cadence: autoloop (default) runs one round per assistant turn until Oracle synthesis; step mode pauses between rounds; full auto runs all ten rounds, ending with the round-10 Oracle synthesis, in one turn. The final Oracle synthesis produces a refined claim, valid/invalid conditions, a confidence trail, and next tests.

When to use it

  • User asserts certainty words like "always", "never", "guaranteed", "optimal", "100%".
  • User asks for "prove it", "devil's advocate", "stress test", or "rigor".
  • A claim feels unusually clean or lacks explicit scope or assumptions.
  • You need a reproducible test plan or boundaries before deploying a decision.
  • Preparing a defensible statement for audits, compliance, or high-stakes choices.

Best practices

  • Start by restating the exact claim and its scope before running rounds.
  • Default to autoloop unless the user requests step mode or full auto.
  • Publish Round Ledger and Knowledge Delta after every round for traceability.
  • Prefer concrete, smallest counterexamples and measurable tests.
  • Synthesize a refined claim with explicit "valid when" and "invalid when" boundaries.

Example use cases

  • A product manager claims a feature "will always increase retention" and needs stress-testing and boundary conditions.
  • An engineer says a model "cannot fail" in production; run adversarial and operational rounds to find failure modes.
  • A strategist asserts a tactic is "optimal"; run comparative baselines and alternative-paradigm rounds to challenge it.
  • A researcher wants a fast disproof experiment to decide whether to pursue an idea.
  • A compliance reviewer needs explicit assumptions and tests to approve a high-risk deployment.

FAQ

How many rounds will you run?

Default autoloop runs one round per assistant turn through the ten-round gauntlet, ending with Oracle synthesis in round 10; if confidence stays low, additional rounds (11+) run until an updated Oracle synthesis reaches sufficient confidence.

Can I pause between rounds?

Yes. Invoke step mode by asking for "step mode", "pause", or "one round at a time"; the skill then runs a single round and waits for your "next" before proceeding.