
prove-it skill

/codex/skills/prove-it

This skill evaluates absolute certainty claims by running them through a gauntlet of tests, then refining them with explicit boundaries that reflect realistic limits.

npx playbooks add skill tkersey/dotfiles --skill prove-it

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (5.8 KB)
---
name: prove-it
description: Gauntlet for absolute claims (always/never/guaranteed/optimal); pressure-test, then refine with explicit boundaries. Use when users ask to prove or disprove strong certainty claims, request devil's-advocate challenge rounds, or want the $prove-it gauntlet to run in default autoloop/full-auto style.
---

# Prove It

## When to use
- The user asserts certainty: "always", "never", "guaranteed", "optimal", "cannot fail", "no downside", "100%".
- The user asks for a devil's advocate or proof.
- The claim feels too clean for the domain.

## Round cadence (mandatory)
- Definition: one "turn" means one assistant reply.
- Default: autoloop (no approvals). Run exactly one gauntlet round per assistant turn, publish results, then continue on the next turn until Oracle synthesis.
- In default mode, after each round, publish:
  - Round Ledger
  - Knowledge Delta
- If confidence remains low after Oracle synthesis, continue with additional rounds (11+) and publish an updated Oracle synthesis.
- Do not ask for permission to continue. In default mode, do not wait for "next" between rounds. Pause only when you must ask the user a question or the user says "stop".
- Step mode (explicit): if the user asks to "pause" / "step" / "one round at a time", run one round then wait for "next".
- Full auto mode (explicit): if the user asks for "full auto" / "fast mode", run rounds 1-10 (ending with the round-10 Oracle synthesis) in one assistant turn while still reporting each round in order.

## Mode invocation
| Mode | Default? | How to invoke | Cadence |
|------|----------|---------------|---------|
| Autoloop | yes | (no phrase) | 1 round/turn; auto-continue until Oracle |
| Step mode | no | "step mode" / "pause each round" / "pause" / "step" / "one round at a time" | 1 round/turn; wait for "next" |
| Full auto | no | "full auto" / "fast mode" | rounds 1-10 (Oracle in round 10) in one turn; publish Round Ledger + Knowledge Delta after each round |

## Quick start
1. Restate the claim and its scope.
2. Default to autoloop. If the user explicitly requests "step mode" or "full auto", use that instead.
3. Run round 1 and publish the Round Ledger + Knowledge Delta.
4. Continue automatically with one round per turn until round 10 (Oracle synthesis).
5. If confidence remains low, run additional rounds (11+) and publish an updated Oracle synthesis.

## Ten-round gauntlet
1. Counterexamples: smallest concrete break.
2. Logic traps: missing quantifiers/premises.
3. Boundary cases: zero/one/max/empty/extreme scale.
4. Adversarial inputs: worst-case distributions/abuse.
5. Alternative paradigms: different model flips the conclusion.
6. Operational constraints: latency/cost/compliance/availability.
7. Probabilistic uncertainty: variance, tail risk, sampling bias.
8. Comparative baselines: "better than what?", on which metric?
9. Meta-test: fastest disproof experiment.
10. Oracle synthesis: tightest surviving claim with boundaries. If confidence is still low, repeat rounds 1-9 as needed, then re-run Oracle synthesis.
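
A compact illustration of how the rounds probe one claim. The claim, feature, and findings are hypothetical, chosen only to show the shape of each round:

```
Claim: "Enabling feature X always increases retention."
1. Counterexample: the power-user cohort showed flat retention after rollout.
2. Logic trap: "always" is unquantified over segments, time windows, and metrics.
3. Boundary: brand-new users have no baseline retention to increase.
4. Adversarial: bot signups inflate day-1 retention without real engagement.
5. Alternative paradigm: under a revenue objective, the retention gain may not matter.
6. Operational: the feature doubles notification volume, risking policy caps.
7. Uncertainty: effect measured in one quarter; seasonality unaccounted for.
8. Baseline: better than no feature, or better than the cheaper variant?
9. Meta-test: holdout experiment on the weakest segment for two cycles.
10. Oracle: "X improves 30-day retention for active users under current notification policy."
```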

## Round self-prompt bank (pick exactly 1)
Internal self-prompts for selecting round focus. Do not ask the user unless blocked.
- Counterexamples: What is the smallest input that breaks this?
- Logic traps: What unstated assumption must hold?
- Boundary cases: Which boundary is most likely in real use?
- Adversarial: What does worst-case input look like?
- Alternative paradigm: What objective makes the opposite true?
- Operational: Which dependency/policy is a hard stop?
- Uncertainty: What distribution shift flips the result?
- Baseline: Better than what, on which metric?
- Meta-test: What experiment would change your mind fastest?
- Oracle: What explicit boundaries keep this honest?

## Core artifacts

### Argument map
```
Claim:
Premises:
- P1:
- P2:
Hidden assumptions:
- A1:
Weak links:
- W1:
Disproof tests:
- T1:
Refined claim:
```
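
Here is the template filled in for the same hypothetical retention claim (every entry is illustrative):

```
Claim: Enabling feature X always increases retention.
Premises:
- P1: An A/B test showed a retention lift.
- P2: The tested cohort represents all users.
Hidden assumptions:
- A1: The lift persists beyond the test window.
Weak links:
- W1: P2 is untested for new and dormant users.
Disproof tests:
- T1: Re-run the experiment on a dormant-user cohort.
Refined claim: X increases 30-day retention for active users in markets like the test cohort.
```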

### Round Ledger (update every round)
```
Round: <1-10 (or 11+)>
Focus:
Claim scope:
New evidence:
New counterexample:
Remaining gaps:
Next round:
```

### Knowledge Delta (publish every round)
```
- New:
- Updated:
- Invalidated:
```
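
Continuing the hypothetical example, a round-1 ledger and its matching delta might look like this:

```
Round: 1
Focus: Counterexamples
Claim scope: all users, all markets, indefinite horizon
New evidence: none
New counterexample: power-user cohort showed no retention change
Remaining gaps: dormant users untested; horizon undefined
Next round: logic traps (quantify "always")
```

```
- New: counterexample in the power-user cohort
- Updated: claim scope likely limited to casual users
- Invalidated: "all users" coverage
```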

### Claim boundary table
```
| Boundary type | Valid when | Invalid when | Assumptions | Stressors |
|---------------|-----------|--------------|-------------|-----------|
| Scale         |           |              |             |           |
| Data quality  |           |              |             |           |
| Environment   |           |              |             |           |
| Adversary     |           |              |             |           |
```
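
Two illustrative rows for the same hypothetical claim:

```
| Boundary type | Valid when        | Invalid when        | Assumptions       | Stressors         |
|---------------|-------------------|---------------------|-------------------|-------------------|
| Scale         | < 1M DAU          | viral growth spikes | stable infra      | 10x traffic       |
| Data quality  | verified accounts | bot-heavy cohorts   | spam filter works | signup fraud wave |
```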

### Next-tests plan
```
| Test | Data needed | Success threshold | Stop condition |
|------|-------------|-------------------|----------------|
```
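
An illustrative row (test, threshold, and stop condition are hypothetical):

```
| Test | Data needed | Success threshold | Stop condition |
|------|-------------|-------------------|----------------|
| Holdout on dormant-user cohort | 2 cycles of events | +2% 30-day retention | CI excludes 0 either way |
```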

## Domain packs

### Performance
Use when the claim is about speed, latency, throughput, or resources.
- Clarify: median vs tail latency vs throughput.
- Identify workload shape (spiky vs steady) and bottleneck resource.

### Product
Use when the claim is about user impact, adoption, or behavior.
- Clarify user segment and success metric.
- State the baseline/counterfactual.
- Name the likely unintended behavior/tradeoff.

## Oracle synthesis template (round 10 / as needed)
```
Original claim:
Refined claim:
Boundaries:
- Valid when:
- Invalid when:
Confidence trail:
- Evidence:
- Gaps:
Next tests:
- ...
```
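
A filled-in synthesis for the running hypothetical:

```
Original claim: Enabling feature X always increases retention.
Refined claim: X increases 30-day retention for active users in tested markets.
Boundaries:
- Valid when: active users, verified accounts, current notification policy
- Invalid when: dormant users, bot-heavy cohorts, notification caps imposed
Confidence trail:
- Evidence: A/B lift in two markets; no effect in power-user cohort
- Gaps: seasonality, long-horizon persistence
Next tests:
- Holdout on dormant users for two cycles
```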

## Deliverable format (per turn)
- Round number + focus.
- Round Ledger + Knowledge Delta.
- At most one question for the user (only when blocked).
- In default autoloop, run one round in that turn and continue to the next round in the next turn.
- In step mode, run one round and wait for "next".
- In full auto (or "fast mode"), run rounds 1-10 (Oracle synthesis in round 10) in one turn, repeating the items above for each round.

## Activation cues
- "always" / "never" / "guaranteed" / "optimal" / "cannot fail" / "no downside" / "100%"
- "prove it" / "devil's advocate" / "stress test" / "rigor"

Overview

This skill is a gauntlet that pressure-tests absolute claims (always, never, guaranteed, optimal) and refines them into honest, bounded statements. It runs iterative rounds that surface counterexamples, hidden assumptions, and operational limits, then synthesizes a tightened claim with explicit boundaries. Use it to avoid overconfidence and produce testable, defensible conclusions.

How this skill works

On each round the assistant picks a focused self-prompt (e.g., counterexample, boundary case, adversarial input) and updates an Argument Map, a Round Ledger, and a Knowledge Delta. Rounds proceed in a defined cadence: autoloop (default) runs one round per assistant turn until Oracle synthesis; step mode pauses between rounds; full auto runs all ten rounds, ending with the round-10 Oracle synthesis, in one turn. The final Oracle synthesis produces a refined claim, valid/invalid conditions, a confidence trail, and next tests.

When to use it

  • User asserts certainty words like "always", "never", "guaranteed", "optimal", "100%".
  • User asks for "prove it", "devil's advocate", "stress test", or "rigor".
  • A claim feels unusually clean or lacks explicit scope or assumptions.
  • You need a reproducible test plan or boundaries before deploying a decision.
  • Preparing a defensible statement for audits, compliance, or high-stakes choices.

Best practices

  • Start by restating the exact claim and its scope before running rounds.
  • Default to autoloop unless the user requests step mode or full auto.
  • Publish Round Ledger and Knowledge Delta after every round for traceability.
  • Prefer concrete, smallest counterexamples and measurable tests.
  • Synthesize a refined claim with explicit "valid when" and "invalid when" boundaries.

Example use cases

  • A product manager claims a feature "will always increase retention" and needs stress-testing and boundary conditions.
  • An engineer says a model "cannot fail" in production; run adversarial and operational rounds to find failure modes.
  • A strategist asserts a tactic is "optimal"; run comparative baselines and alternative-paradigm rounds to challenge it.
  • A researcher wants a fast disproof experiment to decide whether to pursue an idea.
  • A compliance reviewer needs explicit assumptions and tests to approve a high-risk deployment.

FAQ

How many rounds will you run?

Default autoloop runs one round per assistant turn through the ten-round gauntlet, ending with Oracle synthesis in round 10; if confidence stays low, additional rounds (11+) run until an updated Oracle synthesis reaches sufficient confidence.

Can I pause between rounds?

Yes. Invoke step mode by asking for "step mode", "pause", or "one round at a time"; the skill then runs a single round and waits for your "next" before proceeding.