
This skill generates verifiable task completion artifacts, including screenshots, test results, and confidence scores, to streamline auto-approval decisions.

npx playbooks add skill madappgang/claude-code --skill proof-of-work

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (5.6 KB)
---
name: proof-of-work
description: Proof artifact generation patterns for task validation. Covers screenshots, test results, deployments, and confidence scoring.
version: 0.1.0
tags: [proof, validation, screenshots, tests, deployment]
keywords: [proof, artifact, screenshot, test, deployment, confidence, validation]
plugin: autopilot
updated: 2026-01-20
---

# Proof-of-Work

**Version:** 0.1.0
**Purpose:** Generate validation artifacts for autonomous task completion
**Status:** Phase 1

## When to Use

Use this skill when you need to:
- Generate proof artifacts after task completion
- Capture screenshots for UI verification
- Parse and report test results
- Calculate confidence scores for task validation
- Determine if a task can be auto-approved

## Overview

Proof-of-work is the mechanism that validates task completion. Every finished task must include verifiable artifacts that demonstrate the work was done correctly.

## Proof Types by Task

### Bug Fix Proof

| Artifact | Required | Purpose |
|----------|----------|---------|
| Git diff | Yes | Show minimal, focused changes |
| Test results | Yes | All tests passing |
| Regression test | Yes | Specific test for the bug |
| Error log (before/after) | Optional | Visual evidence |

### Feature Proof

| Artifact | Required | Purpose |
|----------|----------|---------|
| Screenshots | Yes | Visual verification |
| Test results | Yes | Functionality works |
| Coverage report | Yes | >= 80% coverage |
| Build output | Yes | Builds successfully |
| Deployment URL | Optional | Live demo |

### UI Change Proof

| Artifact | Required | Purpose |
|----------|----------|---------|
| Desktop screenshot | Yes | 1920x1080 view |
| Mobile screenshot | Yes | 375x667 view |
| Tablet screenshot | Yes | 768x1024 view |
| Accessibility score | Yes | >= 80 Lighthouse |
| Visual regression | Optional | BackstopJS diff |
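
If a validator needs to check these requirements programmatically, the three tables above can be encoded as plain data. A minimal sketch; the `TaskType` and `ArtifactRequirement` names are illustrative assumptions, not part of the skill:

```typescript
type TaskType = 'bug-fix' | 'feature' | 'ui-change';

interface ArtifactRequirement {
  artifact: string;
  required: boolean;
}

// Encodes the tables above so a validator can check which artifacts
// must be present before scoring confidence.
const PROOF_REQUIREMENTS: Record<TaskType, ArtifactRequirement[]> = {
  'bug-fix': [
    { artifact: 'git-diff', required: true },
    { artifact: 'test-results', required: true },
    { artifact: 'regression-test', required: true },
    { artifact: 'error-log-before-after', required: false },
  ],
  'feature': [
    { artifact: 'screenshots', required: true },
    { artifact: 'test-results', required: true },
    { artifact: 'coverage-report', required: true },
    { artifact: 'build-output', required: true },
    { artifact: 'deployment-url', required: false },
  ],
  'ui-change': [
    { artifact: 'desktop-screenshot', required: true },
    { artifact: 'mobile-screenshot', required: true },
    { artifact: 'tablet-screenshot', required: true },
    { artifact: 'accessibility-score', required: true },
    { artifact: 'visual-regression', required: false },
  ],
};
```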

## Screenshot Capture

**Playwright Pattern:**

```typescript
import { chromium } from 'playwright';

// Viewports matching the UI change proof requirements above.
const VIEWPORTS = [
  { name: 'desktop', width: 1920, height: 1080 },
  { name: 'mobile', width: 375, height: 667 },
  { name: 'tablet', width: 768, height: 1024 },
];

async function captureScreenshots(url: string, outputDir: string) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  // Capture a full-page screenshot at each viewport.
  for (const { name, width, height } of VIEWPORTS) {
    await page.setViewportSize({ width, height });
    await page.goto(url);
    await page.waitForLoadState('networkidle');
    await page.screenshot({
      path: `${outputDir}/${name}.png`,
      fullPage: true,
    });
  }

  await browser.close();
}
```
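
A usage sketch; the URL and output directory below are illustrative assumptions (the app must already be reachable at that URL):

```typescript
// Produces proof/desktop.png, proof/mobile.png, and proof/tablet.png.
await captureScreenshots('http://localhost:3000', './proof');
```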

## Confidence Scoring

**Algorithm:**

```typescript
interface ProofArtifacts {
  testResults?: { passed: number; total: number };
  buildSuccessful?: boolean;
  lintErrors?: number;
  screenshots?: string[];
  testCoverage?: number;
  performanceScore?: number;
}

function calculateConfidence(artifacts: ProofArtifacts): number {
  let score = 0;

  // Tests (40 points)
  if (artifacts.testResults) {
    if (artifacts.testResults.passed === artifacts.testResults.total) {
      score += 40;
    }
  }

  // Build (20 points)
  if (artifacts.buildSuccessful) {
    score += 20;
  }

  // Coverage (20 points)
  if (artifacts.testCoverage) {
    if (artifacts.testCoverage >= 80) score += 20;
    else if (artifacts.testCoverage >= 60) score += 15;
    else if (artifacts.testCoverage >= 40) score += 10;
    else score += 5;
  }

  // Screenshots (10 points)
  if (artifacts.screenshots) {
    if (artifacts.screenshots.length >= 3) score += 10;
    else if (artifacts.screenshots.length >= 1) score += 5;
  }

  // Lint (10 points)
  if (artifacts.lintErrors === 0) {
    score += 10;
  }

  return score;
}
```

## Confidence Thresholds

| Confidence | Action |
|------------|--------|
| >= 95% | Auto-approve (In Review -> Done) |
| 80-94% | Manual review required |
| < 80% | Validation failed, iterate |
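
These thresholds can be applied directly to the score returned by `calculateConfidence`. A minimal sketch; the `decideAction` helper and `ValidationAction` type are illustrative, not part of the skill:

```typescript
type ValidationAction = 'auto-approve' | 'manual-review' | 'iterate';

// Maps a confidence score (0-100) to the action in the table above.
function decideAction(confidence: number): ValidationAction {
  if (confidence >= 95) return 'auto-approve';  // In Review -> Done
  if (confidence >= 80) return 'manual-review'; // Human verification
  return 'iterate';                             // Validation failed
}
```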

## Proof Summary Template

```markdown
# Proof of Work

**Task**: {issue_id}
**Type**: {task_type}
**Confidence**: {score}%

## Test Results
- Total: {total}
- Passed: {passed}
- Failed: {failed}
- Coverage: {coverage}%

## Build
- Status: {status}
- Duration: {duration}

## Screenshots
- Desktop: proof/desktop.png
- Mobile: proof/mobile.png
- Tablet: proof/tablet.png

## Artifacts
- test-results.txt
- coverage.json
- build-output.txt
```
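
A sketch of filling the template from collected values; the `renderProofSummary` helper and its input shape are hypothetical and only illustrate the substitution:

```typescript
// Hypothetical input shape; field names mirror the template placeholders.
interface ProofSummaryInput {
  issueId: string;
  taskType: string;
  confidence: number;
  tests: { total: number; passed: number; failed: number; coverage: number };
}

// Fills the markdown template above; the build and screenshot sections
// follow the same substitution pattern.
function renderProofSummary(p: ProofSummaryInput): string {
  return [
    '# Proof of Work',
    '',
    `**Task**: ${p.issueId}`,
    `**Type**: ${p.taskType}`,
    `**Confidence**: ${p.confidence}%`,
    '',
    '## Test Results',
    `- Total: ${p.tests.total}`,
    `- Passed: ${p.tests.passed}`,
    `- Failed: ${p.tests.failed}`,
    `- Coverage: ${p.tests.coverage}%`,
  ].join('\n');
}
```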

## Examples

### Example 1: Feature Proof Generation

```typescript
const proof = {
  testResults: { passed: 15, total: 15 },
  buildSuccessful: true,
  lintErrors: 0,
  screenshots: ['desktop.png', 'mobile.png', 'tablet.png'],
  testCoverage: 85,
};

const confidence = calculateConfidence(proof);
// 40 (tests) + 20 (build) + 20 (coverage) + 10 (screenshots) + 10 (lint) = 100%
```

### Example 2: Partial Proof

```typescript
const proof = {
  testResults: { passed: 12, total: 15 },  // Some failing
  buildSuccessful: true,
  lintErrors: 2,
  screenshots: ['desktop.png'],
  testCoverage: 65,
};

const confidence = calculateConfidence(proof);
// 0 (tests fail) + 20 (build) + 15 (coverage) + 5 (1 screenshot) + 0 (lint errors) = 40%
// Result: Validation failed, must iterate
```

## Best Practices

- Always capture screenshots for UI work
- Run the full test suite, not just the affected tests
- Include a coverage report for features
- The build must pass before any proof is valid
- Store proofs in the session directory for debugging
- Generate the proof summary in markdown for Linear comments (a storage sketch follows this list)
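
A minimal sketch of persisting proofs, assuming Node's fs/promises API; the `storeProof` helper and directory layout are hypothetical:

```typescript
import { mkdir, writeFile } from 'fs/promises';
import { join } from 'path';

// Writes the markdown summary and raw artifacts into a per-session directory
// (hypothetical layout, e.g. .sessions/<issue-id>/).
async function storeProof(
  sessionDir: string,
  summary: string,
  artifacts: Record<string, string>,
): Promise<void> {
  await mkdir(sessionDir, { recursive: true });
  await writeFile(join(sessionDir, 'proof-summary.md'), summary);
  for (const [name, content] of Object.entries(artifacts)) {
    await writeFile(join(sessionDir, name), content);
  }
}
```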

Overview

This skill generates verifiable proof artifacts to validate autonomous task completion. It standardizes screenshots, test and build outputs, coverage reports, and a confidence score to decide auto-approval. Use it to make task validation transparent, reproducible, and automatable.

How this skill works

The skill inspects produced artifacts (test results, build status, coverage, lint, and screenshots) and computes a confidence score using a weighted algorithm. It captures multi-viewport screenshots via Playwright patterns, parses test outputs, and bundles everything into a markdown proof summary. Confidence thresholds drive the next action: auto-approve, require manual review, or mark validation as failed.

When to use it

  • After completing bug fixes to prove minimal diffs and regression tests
  • When shipping new features that need screenshots, coverage, and build proof
  • For UI changes requiring desktop, tablet, and mobile visual verification
  • To gate automated workflows (auto-approve vs manual review) based on evidence
  • When storing a compact, shareable proof summary for reviewers or auditors

Best practices

  • Capture full-page screenshots at specified viewports (1920x1080, 768x1024, 375x667) for UI changes
  • Run the full test suite rather than only the affected tests, and include a regression test for bug fixes
  • Ensure builds succeed and include build output in artifacts before validating proof
  • Include a coverage report (target >= 80% for full credit) and lint report to improve confidence
  • Store artifacts in a session directory and attach a markdown proof summary to the issue or review

Example use cases

  • Bug fix: attach git diff, passing regression test, before/after error logs, and test-results.txt
  • Feature release: include screenshots, coverage.json, build-output.txt, deployment URL, and a generated confidence score
  • UI tweak: provide desktop/tablet/mobile screenshots plus accessibility score and optional visual regression diff
  • CI automation: compute confidence after pipeline runs to auto-approve tasks that meet >=95% threshold
  • Manual review trigger: surface tasks in the 80–94% band for human verification with attached proof summary

FAQ

How is the confidence score calculated?

A weighted algorithm assigns points for tests, build success, coverage, screenshots, and lint; totals map to predefined thresholds for actions.

What artifacts are required for UI changes?

Desktop, tablet, and mobile screenshots plus an accessibility score (>=80) are required; visual regression diffs are optional.

When will a task be auto-approved?

Tasks scoring >=95% are eligible for auto-approval and moved from In Review to Done automatically.