
This skill generates verifiable task completion artifacts, including screenshots, test results, and confidence scores, to streamline auto-approval decisions.

npx playbooks add skill madappgang/claude-code --skill proof-of-work

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (5.6 KB)
---
name: proof-of-work
description: Proof artifact generation patterns for task validation. Covers screenshots, test results, deployments, and confidence scoring.
version: 0.1.0
tags: [proof, validation, screenshots, tests, deployment]
keywords: [proof, artifact, screenshot, test, deployment, confidence, validation]
plugin: autopilot
updated: 2026-01-20
---

# Proof-of-Work

**Version:** 0.1.0
**Purpose:** Generate validation artifacts for autonomous task completion
**Status:** Phase 1

## When to Use

Use this skill when you need to:
- Generate proof artifacts after task completion
- Capture screenshots for UI verification
- Parse and report test results
- Calculate confidence scores for task validation
- Determine if a task can be auto-approved

## Overview

Proof-of-work is the mechanism that validates task completion. Every finished task must include verifiable artifacts that demonstrate the work was done correctly.

## Proof Types by Task

### Bug Fix Proof

| Artifact | Required | Purpose |
|----------|----------|---------|
| Git diff | Yes | Show minimal, focused changes |
| Test results | Yes | All tests passing |
| Regression test | Yes | Specific test for the bug |
| Error log (before/after) | Optional | Visual evidence |

### Feature Proof

| Artifact | Required | Purpose |
|----------|----------|---------|
| Screenshots | Yes | Visual verification |
| Test results | Yes | Functionality works |
| Coverage report | Yes | >= 80% coverage |
| Build output | Yes | Builds successfully |
| Deployment URL | Optional | Live demo |

### UI Change Proof

| Artifact | Required | Purpose |
|----------|----------|---------|
| Desktop screenshot | Yes | 1920x1080 view |
| Mobile screenshot | Yes | 375x667 view |
| Tablet screenshot | Yes | 768x1024 view |
| Accessibility score | Yes | >= 80 Lighthouse |
| Visual regression | Optional | BackstopJS diff |
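
If a validator needs to check these requirements programmatically, the three tables above can be encoded as plain data. A minimal sketch; the `TaskType` and `ArtifactRequirement` names are illustrative assumptions, not part of the skill:

```typescript
type TaskType = 'bug-fix' | 'feature' | 'ui-change';

interface ArtifactRequirement {
  artifact: string;
  required: boolean;
}

// Encodes the tables above so a validator can check which artifacts
// must be present before scoring confidence.
const PROOF_REQUIREMENTS: Record<TaskType, ArtifactRequirement[]> = {
  'bug-fix': [
    { artifact: 'git-diff', required: true },
    { artifact: 'test-results', required: true },
    { artifact: 'regression-test', required: true },
    { artifact: 'error-log-before-after', required: false },
  ],
  'feature': [
    { artifact: 'screenshots', required: true },
    { artifact: 'test-results', required: true },
    { artifact: 'coverage-report', required: true },
    { artifact: 'build-output', required: true },
    { artifact: 'deployment-url', required: false },
  ],
  'ui-change': [
    { artifact: 'desktop-screenshot', required: true },
    { artifact: 'mobile-screenshot', required: true },
    { artifact: 'tablet-screenshot', required: true },
    { artifact: 'accessibility-score', required: true },
    { artifact: 'visual-regression', required: false },
  ],
};
```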

## Screenshot Capture

**Playwright Pattern:**

```typescript
import { chromium } from 'playwright';

// Viewports matching the UI change proof requirements above.
const VIEWPORTS = [
  { name: 'desktop', width: 1920, height: 1080 },
  { name: 'mobile', width: 375, height: 667 },
  { name: 'tablet', width: 768, height: 1024 },
];

async function captureScreenshots(url: string, outputDir: string) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  // Capture a full-page screenshot at each viewport.
  for (const { name, width, height } of VIEWPORTS) {
    await page.setViewportSize({ width, height });
    await page.goto(url);
    await page.waitForLoadState('networkidle');
    await page.screenshot({
      path: `${outputDir}/${name}.png`,
      fullPage: true,
    });
  }

  await browser.close();
}
```
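
A usage sketch; the URL and output directory below are illustrative assumptions (the app must already be reachable at that URL):

```typescript
// Produces proof/desktop.png, proof/mobile.png, and proof/tablet.png.
await captureScreenshots('http://localhost:3000', './proof');
```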

## Confidence Scoring

**Algorithm:**

```typescript
interface ProofArtifacts {
  testResults?: { passed: number; total: number };
  buildSuccessful?: boolean;
  lintErrors?: number;
  screenshots?: string[];
  testCoverage?: number;
  performanceScore?: number;
}

function calculateConfidence(artifacts: ProofArtifacts): number {
  let score = 0;

  // Tests (40 points)
  if (artifacts.testResults) {
    if (artifacts.testResults.passed === artifacts.testResults.total) {
      score += 40;
    }
  }

  // Build (20 points)
  if (artifacts.buildSuccessful) {
    score += 20;
  }

  // Coverage (20 points)
  if (artifacts.testCoverage) {
    if (artifacts.testCoverage >= 80) score += 20;
    else if (artifacts.testCoverage >= 60) score += 15;
    else if (artifacts.testCoverage >= 40) score += 10;
    else score += 5;
  }

  // Screenshots (10 points)
  if (artifacts.screenshots) {
    if (artifacts.screenshots.length >= 3) score += 10;
    else if (artifacts.screenshots.length >= 1) score += 5;
  }

  // Lint (10 points)
  if (artifacts.lintErrors === 0) {
    score += 10;
  }

  return score;
}
```

## Confidence Thresholds

| Confidence | Action |
|------------|--------|
| >= 95% | Auto-approve (In Review -> Done) |
| 80-94% | Manual review required |
| < 80% | Validation failed, iterate |
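
These thresholds can be applied directly to the score returned by `calculateConfidence`. A minimal sketch; the `decideAction` helper and `ValidationAction` type are illustrative, not part of the skill:

```typescript
type ValidationAction = 'auto-approve' | 'manual-review' | 'iterate';

// Maps a confidence score (0-100) to the action in the table above.
function decideAction(confidence: number): ValidationAction {
  if (confidence >= 95) return 'auto-approve';  // In Review -> Done
  if (confidence >= 80) return 'manual-review'; // Human verification
  return 'iterate';                             // Validation failed
}
```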

## Proof Summary Template

```markdown
# Proof of Work

**Task**: {issue_id}
**Type**: {task_type}
**Confidence**: {score}%

## Test Results
- Total: {total}
- Passed: {passed}
- Failed: {failed}
- Coverage: {coverage}%

## Build
- Status: {status}
- Duration: {duration}

## Screenshots
- Desktop: proof/desktop.png
- Mobile: proof/mobile.png
- Tablet: proof/tablet.png

## Artifacts
- test-results.txt
- coverage.json
- build-output.txt
```
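
A sketch of filling the template from collected values; the `renderProofSummary` helper and its input shape are hypothetical and only illustrate the substitution:

```typescript
// Hypothetical input shape; field names mirror the template placeholders.
interface ProofSummaryInput {
  issueId: string;
  taskType: string;
  confidence: number;
  tests: { total: number; passed: number; failed: number; coverage: number };
}

// Fills the markdown template above; the build and screenshot sections
// follow the same substitution pattern.
function renderProofSummary(p: ProofSummaryInput): string {
  return [
    '# Proof of Work',
    '',
    `**Task**: ${p.issueId}`,
    `**Type**: ${p.taskType}`,
    `**Confidence**: ${p.confidence}%`,
    '',
    '## Test Results',
    `- Total: ${p.tests.total}`,
    `- Passed: ${p.tests.passed}`,
    `- Failed: ${p.tests.failed}`,
    `- Coverage: ${p.tests.coverage}%`,
  ].join('\n');
}
```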

## Examples

### Example 1: Feature Proof Generation

```typescript
const proof = {
  testResults: { passed: 15, total: 15 },
  buildSuccessful: true,
  lintErrors: 0,
  screenshots: ['desktop.png', 'mobile.png', 'tablet.png'],
  testCoverage: 85,
};

const confidence = calculateConfidence(proof);
// 40 (tests) + 20 (build) + 20 (coverage) + 10 (screenshots) + 10 (lint) = 100%
```

### Example 2: Partial Proof

```typescript
const proof = {
  testResults: { passed: 12, total: 15 },  // Some failing
  buildSuccessful: true,
  lintErrors: 2,
  screenshots: ['desktop.png'],
  testCoverage: 65,
};

const confidence = calculateConfidence(proof);
// 0 (tests fail) + 20 (build) + 15 (coverage) + 5 (1 screenshot) + 0 (lint errors) = 40%
// Result: Validation failed, must iterate
```

## Best Practices

- Always capture screenshots for UI work
- Run the full test suite, not just the affected tests
- Include a coverage report for features
- The build must pass before any proof is valid
- Store proofs in the session directory for debugging
- Generate the proof summary in markdown for Linear comments (a storage sketch follows this list)
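
A minimal sketch of persisting proofs, assuming Node's fs/promises API; the `storeProof` helper and directory layout are hypothetical:

```typescript
import { mkdir, writeFile } from 'fs/promises';
import { join } from 'path';

// Writes the markdown summary and raw artifacts into a per-session directory
// (hypothetical layout, e.g. .sessions/<issue-id>/).
async function storeProof(
  sessionDir: string,
  summary: string,
  artifacts: Record<string, string>,
): Promise<void> {
  await mkdir(sessionDir, { recursive: true });
  await writeFile(join(sessionDir, 'proof-summary.md'), summary);
  for (const [name, content] of Object.entries(artifacts)) {
    await writeFile(join(sessionDir, name), content);
  }
}
```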

Overview

This skill generates verifiable proof artifacts to validate autonomous task completion. It standardizes screenshots, test and build outputs, coverage reports, and a confidence score to decide auto-approval. Use it to make task validation transparent, reproducible, and automatable.

How this skill works

The skill inspects produced artifacts (test results, build status, coverage, lint, and screenshots) and computes a confidence score using a weighted algorithm. It captures multi-viewport screenshots via Playwright patterns, parses test outputs, and bundles everything into a markdown proof summary. Confidence thresholds drive the next action: auto-approve, require manual review, or mark validation as failed.

When to use it

  • After completing bug fixes to prove minimal diffs and regression tests
  • When shipping new features that need screenshots, coverage, and build proof
  • For UI changes requiring desktop, tablet, and mobile visual verification
  • To gate automated workflows (auto-approve vs manual review) based on evidence
  • When storing a compact, shareable proof summary for reviewers or auditors

Best practices

  • Capture full-page screenshots at specified viewports (1920x1080, 768x1024, 375x667) for UI changes
  • Run the full test suite rather than only the affected tests, and include a regression test for bug fixes
  • Ensure builds succeed and include build output in artifacts before validating proof
  • Include a coverage report (target >= 80% for full credit) and lint report to improve confidence
  • Store artifacts in a session directory and attach a markdown proof summary to the issue or review

Example use cases

  • Bug fix: attach git diff, passing regression test, before/after error logs, and test-results.txt
  • Feature release: include screenshots, coverage.json, build-output.txt, deployment URL, and a generated confidence score
  • UI tweak: provide desktop/tablet/mobile screenshots plus accessibility score and optional visual regression diff
  • CI automation: compute confidence after pipeline runs to auto-approve tasks that meet >=95% threshold
  • Manual review trigger: surface tasks in the 80–94% band for human verification with attached proof summary

FAQ

How is the confidence score calculated?

A weighted algorithm assigns points for tests, build success, coverage, screenshots, and lint; totals map to predefined thresholds for actions.

What artifacts are required for UI changes?

Desktop, tablet, and mobile screenshots plus an accessibility score (>=80) are required; visual regression diffs are optional.

When will a task be auto-approved?

Tasks scoring >=95% are eligible for auto-approval and moved from In Review to Done automatically.