
writing-good-tests skill

/plugins/ed3d-house-style/skills/writing-good-tests

This skill helps you write robust tests by emphasizing integration over unit tests, waiting for real conditions, and using strategic mocks.

npx playbooks add skill ed3dai/ed3d-plugins --skill writing-good-tests

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (10.5 KB)
---
name: writing-good-tests
description: Use when writing or reviewing tests - covers test philosophy, condition-based waiting, mocking strategy, and test isolation
user-invocable: false
---

# Writing Good Tests

## Philosophy

**"Write tests. Not too many. Mostly integration."** — Kent C. Dodds

Tests verify real behavior, not implementation details. The goal is confidence that your code works, not coverage numbers.

**Core principles:**
1. Test behavior, not implementation — refactoring shouldn't break tests
2. Integration tests provide better confidence-to-cost ratio than unit tests
3. Wait for actual conditions, not arbitrary timeouts
4. Mock strategically — real dependencies when feasible, mocks for external systems
5. Don't pollute production code with test-only methods

## Test Structure

Use **Arrange-Act-Assert** (or Given-When-Then):

```typescript
test('user can cancel reservation', async () => {
  // Arrange
  const reservation = await createReservation({ userId: 'user-1', roomId: 'room-1' });

  // Act
  const result = await cancelReservation(reservation.id);

  // Assert
  expect(result.status).toBe('cancelled');
  expect(await getReservation(reservation.id)).toBeNull();
});
```

**One action per test.** Multiple assertions are fine if they verify the same behavior.

## Condition-Based Waiting

Flaky tests often guess at timing. This creates race conditions where tests pass locally but fail in CI.

**Wait for conditions, not time:**

```typescript
// BAD: Guessing at timing
await new Promise(r => setTimeout(r, 50));
const result = getResult();

// GOOD: Waiting for condition
await waitFor(() => getResult() !== undefined);
const result = getResult();
```

### Generic Polling Function

```typescript
async function waitFor<T>(
  condition: () => T | undefined | null | false,
  description = 'condition',
  timeoutMs = 5000
): Promise<T> {
  const startTime = Date.now();

  while (true) {
    const result = condition();
    if (result) return result;

    if (Date.now() - startTime > timeoutMs) {
      throw new Error(`Timeout waiting for ${description} after ${timeoutMs}ms`);
    }

    await new Promise(r => setTimeout(r, 10)); // Poll every 10ms
  }
}
```

### Quick Patterns

| Scenario | Pattern |
|----------|---------|
| Wait for event | `waitFor(() => events.find(e => e.type === 'DONE'))` |
| Wait for state | `waitFor(() => machine.state === 'ready')` |
| Wait for count | `waitFor(() => items.length >= 5)` |
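
For example, the event pattern with an explicit description, so a timeout produces a readable error (`events` and the `DONE` event type are illustrative):

```typescript
// Fails with "Timeout waiting for DONE event after 5000ms" instead of hanging silently
const done = await waitFor(() => events.find(e => e.type === 'DONE'), 'DONE event');
expect(done.type).toBe('DONE');
```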

### When Arbitrary Timeout IS Correct

Only when testing actual timing behavior (debounce, throttle, intervals):

```typescript
// Testing tool that ticks every 100ms
await waitForEvent(manager, 'TOOL_STARTED'); // First: wait for condition
await new Promise(r => setTimeout(r, 200));   // Then: wait for 2 ticks
// Comment explains WHY: 200ms = 2 ticks at 100ms intervals
```

## Mocking Strategy

> "You don't hate mocks; you hate side-effects." — J.B. Rainsberger

Mocks reveal where side-effects complicate your code. Use them strategically, not reflexively.

### Don't Mock What You Don't Own

Create thin wrappers around third-party libraries. Mock YOUR wrapper, not the library.

```typescript
// BAD: Mock the third-party HTTP client directly
vi.mock('axios');
const mockedClient = vi.mocked(axios);

// GOOD: Create your own wrapper
class RegistryClient {
  constructor(private client: HttpClient) {}
  async getRepos() {
    return this.client.get('https://registry.example.com/v2/_catalog');
  }
}

// Mock your wrapper
vi.mock('./registry-client');
```

This simplifies tests AND improves your design.

### Managed vs Unmanaged Dependencies

| Dependency Type | Example | Strategy |
|-----------------|---------|----------|
| **Managed** (you control it) | Your database, your file system | Use REAL instances |
| **Unmanaged** (external) | Third-party APIs, SMTP, message bus | Use MOCKS |

Communications with managed dependencies are implementation details — you can refactor them freely. Communications with unmanaged dependencies are observable behavior — mocking protects against external changes.
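
A minimal sketch of that split in vitest, assuming a real test database handle `db` and a hypothetical `EmailGateway` wrapper around an external SMTP service (all module paths and names here are illustrative):

```typescript
import { test, expect, vi } from 'vitest';
import { EmailGateway } from './email-gateway';
import { signupUser } from './signup';
import { db } from './test-db'; // real database used by the test suite

// Unmanaged dependency (SMTP behind EmailGateway): mock it
vi.mock('./email-gateway');

test('signup stores the user and sends a welcome email', async () => {
  // Managed dependency (our database): use the real thing
  const user = await signupUser(db, { email: 'alice@test.com' });

  // Verify real state in the managed dependency
  expect(await db.users.findByEmail('alice@test.com')).not.toBeNull();

  // Verify the outgoing call to the unmanaged dependency
  expect(vi.mocked(EmailGateway.prototype.sendWelcome)).toHaveBeenCalledWith(user.id);
});
```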

### Anti-Pattern: Testing Mock Behavior

```typescript
// BAD: Testing that the mock exists
test('renders sidebar', () => {
  render(<Page />);
  expect(screen.getByTestId('sidebar-mock')).toBeInTheDocument();
});

// GOOD: Test real behavior
test('renders sidebar', () => {
  render(<Page />);
  expect(screen.getByRole('navigation')).toBeInTheDocument();
});
```

**Gate:** Before asserting on any mock element, ask: "Am I testing real behavior or mock existence?"

### Anti-Pattern: Mocking Without Understanding

```typescript
// BAD: Mock breaks test logic
test('detects duplicate server', async () => {
  // Mock prevents config write that test depends on!
  vi.mock('ToolCatalog', () => ({
    discoverAndCacheTools: vi.fn().mockResolvedValue(undefined)
  }));
  await addServer(config);
  await addServer(config);  // Should throw - but won't!
});

// GOOD: Mock at correct level
test('detects duplicate server', async () => {
  vi.mock('MCPServerManager'); // Just mock slow server startup
  await addServer(config);  // Config written
  await addServer(config);  // Duplicate detected
});
```

**Gate:** Before mocking, ask: "What side effects does this have? Does my test depend on them?"

### Anti-Pattern: Incomplete Mocks

Mock the COMPLETE data structure as it exists in reality:

```typescript
// BAD: Partial mock
const mockResponse = {
  status: 'success',
  data: { userId: '123' }
  // Missing: metadata that downstream code uses
};

// GOOD: Mirror real API
const mockResponse = {
  status: 'success',
  data: { userId: '123', name: 'Alice' },
  metadata: { requestId: 'req-789', timestamp: 1234567890 }
};
```

### When Mocks Become Too Complex

Warning signs:
- Mock setup longer than test logic
- Mocking everything to make test pass
- Test breaks when mock changes

> "As the number of mocks grows, the probability of testing the mock instead of the desired code goes up." — Codurance

Consider integration tests with real components — often simpler than elaborate mocks.
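
As a rough before/after, a test buried in mock plumbing often collapses into a shorter integration test (`checkout`, `seedCatalog`, and the module names are hypothetical):

```typescript
// BEFORE: more mock setup than test logic
vi.mock('./inventory');
vi.mock('./pricing');
vi.mock('./tax');
vi.mocked(getStock).mockResolvedValue(10);
vi.mocked(getPrice).mockResolvedValue(999);
vi.mocked(getTaxRate).mockResolvedValue(0.2);
expect(await checkout(cart)).toBeCloseTo(1198.8);

// AFTER: integration test against real components
await seedCatalog(db, [{ sku: 'widget', stock: 10, price: 999 }]);
expect(await checkout(cart)).toBeCloseTo(1198.8);
```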

### Anti-Pattern: Test-Only Methods in Production

```typescript
// BAD: destroy() only used in tests
class Session {
  async destroy() { /* cleanup */ }
}

// GOOD: Test utilities handle cleanup
// test-utils/session-helpers.ts
export async function cleanupSession(session: Session) {
  const workspace = session.getWorkspaceInfo();
  if (workspace) {
    await workspaceManager.destroyWorkspace(workspace.id);
  }
}
```

**Gate:** Before adding any method to production class, ask: "Is this only used by tests?" If yes, put it in test utilities.

## Test Isolation

Tests should not depend on execution order. But isolation doesn't mean cleaning up everything.

### What to Clean Up

**Long-lived resources MUST be cleaned up:**
- Virtual machines, containers
- Kubernetes jobs, pods, deployments
- Cloud resources (instances, buckets)
- Background processes, daemons

**Prefer product tools for cleanup** when possible:
```typescript
afterAll(async () => {
  // Use the product's own cleanup mechanisms
  await deployment.delete();
  await job.terminate();
});
```

**Side-channel cleanup** when product tools aren't available:
```typescript
afterAll(async () => {
  // Direct cleanup when product doesn't provide it
  await exec('kubectl delete job test-job-123');
});
```

### What's OK to Leave

**Database artifacts are fine to leave around.** Trying to clean up test data perfectly is a fool's errand and makes multi-step integration tests nearly impossible.

- Test records in databases
- Log entries
- Cached data that expires

The database should handle its own lifecycle. Tests that require pristine state should create unique identifiers, not depend on cleanup.

### Preventing Order Dependencies

```typescript
// Use unique identifiers instead of depending on clean state
const testId = `test-${Date.now()}-${Math.random()}`;
const user = await createUser({ email: `${testId}@test.com` });
```

## Quick Reference

| Problem | Fix |
|---------|-----|
| Arbitrary setTimeout in tests | Use condition-based waiting |
| Assert on mock elements | Test real component or unmock |
| Mock third-party directly | Create wrapper, mock wrapper |
| Test-only methods in production | Move to test utilities |
| Mock without understanding | Understand dependencies first |
| Incomplete mocks | Mirror real API completely |
| Over-complex mocks | Consider integration tests |
| Long-lived resources left running | Clean up VMs, k8s jobs, cloud resources |

## Red Flags

**Stop and reconsider when you see:**
- Arbitrary `setTimeout`/`sleep` without justification
- Assertions on mock elements or test IDs
- Methods only called in test files
- Mock setup is >50% of test code
- "Mocking just to be safe"
- Test depends on another test running first
- Long-lived resources not cleaned up

## TDD Connection

TDD prevents most testing anti-patterns:
- Write test first → forces thinking about what you're testing
- Watch it fail → confirms test tests real behavior, not mocks
- Minimal implementation → no test-only methods creep in
- Real dependencies first → you see what test needs before mocking
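
A compressed red-green pass, using a hypothetical `slugify` function:

```typescript
// RED: write the test first and watch it fail -
// failure proves the test exercises real behavior
test('slugify lowercases and hyphenates', () => {
  expect(slugify('Hello World')).toBe('hello-world');
});

// GREEN: the minimal implementation, with nothing test-only
export function slugify(input: string): string {
  return input.trim().toLowerCase().replace(/\s+/g, '-');
}
```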

## Property-Based Testing

For certain patterns, property-based testing provides stronger coverage than example-based tests. See `property-based-testing` skill for complete reference.

### When to Use PBT

| Pattern | Example | Why PBT |
|---------|---------|---------|
| Serialization pairs | `encode`/`decode`, `toJSON`/`fromJSON` | Roundtrip property catches edge cases |
| Normalizers | `sanitize`, `canonicalize`, `format` | Idempotence property ensures stability |
| Validators | `is_valid`, `validate` | Valid-after-normalize property |
| Pure functions | Business logic, calculations | Multiple properties verify contract |
| Sorting/ordering | `sort`, `rank`, `compare` | Ordering + idempotence properties |
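
A sketch of the roundtrip and idempotence properties with the fast-check library, assuming hypothetical `encode`/`decode` and `sanitize` functions:

```typescript
import fc from 'fast-check';
import { test, expect } from 'vitest';

// Roundtrip: decoding an encoded value recovers the original
test('encode/decode roundtrip', () => {
  fc.assert(
    fc.property(fc.record({ id: fc.string(), count: fc.nat() }), (obj) => {
      expect(decode(encode(obj))).toEqual(obj);
    })
  );
});

// Idempotence: normalizing twice equals normalizing once
test('sanitize is idempotent', () => {
  fc.assert(
    fc.property(fc.string(), (s) => {
      expect(sanitize(sanitize(s))).toBe(sanitize(s));
    })
  );
});
```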

### When NOT to Use PBT

- Simple CRUD without transformation
- UI/presentation logic
- Integration tests requiring external setup
- When specific examples suffice and edge cases are well-understood
- Prototyping with fluid requirements

### PBT Quality Gates

Before committing property-based tests:

- [ ] **Not tautological:** Assertion doesn't compare same expression (`sorted(xs) == sorted(xs)` tests nothing)
- [ ] **Strong property:** Not just "no crash" - aim for roundtrip, idempotence, or invariants
- [ ] **Not vacuous:** `assume()` calls don't filter out most inputs
- [ ] **Edge cases explicit:** Include `@example([])`, `@example([1])` decorators
- [ ] **No reimplementation:** Don't restate function logic in assertion (`assert add(a,b) == a+b`)
- [ ] **Realistic constraints:** Strategy matches real-world input constraints
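
For instance, the first two gates separate a tautology from a property worth running (again sketched with fast-check and a hypothetical `mySort`):

```typescript
// Tautological: compares the expression with itself, so it can never fail
fc.assert(fc.property(fc.array(fc.integer()), (xs) => {
  expect(mySort(xs)).toEqual(mySort(xs)); // tests nothing
}));

// Strong: asserts invariants the implementation could actually violate
fc.assert(fc.property(fc.array(fc.integer()), (xs) => {
  const out = mySort(xs);
  for (let i = 1; i < out.length; i++) {
    expect(out[i - 1]).toBeLessThanOrEqual(out[i]); // ordered
  }
  expect(out).toHaveLength(xs.length); // same number of elements
}));
```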

Overview

This skill helps you write and review reliable tests by focusing on test philosophy, condition-based waiting, mocking strategy, and test isolation. It emphasizes behavior-driven testing, pragmatic integration tests, and avoiding fragile timing or mock-dependent assertions. Use it to increase confidence in real behavior while keeping tests maintainable and fast.

How this skill works

The skill reviews test structure and offers patterns: Arrange-Act-Assert, one action per test, and condition-based waiting instead of arbitrary sleeps. It shows where to use real dependencies versus mocks, how to design wrappers for third-party clients, and which resources must be cleaned up versus which are safe to leave. It also provides quick-reference fixes for common anti-patterns and red flags that should make you stop and reconsider.

When to use it

  • Writing new tests to ensure behavior rather than implementation details
  • Reviewing flaky CI tests that rely on setTimeout or fixed sleeps
  • Deciding whether to mock an external service or use a real instance
  • Designing cleanup procedures for long-lived resources (VMs, k8s, cloud)
  • Refactoring tests after production code changes to avoid brittle assertions

Best practices

  • Test behavior not implementation; refactors shouldn't break tests
  • Favor integration tests for confidence-to-cost ratio, unit tests for narrow logic
  • Wait for conditions with a poll/wait helper, never guess timing with sleeps
  • Mock your own wrappers, not third-party libraries directly
  • Avoid adding test-only methods to production code; use test utilities instead

Example use cases

  • Replace setTimeout-based waits with a waitFor(condition, description, timeout) helper
  • Create a thin wrapper around an HTTP client and mock that wrapper in tests
  • Write integration tests that exercise the real database but use unique identifiers to avoid order dependencies
  • Detect flaky tests by identifying assertions on mock elements or test IDs
  • Decide cleanup strategy: use product APIs for teardown, fall back to side-channel commands when necessary

FAQ

When is an arbitrary timeout acceptable in a test?

Only when the test verifies timing behavior (debounce, throttle, intervals) and the timeout is documented to match those intervals.

How do I know if my mocks are too complex?

If mock setup is longer than test logic, or tests break when mocks change, consider using real components or simpler integration tests.