
writing-good-tests skill

/plugins/ed3d-house-style/skills/writing-good-tests

This skill helps you write robust tests by emphasizing integration over unit tests, waiting for real conditions, and using strategic mocks.

npx playbooks add skill ed3dai/ed3d-plugins --skill writing-good-tests

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (10.5 KB)
---
name: writing-good-tests
description: Use when writing or reviewing tests - covers test philosophy, condition-based waiting, mocking strategy, and test isolation
user-invocable: false
---

# Writing Good Tests

## Philosophy

**"Write tests. Not too many. Mostly integration."** — Kent C. Dodds

Tests verify real behavior, not implementation details. The goal is confidence that your code works, not coverage numbers.

**Core principles:**
1. Test behavior, not implementation — refactoring shouldn't break tests
2. Integration tests provide better confidence-to-cost ratio than unit tests
3. Wait for actual conditions, not arbitrary timeouts
4. Mock strategically — real dependencies when feasible, mocks for external systems
5. Don't pollute production code with test-only methods

## Test Structure

Use **Arrange-Act-Assert** (or Given-When-Then):

```typescript
test('user can cancel reservation', async () => {
  // Arrange
  const reservation = await createReservation({ userId: 'user-1', roomId: 'room-1' });

  // Act
  const result = await cancelReservation(reservation.id);

  // Assert
  expect(result.status).toBe('cancelled');
  expect(await getReservation(reservation.id)).toBeNull();
});
```

**One action per test.** Multiple assertions are fine if they verify the same behavior.

## Condition-Based Waiting

Flaky tests often guess at timing. This creates race conditions where tests pass locally but fail in CI.

**Wait for conditions, not time:**

```typescript
// BAD: Guessing at timing
await new Promise(r => setTimeout(r, 50));
const result = getResult();

// GOOD: Waiting for condition
await waitFor(() => getResult() !== undefined);
const result = getResult();
```

### Generic Polling Function

```typescript
async function waitFor<T>(
  condition: () => T | undefined | null | false,
  description = 'condition',
  timeoutMs = 5000
): Promise<T> {
  const startTime = Date.now();

  while (true) {
    const result = condition();
    if (result) return result;

    if (Date.now() - startTime > timeoutMs) {
      throw new Error(`Timeout waiting for ${description} after ${timeoutMs}ms`);
    }

    await new Promise(r => setTimeout(r, 10)); // Poll every 10ms
  }
}
```

### Quick Patterns

| Scenario | Pattern |
|----------|---------|
| Wait for event | `waitFor(() => events.find(e => e.type === 'DONE'))` |
| Wait for state | `waitFor(() => machine.state === 'ready')` |
| Wait for count | `waitFor(() => items.length >= 5)` |
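
For example, the event pattern with an explicit description, so a timeout produces a readable error (`events` and the `DONE` event type are illustrative):

```typescript
// Fails with "Timeout waiting for DONE event after 5000ms" instead of hanging silently
const done = await waitFor(() => events.find(e => e.type === 'DONE'), 'DONE event');
expect(done.type).toBe('DONE');
```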

### When Arbitrary Timeout IS Correct

Only when testing actual timing behavior (debounce, throttle, intervals):

```typescript
// Testing tool that ticks every 100ms
await waitForEvent(manager, 'TOOL_STARTED'); // First: wait for condition
await new Promise(r => setTimeout(r, 200));   // Then: wait for 2 ticks
// Comment explains WHY: 200ms = 2 ticks at 100ms intervals
```

## Mocking Strategy

> "You don't hate mocks; you hate side-effects." — J.B. Rainsberger

Mocks reveal where side-effects complicate your code. Use them strategically, not reflexively.

### Don't Mock What You Don't Own

Create thin wrappers around third-party libraries. Mock YOUR wrapper, not the library.

```typescript
// BAD: Mock the third-party HTTP client directly
vi.mock('axios');
const mockedClient = vi.mocked(axios);

// GOOD: Create your own wrapper
class RegistryClient {
  constructor(private client: HttpClient) {}
  async getRepos() {
    return this.client.get('https://registry.example.com/v2/_catalog');
  }
}

// Mock your wrapper
vi.mock('./registry-client');
```

This simplifies tests AND improves your design.

### Managed vs Unmanaged Dependencies

| Dependency Type | Example | Strategy |
|-----------------|---------|----------|
| **Managed** (you control it) | Your database, your file system | Use REAL instances |
| **Unmanaged** (external) | Third-party APIs, SMTP, message bus | Use MOCKS |

Communications with managed dependencies are implementation details — you can refactor them freely. Communications with unmanaged dependencies are observable behavior — mocking protects against external changes.
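
A minimal sketch of that split in vitest, assuming a real test database handle `db` and a hypothetical `EmailGateway` wrapper around an external SMTP service (all module paths and names here are illustrative):

```typescript
import { test, expect, vi } from 'vitest';
import { EmailGateway } from './email-gateway';
import { signupUser } from './signup';
import { db } from './test-db'; // real database used by the test suite

// Unmanaged dependency (SMTP behind EmailGateway): mock it
vi.mock('./email-gateway');

test('signup stores the user and sends a welcome email', async () => {
  // Managed dependency (our database): use the real thing
  const user = await signupUser(db, { email: 'alice@test.com' });

  // Verify real state in the managed dependency
  expect(await db.users.findByEmail('alice@test.com')).not.toBeNull();

  // Verify the outgoing call to the unmanaged dependency
  expect(vi.mocked(EmailGateway.prototype.sendWelcome)).toHaveBeenCalledWith(user.id);
});
```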

### Anti-Pattern: Testing Mock Behavior

```typescript
// BAD: Testing that the mock exists
test('renders sidebar', () => {
  render(<Page />);
  expect(screen.getByTestId('sidebar-mock')).toBeInTheDocument();
});

// GOOD: Test real behavior
test('renders sidebar', () => {
  render(<Page />);
  expect(screen.getByRole('navigation')).toBeInTheDocument();
});
```

**Gate:** Before asserting on any mock element, ask: "Am I testing real behavior or mock existence?"

### Anti-Pattern: Mocking Without Understanding

```typescript
// BAD: Mock breaks test logic
test('detects duplicate server', async () => {
  // Mock prevents config write that test depends on!
  vi.mock('ToolCatalog', () => ({
    discoverAndCacheTools: vi.fn().mockResolvedValue(undefined)
  }));
  await addServer(config);
  await addServer(config);  // Should throw - but won't!
});

// GOOD: Mock at correct level
test('detects duplicate server', async () => {
  vi.mock('MCPServerManager'); // Just mock slow server startup
  await addServer(config);  // Config written
  await addServer(config);  // Duplicate detected
});
```

**Gate:** Before mocking, ask: "What side effects does this have? Does my test depend on them?"

### Anti-Pattern: Incomplete Mocks

Mock the COMPLETE data structure as it exists in reality:

```typescript
// BAD: Partial mock
const mockResponse = {
  status: 'success',
  data: { userId: '123' }
  // Missing: metadata that downstream code uses
};

// GOOD: Mirror real API
const mockResponse = {
  status: 'success',
  data: { userId: '123', name: 'Alice' },
  metadata: { requestId: 'req-789', timestamp: 1234567890 }
};
```

### When Mocks Become Too Complex

Warning signs:
- Mock setup longer than test logic
- Mocking everything to make test pass
- Test breaks when mock changes

> "As the number of mocks grows, the probability of testing the mock instead of the desired code goes up." — Codurance

Consider integration tests with real components — often simpler than elaborate mocks.
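
As a rough before/after, a test buried in mock plumbing often collapses into a shorter integration test (`checkout`, `seedCatalog`, and the module names are hypothetical):

```typescript
// BEFORE: more mock setup than test logic
vi.mock('./inventory');
vi.mock('./pricing');
vi.mock('./tax');
vi.mocked(getStock).mockResolvedValue(10);
vi.mocked(getPrice).mockResolvedValue(999);
vi.mocked(getTaxRate).mockResolvedValue(0.2);
expect(await checkout(cart)).toBeCloseTo(1198.8);

// AFTER: integration test against real components
await seedCatalog(db, [{ sku: 'widget', stock: 10, price: 999 }]);
expect(await checkout(cart)).toBeCloseTo(1198.8);
```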

### Anti-Pattern: Test-Only Methods in Production

```typescript
// BAD: destroy() only used in tests
class Session {
  async destroy() { /* cleanup */ }
}

// GOOD: Test utilities handle cleanup
// test-utils/session-helpers.ts
export async function cleanupSession(session: Session) {
  const workspace = session.getWorkspaceInfo();
  if (workspace) {
    await workspaceManager.destroyWorkspace(workspace.id);
  }
}
```

**Gate:** Before adding any method to production class, ask: "Is this only used by tests?" If yes, put it in test utilities.

## Test Isolation

Tests should not depend on execution order. But isolation doesn't mean cleaning up everything.

### What to Clean Up

**Long-lived resources MUST be cleaned up:**
- Virtual machines, containers
- Kubernetes jobs, pods, deployments
- Cloud resources (instances, buckets)
- Background processes, daemons

**Prefer product tools for cleanup** when possible:
```typescript
afterAll(async () => {
  // Use the product's own cleanup mechanisms
  await deployment.delete();
  await job.terminate();
});
```

**Side-channel cleanup** when product tools aren't available:
```typescript
afterAll(async () => {
  // Direct cleanup when product doesn't provide it
  await exec('kubectl delete job test-job-123');
});
```

### What's OK to Leave

**Database artifacts are fine to leave around.** Trying to clean up test data perfectly is a fool's errand and makes multi-step integration tests nearly impossible.

- Test records in databases
- Log entries
- Cached data that expires

The database should handle its own lifecycle. Tests that require pristine state should create unique identifiers, not depend on cleanup.

### Preventing Order Dependencies

```typescript
// Use unique identifiers instead of depending on clean state
const testId = `test-${Date.now()}-${Math.random()}`;
const user = await createUser({ email: `${testId}@test.com` });
```

## Quick Reference

| Problem | Fix |
|---------|-----|
| Arbitrary setTimeout in tests | Use condition-based waiting |
| Assert on mock elements | Test real component or unmock |
| Mock third-party directly | Create wrapper, mock wrapper |
| Test-only methods in production | Move to test utilities |
| Mock without understanding | Understand dependencies first |
| Incomplete mocks | Mirror real API completely |
| Over-complex mocks | Consider integration tests |
| Long-lived resources left running | Clean up VMs, k8s jobs, cloud resources |

## Red Flags

**Stop and reconsider when you see:**
- Arbitrary `setTimeout`/`sleep` without justification
- Assertions on mock elements or test IDs
- Methods only called in test files
- Mock setup is >50% of test code
- "Mocking just to be safe"
- Test depends on another test running first
- Long-lived resources not cleaned up

## TDD Connection

TDD prevents most testing anti-patterns:
- Write test first → forces thinking about what you're testing
- Watch it fail → confirms test tests real behavior, not mocks
- Minimal implementation → no test-only methods creep in
- Real dependencies first → you see what test needs before mocking
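
A compressed red-green pass, using a hypothetical `slugify` function:

```typescript
// RED: write the test first and watch it fail -
// failure proves the test exercises real behavior
test('slugify lowercases and hyphenates', () => {
  expect(slugify('Hello World')).toBe('hello-world');
});

// GREEN: the minimal implementation, with nothing test-only
export function slugify(input: string): string {
  return input.trim().toLowerCase().replace(/\s+/g, '-');
}
```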

## Property-Based Testing

For certain patterns, property-based testing provides stronger coverage than example-based tests. See `property-based-testing` skill for complete reference.

### When to Use PBT

| Pattern | Example | Why PBT |
|---------|---------|---------|
| Serialization pairs | `encode`/`decode`, `toJSON`/`fromJSON` | Roundtrip property catches edge cases |
| Normalizers | `sanitize`, `canonicalize`, `format` | Idempotence property ensures stability |
| Validators | `is_valid`, `validate` | Valid-after-normalize property |
| Pure functions | Business logic, calculations | Multiple properties verify contract |
| Sorting/ordering | `sort`, `rank`, `compare` | Ordering + idempotence properties |
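
A sketch of the roundtrip and idempotence properties with the fast-check library, assuming hypothetical `encode`/`decode` and `sanitize` functions:

```typescript
import fc from 'fast-check';
import { test, expect } from 'vitest';

// Roundtrip: decoding an encoded value recovers the original
test('encode/decode roundtrip', () => {
  fc.assert(
    fc.property(fc.record({ id: fc.string(), count: fc.nat() }), (obj) => {
      expect(decode(encode(obj))).toEqual(obj);
    })
  );
});

// Idempotence: normalizing twice equals normalizing once
test('sanitize is idempotent', () => {
  fc.assert(
    fc.property(fc.string(), (s) => {
      expect(sanitize(sanitize(s))).toBe(sanitize(s));
    })
  );
});
```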

### When NOT to Use PBT

- Simple CRUD without transformation
- UI/presentation logic
- Integration tests requiring external setup
- When specific examples suffice and edge cases are well-understood
- Prototyping with fluid requirements

### PBT Quality Gates

Before committing property-based tests:

- [ ] **Not tautological:** Assertion doesn't compare same expression (`sorted(xs) == sorted(xs)` tests nothing)
- [ ] **Strong property:** Not just "no crash" - aim for roundtrip, idempotence, or invariants
- [ ] **Not vacuous:** `assume()` calls don't filter out most inputs
- [ ] **Edge cases explicit:** Include `@example([])`, `@example([1])` decorators
- [ ] **No reimplementation:** Don't restate function logic in assertion (`assert add(a,b) == a+b`)
- [ ] **Realistic constraints:** Strategy matches real-world input constraints
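
For instance, the first two gates separate a tautology from a property worth running (again sketched with fast-check and a hypothetical `mySort`):

```typescript
// Tautological: compares the expression with itself, so it can never fail
fc.assert(fc.property(fc.array(fc.integer()), (xs) => {
  expect(mySort(xs)).toEqual(mySort(xs)); // tests nothing
}));

// Strong: asserts invariants the implementation could actually violate
fc.assert(fc.property(fc.array(fc.integer()), (xs) => {
  const out = mySort(xs);
  for (let i = 1; i < out.length; i++) {
    expect(out[i - 1]).toBeLessThanOrEqual(out[i]); // ordered
  }
  expect(out).toHaveLength(xs.length); // same number of elements
}));
```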

Overview

This skill helps you write and review reliable tests by focusing on test philosophy, condition-based waiting, mocking strategy, and test isolation. It emphasizes behavior-driven testing, pragmatic integration tests, and avoiding fragile timing or mock-dependent assertions. Use it to increase confidence in real behavior while keeping tests maintainable and fast.

How this skill works

The skill reviews test structure and offers patterns: Arrange-Act-Assert, one action per test, and condition-based waiting instead of arbitrary sleeps. It shows where to use real dependencies versus mocks, how to design wrappers for third-party clients, and which resources must be cleaned up versus which are safe to leave. It also provides quick-reference fixes for common anti-patterns and red flags that should make you stop and reconsider.

When to use it

  • Writing new tests to ensure behavior rather than implementation details
  • Reviewing flaky CI tests that rely on setTimeout or fixed sleeps
  • Deciding whether to mock an external service or use a real instance
  • Designing cleanup procedures for long-lived resources (VMs, k8s, cloud)
  • Refactoring tests after production code changes to avoid brittle assertions

Best practices

  • Test behavior not implementation; refactors shouldn't break tests
  • Favor integration tests for confidence-to-cost ratio, unit tests for narrow logic
  • Wait for conditions with a poll/wait helper, never guess timing with sleeps
  • Mock your own wrappers, not third-party libraries directly
  • Avoid adding test-only methods to production code; use test utilities instead

Example use cases

  • Replace setTimeout-based waits with a waitFor(condition, description, timeout) helper
  • Create a thin wrapper around an HTTP client and mock that wrapper in tests
  • Write integration tests that exercise the real database but use unique identifiers to avoid order dependencies
  • Detect flaky tests by identifying assertions on mock elements or test IDs
  • Decide cleanup strategy: use product APIs for teardown, fall back to side-channel commands when necessary

FAQ

When is an arbitrary timeout acceptable in a test?

Only when the test verifies timing behavior (debounce, throttle, intervals) and the timeout is documented to match those intervals.

How do I know if my mocks are too complex?

If mock setup is longer than test logic, or tests break when mocks change, consider using real components or simpler integration tests.