test-data-management skill

This skill helps you generate realistic, GDPR-compliant test data at scale and anonymize PII, so tests run fast and reliably.

npx playbooks add skill proffesor-for-testing/agentic-qe --skill test-data-management

---
name: test-data-management
description: "Strategic test data generation, management, and privacy compliance. Use when creating test data, handling PII, ensuring GDPR/CCPA compliance, or scaling data generation for realistic testing scenarios."
category: specialized-testing
priority: high
tokenEstimate: 1000
agents: [qe-test-data-architect, qe-test-executor, qe-security-scanner]
implementation_status: optimized
optimization_version: 1.0
last_optimized: 2025-12-02
dependencies: []
quick_reference_card: true
tags: [test-data, faker, synthetic, gdpr, pii, anonymization, factories]
---

# Test Data Management

<default_to_action>
When creating or managing test data:
1. NEVER use production PII directly
2. GENERATE synthetic data with faker libraries
3. ANONYMIZE production data if used (mask, hash)
4. ISOLATE test data (transactions, per-test cleanup)
5. SCALE with batch generation (10k+ records/sec)

**Quick Data Strategy:**
- Unit tests: Minimal data (just enough)
- Integration: Realistic data (full complexity)
- Performance: Volume data (10k+ records)

**Critical Success Factors:**
- 40% of test failures stem from inadequate data
- GDPR fines reach up to €20M or 4% of annual global turnover, whichever is higher
- Never store production PII in test environments
</default_to_action>

## Quick Reference Card

### When to Use
- Creating test datasets
- Handling sensitive data
- Performance testing with volume
- GDPR/CCPA compliance

### Data Strategies
| Type | When | Size |
|------|------|------|
| **Minimal** | Unit tests | 1-10 records |
| **Realistic** | Integration | 100-1000 records |
| **Volume** | Performance | 10k+ records |
| **Edge cases** | Boundary testing | Targeted |
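
The "Edge cases" row can be made concrete with boundary-value analysis. The following is a minimal sketch, assuming an inclusive integer range; `boundaryValues` and `partitionByValidity` are hypothetical helpers for illustration, not part of the skill.

```javascript
// Hypothetical helper: boundary values for an inclusive integer range [min, max].
function boundaryValues(min, max) {
  if (!Number.isInteger(min) || !Number.isInteger(max) || min > max) {
    throw new RangeError('expected integers with min <= max');
  }
  // Just outside, on, and just inside each bound. The Set removes
  // duplicates when the range is very narrow.
  const candidates = [min - 1, min, min + 1, max - 1, max, max + 1];
  return [...new Set(candidates)].sort((a, b) => a - b);
}

// Split the values into those the system should accept and those it should reject.
function partitionByValidity(values, min, max) {
  return {
    valid: values.filter((v) => v >= min && v <= max),
    invalid: values.filter((v) => v < min || v > max),
  };
}

// Example range: an age field that must be between 18 and 90.
const ages = boundaryValues(18, 90);
const { valid, invalid } = partitionByValidity(ages, 18, 90);
```

Feeding `valid` into positive tests and `invalid` into rejection tests keeps the edge-case dataset small and targeted.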

### Privacy Techniques
| Technique | Use Case |
|-----------|----------|
| **Synthetic** | Generate fake data (preferred) |
| **Masking** | Partially hide values for display, e.g. `j***@example.com` |
| **Hashing** | One-way pseudonymization; use a salted or keyed hash for low-entropy values |
| **Tokenization** | Reversible with key |
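
Tokenization is listed in the table but has no example elsewhere in this skill. The sketch below is one possible shape, assuming an in-memory lookup table plays the role of the key; `TokenVault` is a hypothetical class, and a real vault would be an access-controlled store kept outside the test environment.

```javascript
// Hypothetical in-memory token vault (illustration only).
class TokenVault {
  constructor(prefix = 'tok') {
    this.prefix = prefix;
    this.valueToToken = new Map();
    this.tokenToValue = new Map();
    this.counter = 0;
  }

  // The same input always yields the same token, so joins across tables still work.
  tokenize(value) {
    const existing = this.valueToToken.get(value);
    if (existing !== undefined) return existing;
    this.counter += 1;
    const token = `${this.prefix}_${String(this.counter).padStart(6, '0')}`;
    this.valueToToken.set(value, token);
    this.tokenToValue.set(token, value);
    return token;
  }

  // Reversal needs access to the vault; the token carries no part of the original value.
  detokenize(token) {
    if (!this.tokenToValue.has(token)) {
      throw new Error(`unknown token: ${token}`);
    }
    return this.tokenToValue.get(token);
  }
}

const vault = new TokenVault('ssn');
const t1 = vault.tokenize('123-45-6789');
const t2 = vault.tokenize('123-45-6789'); // same value, same token
const t3 = vault.tokenize('987-65-4321'); // different value, different token
```

Only the tokens travel into the test environment; the vault itself stays wherever the original data is allowed to live.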

---

## Synthetic Data Generation

```javascript
import { faker } from '@faker-js/faker';

// Seed for reproducibility
faker.seed(123);

function generateUser() {
  return {
    id: faker.string.uuid(),
    email: faker.internet.email(),
    firstName: faker.person.firstName(),
    lastName: faker.person.lastName(),
    phone: faker.phone.number(),
    address: {
      street: faker.location.streetAddress(),
      city: faker.location.city(),
      zip: faker.location.zipCode()
    },
    createdAt: faker.date.past({ refDate: '2025-01-01T00:00:00.000Z' }) // fixed reference date keeps seeded output stable
  };
}

// Generate 1000 users
const users = Array.from({ length: 1000 }, generateUser);
```
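
To show why seeding matters without depending on any library, here is a dependency-free sketch. It swaps faker for a small hand-rolled generator in the mulberry32 style; `mulberry32`, `makeUserGenerator`, and the name list are illustrative assumptions, not part of faker or of this skill.

```javascript
// Small deterministic PRNG (mulberry32-style): same seed, same sequence.
function mulberry32(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

const FIRST_NAMES = ['Ada', 'Grace', 'Linus', 'Margaret', 'Alan'];

// Each generator owns its own PRNG and id counter, so runs do not interfere.
function makeUserGenerator(seed) {
  const rand = mulberry32(seed);
  let nextId = 1;
  return function generateUser() {
    const firstName = FIRST_NAMES[Math.floor(rand() * FIRST_NAMES.length)];
    const id = nextId++;
    return { id, firstName, email: `${firstName.toLowerCase()}.${id}@example.com` };
  };
}

// Two runs with the same seed produce identical datasets.
const runA = Array.from({ length: 50 }, makeUserGenerator(123));
const runB = Array.from({ length: 50 }, makeUserGenerator(123));
```

A failing test that used `runA` can be reproduced exactly by regenerating with the same seed, which is the property `faker.seed(123)` provides in the block above.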

---

## Test Data Builder Pattern

```typescript
import { faker } from '@faker-js/faker';

// Assumed shape of User; adjust to your domain model.
type Role = 'admin' | 'customer';

interface User {
  id: string;
  email: string;
  role: Role;
  permissions: string[];
}

class UserBuilder {
  private user: Partial<User> = {};

  asAdmin(): this {
    this.user.role = 'admin';
    this.user.permissions = ['read', 'write', 'delete'];
    return this;
  }

  asCustomer(): this {
    this.user.role = 'customer';
    this.user.permissions = ['read'];
    return this;
  }

  withEmail(email: string): this {
    this.user.email = email;
    return this;
  }

  // Fill every field that was not set explicitly with a generated default
  build(): User {
    return {
      id: this.user.id ?? faker.string.uuid(),
      email: this.user.email ?? faker.internet.email(),
      role: this.user.role ?? 'customer',
      permissions: this.user.permissions ?? ['read']
    };
  }
}

// Usage
const admin = new UserBuilder().asAdmin().withEmail('admin@example.com').build();
const customer = new UserBuilder().asCustomer().build();
```

---

## Data Anonymization

```javascript
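import { faker } from '@faker-js/faker';

// Masking: keep the first character of the local part, hide the rest
function maskEmail(email) {
  const [user, domain] = email.split('@');
  return `${user[0]}***@${domain}`;
}
// john@example.com → j***@example.com

function maskCreditCard(cc) {
  return `****-****-****-${cc.slice(-4)}`;
}
// 4242424242424242 → ****-****-****-4242

// Anonymize production data (prodUsers: rows loaded from a production snapshot, not shown)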
const anonymizedUsers = prodUsers.map(user => ({
  id: user.id, // Keep ID for relationships
  email: `user-${user.id}@example.com`, // Fake email
  firstName: faker.person.firstName(), // Generated
  phone: null, // Remove PII
  createdAt: user.createdAt // Keep non-PII
}));
```
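
Hashing is listed among the privacy techniques but is not shown in the snippet above. The following sketch uses a keyed hash (HMAC-SHA-256) from Node's built-in `crypto` module, loaded with CommonJS `require` so it runs as a plain Node script. `PSEUDONYM_KEY`, `pseudonymize`, and `anonymizeUser` are hypothetical names, and truncating the digest to 16 hex characters is an assumption made for readability.

```javascript
const { createHmac } = require('node:crypto');

// Hypothetical secret: in practice load it from a secret store and keep it
// out of the test environment so pseudonyms cannot be recomputed there.
const PSEUDONYM_KEY = 'replace-with-a-secret-from-your-vault';

// Keyed hash: deterministic, so equal inputs map to equal pseudonyms,
// but the mapping cannot be recomputed without the key.
function pseudonymize(value, key = PSEUDONYM_KEY) {
  const normalized = String(value).trim().toLowerCase();
  return createHmac('sha256', key).update(normalized).digest('hex').slice(0, 16);
}

function anonymizeUser(user) {
  return {
    id: user.id,                                            // keep ID for relationships
    email: `user-${pseudonymize(user.email)}@example.com`, // stable fake email
    ssnHash: pseudonymize(user.ssn),                        // hashed, original dropped
    createdAt: user.createdAt,                              // keep non-PII
  };
}

// Made-up sample row, not real data.
const sample = { id: 1, email: 'Jane.Doe@Example.com ', ssn: '123-45-6789', createdAt: '2024-01-01' };
const anonymized = anonymizeUser(sample);
```

Because the pseudonym is deterministic, the same email in two tables still maps to the same fake address, which preserves joins. Note that keyed hashes are still pseudonymized rather than anonymous data while the key exists.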

---

## Database Transaction Isolation

```javascript
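// Best practice: wrap each test in a transaction and roll it back afterwards.
// Assumes userService runs its queries on the same `db` connection.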
beforeEach(async () => {
  await db.beginTransaction();
});

afterEach(async () => {
  await db.rollbackTransaction(); // Auto cleanup!
});

test('user registration', async () => {
  const user = await userService.register({
    email: 'test@example.com'
  });
  expect(user.id).toBeDefined();
  // Automatic rollback after test - no cleanup needed
});
```
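
The `db.beginTransaction()` and `db.rollbackTransaction()` calls above depend on your database client. To make the isolation idea visible without a real database, here is a toy sketch; `InMemoryDb` is a hypothetical stand-in that snapshots its rows, not a real driver.

```javascript
// Hypothetical in-memory stand-in for a database connection.
class InMemoryDb {
  constructor() {
    this.rows = [];
    this.snapshot = null;
  }

  beginTransaction() {
    if (this.snapshot !== null) throw new Error('transaction already open');
    // Shallow-copy each row so later mutations do not touch the snapshot.
    this.snapshot = this.rows.map((row) => ({ ...row }));
  }

  insert(row) {
    const stored = { id: this.rows.length + 1, ...row };
    this.rows.push(stored);
    return stored;
  }

  rollbackTransaction() {
    if (this.snapshot === null) throw new Error('no open transaction');
    this.rows = this.snapshot; // discard everything written since begin
    this.snapshot = null;
  }
}

const db = new InMemoryDb();
db.insert({ email: 'seed@example.com' }); // baseline data outside any test

db.beginTransaction();                     // what beforeEach would do
const created = db.insert({ email: 'test@example.com' });
const countDuringTest = db.rows.length;
db.rollbackTransaction();                  // what afterEach would do
const countAfterTest = db.rows.length;
```

The row written inside the transaction is visible during the test and gone afterwards, while the baseline row survives.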

---

## Volume Data Generation

```javascript
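// Generate `count` users in batches. Assumes `faker` is imported as shown
// earlier and `db.users.insertMany` is your data layer's bulk insert.
async function generateLargeDataset(count = 10000) {
  const batchSize = 1000;
  const batches = Math.ceil(count / batchSize);

  for (let i = 0; i < batches; i++) {
    const offset = i * batchSize;
    // The last batch is smaller when count is not a multiple of batchSize
    const size = Math.min(batchSize, count - offset);
    const users = Array.from({ length: size }, (_, index) => ({
      id: offset + index,
      email: `user${offset + index}@example.com`,
      firstName: faker.person.firstName()
    }));

    await db.users.insertMany(users); // Batch insert
    console.log(`Batch ${i + 1}/${batches}`);
  }
}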
```
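
A variation on the batching idea is to separate generation from insertion with a generator function, so only one batch is in memory at a time and the sink can be swapped in tests. This is a sketch under assumptions: `userBatches` and `loadUsers` are hypothetical names, and `insertMany` stands for whatever bulk-insert function your data layer provides.

```javascript
// Yield batches lazily so only one batch is held in memory at a time.
function* userBatches(count, batchSize = 1000) {
  if (!Number.isInteger(count) || count < 0) {
    throw new RangeError('count must be a non-negative integer');
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  for (let start = 0; start < count; start += batchSize) {
    // The final batch shrinks to whatever is left.
    const size = Math.min(batchSize, count - start);
    yield Array.from({ length: size }, (_, i) => ({
      id: start + i,
      email: `user${start + i}@example.com`,
    }));
  }
}

// `insertMany` is a hypothetical async sink, for example a bulk insert.
async function loadUsers(count, insertMany, batchSize = 1000) {
  let inserted = 0;
  for (const batch of userBatches(count, batchSize)) {
    await insertMany(batch);
    inserted += batch.length;
  }
  return inserted;
}

// 2,500 records split into two full batches and one partial batch.
const batchSizes = [...userBatches(2500)].map((batch) => batch.length);
```

Passing a no-op `insertMany` makes the batching logic testable without a database.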

---

## Agent-Driven Data Generation

```typescript
// High-speed generation with constraints
await Task("Generate Test Data", {
  schema: 'ecommerce',
  count: { users: 10000, products: 500, orders: 5000 },
  preserveReferentialIntegrity: true,
  constraints: {
    age: { min: 18, max: 90 },
    roles: ['customer', 'admin']
  }
}, "qe-test-data-architect");

// GDPR-compliant anonymization
await Task("Anonymize Production Data", {
  source: 'production-snapshot',
  piiFields: ['email', 'phone', 'ssn'],
  method: 'pseudonymization',
  retainStructure: true
}, "qe-test-data-architect");
```
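
The `Task(...)` calls above delegate the work to an agent. What `preserveReferentialIntegrity` might mean in practice can be sketched by hand: generate parent rows first, then child rows that only reference existing parent ids. The functions `generateEcommerceData` and `findOrphanOrders` and the field names are illustrative assumptions, not the agent's actual implementation.

```javascript
// Generate parents first, then children that only reference existing parent ids.
function generateEcommerceData({ users, orders }) {
  const userRows = Array.from({ length: users }, (_, i) => ({
    id: i + 1,
    email: `user${i + 1}@example.com`,
  }));
  if (orders > 0 && userRows.length === 0) {
    throw new Error('orders need at least one user to reference');
  }
  const orderRows = Array.from({ length: orders }, (_, i) => ({
    id: i + 1,
    // Round-robin assignment keeps the sketch deterministic.
    userId: userRows[i % userRows.length].id,
    totalCents: 1000 + i,
  }));
  return { users: userRows, orders: orderRows };
}

// Returns orders whose userId has no matching user (should be empty).
function findOrphanOrders(data) {
  const userIds = new Set(data.users.map((u) => u.id));
  return data.orders.filter((o) => !userIds.has(o.userId));
}

const data = generateEcommerceData({ users: 100, orders: 250 });
const orphans = findOrphanOrders(data);
```

Running a check like `findOrphanOrders` after generation, or after anonymizing a snapshot, catches broken foreign keys before they surface as confusing test failures.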

---

## Agent Coordination Hints

### Memory Namespace
```
aqe/test-data-management/
├── schemas/*            - Data schemas
├── generators/*         - Generator configs
├── anonymization/*      - PII handling rules
└── fixtures/*           - Reusable fixtures
```

### Fleet Coordination
```typescript
const dataFleet = await FleetManager.coordinate({
  strategy: 'test-data-generation',
  agents: [
    'qe-test-data-architect',  // Generate data
    'qe-test-executor',        // Execute with data
    'qe-security-scanner'      // Validate no PII exposure
  ],
  topology: 'sequential'
});
```
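
The last step in the fleet, validating that no PII is exposed, can be approximated with a simple pattern scan over generated records. This is a rough sketch, not how `qe-security-scanner` works: `PII_PATTERNS` and `findPii` are hypothetical, the regular expressions are illustrative only, and addresses on the reserved `example.com`, `example.org`, and `example.net` domains are treated as safe by assumption.

```javascript
// Illustrative patterns only; real scanners need broader, locale-aware rules.
const PII_PATTERNS = {
  // Any email address except those on reserved example domains.
  email: /[A-Za-z0-9._%+-]+@(?!example\.(?:com|org|net)\b)[A-Za-z0-9.-]+\.[A-Za-z]{2,}/,
  usSsn: /\b\d{3}-\d{2}-\d{4}\b/,
  cardNumber: /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/,
};

// Walk every string value in a record and report which patterns match.
function findPii(record, path = '') {
  const findings = [];
  for (const [key, value] of Object.entries(record)) {
    const fieldPath = path ? `${path}.${key}` : key;
    if (value !== null && typeof value === 'object') {
      findings.push(...findPii(value, fieldPath)); // recurse into nested objects
    } else if (typeof value === 'string') {
      for (const [kind, pattern] of Object.entries(PII_PATTERNS)) {
        if (pattern.test(value)) findings.push({ field: fieldPath, kind });
      }
    }
  }
  return findings;
}

// Made-up records for illustration.
const clean = { id: 1, email: 'user-1@example.com', card: '****-****-****-4242' };
const leaky = { id: 2, email: 'jane@real-company.test', profile: { note: 'SSN 123-45-6789' } };

const cleanFindings = findPii(clean);
const leakyFindings = findPii(leaky);
```

A test suite could fail the build whenever `findPii` returns a non-empty list for any fixture.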

---

## Related Skills
- [database-testing](../database-testing/) - Schema and integrity testing
- [compliance-testing](../compliance-testing/) - GDPR/CCPA compliance
- [performance-testing](../performance-testing/) - Volume data for perf tests

---

## Remember

**Test data is infrastructure, not an afterthought.** 40% of test failures are caused by inadequate test data. Poor data = poor tests.

**Never use production PII directly.** GDPR fines can reach €20M or 4% of annual global turnover, whichever is higher. Always use synthetic data or properly anonymized production snapshots.

**With Agents:** `qe-test-data-architect` generates 10k+ records/sec with realistic patterns, relationships, and constraints. Agents ensure GDPR/CCPA compliance automatically and eliminate test data bottlenecks.

## Overview

This skill provides strategic test data generation, management, and privacy-compliant workflows for quality engineering. It focuses on synthetic data generation, safe anonymization of production snapshots, scalable volume generation, and agent-driven orchestration. Use it to eliminate test-data bottlenecks while ensuring GDPR/CCPA compliance and realistic test coverage.

## How this skill works

The skill generates synthetic datasets using faker-style generators and builder patterns to produce realistic records and edge cases. It includes anonymization techniques (masking, hashing, tokenization) for safely transforming production snapshots and supports transaction-based isolation to avoid persistent test state. Agents coordinate batch generation, preserve referential integrity, and enforce PII handling rules across environments.

## When to use it

- Creating realistic datasets for integration or system tests
- Generating high-volume data for performance and load testing
- Handling or transforming production snapshots while protecting PII
- Scaling test data generation across CI pipelines and agent fleets
- Testing edge cases and boundary conditions with targeted data

## Best practices

- Never use raw production PII in test environments; prefer synthetic or anonymized data
- Use minimal datasets for unit tests, realistic sets for integration, and 10k+ records for performance
- Seed generators for reproducibility and use builder patterns for flexible fixtures
- Isolate tests with DB transactions or per-test cleanup to avoid state leakage
- Batch inserts and parallelize generation for efficient volume data creation

## Example use cases

- Create 1,000 realistic users and orders for an integration test suite with preserved relationships
- Anonymize a production snapshot by masking emails, hashing SSNs, and removing phone numbers
- Generate 10k+ user records in batches for stress testing API throughput
- Use a UserBuilder to create admin and customer personas for role-based tests
- Coordinate agents to generate schema-conformant datasets and verify no PII leaks

## FAQ

### Can I use production data for tests?

Only if it is properly anonymized or pseudonymized; never insert raw production PII into test environments.

### How do I scale to tens of thousands of records?

Generate and insert records in batches (e.g., 1k records per batch), parallelize independent tasks, and preserve referential integrity during insertion.

### What anonymization methods are recommended?

Prefer pseudonymization or irreversible hashing for sensitive fields, masking for display, and tokenization when a reversible mapping is required.