
data-anonymizer skill

/data-anonymizer

This skill detects and masks PII in text and CSV data, offering multiple masking strategies and reversible tokenization for secure analysis.

npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill data-anonymizer

---
name: data-anonymizer
description: Detect and mask PII (names, emails, phones, SSN, addresses) in text and CSV files. Multiple masking strategies with reversible tokenization option.
---

# Data Anonymizer

Detect and mask personally identifiable information (PII) in text documents and structured data. Supports multiple masking strategies and can process CSV files at scale.

## Quick Start

```python
from scripts.data_anonymizer import DataAnonymizer

# Anonymize text
anonymizer = DataAnonymizer()
result = anonymizer.anonymize("Contact John Smith at john.smith@example.com or 555-123-4567")
print(result)
# "Contact [NAME] at [EMAIL] or [PHONE]"

# Anonymize CSV
anonymizer.anonymize_csv("customers.csv", "customers_anon.csv")
```

## Features

- **PII Detection**: Names, emails, phones, SSNs, addresses, credit cards, dates of birth, IP addresses
- **Multiple Strategies**: Mask, redact, hash, fake data replacement
- **CSV Processing**: Anonymize specific columns or auto-detect
- **Reversible Tokens**: Optional mapping for de-anonymization
- **Custom Patterns**: Add your own PII patterns
- **Audit Report**: List all detected PII with locations

## API Reference

### Initialization

```python
anonymizer = DataAnonymizer(
    strategy="mask",      # mask, redact, hash, fake
    reversible=False      # Enable token mapping
)
```

### Text Anonymization

```python
# Basic anonymization
result = anonymizer.anonymize(text)

# With specific PII types
result = anonymizer.anonymize(text, pii_types=["email", "phone"])

# Get detected PII report
result, report = anonymizer.anonymize(text, return_report=True)
```

### Masking Strategies

```python
text = "Email joe@example.com, call 555-1234"

# Mask (default) - replace with type labels
anonymizer.strategy = "mask"
# "Email [EMAIL], call [PHONE]"

# Redact - replace with asterisks
anonymizer.strategy = "redact"
# "Email ***************, call ********"

# Hash - replace with hash
anonymizer.strategy = "hash"
# "Email a1b2c3d4, call e5f6a7b8"

# Fake - replace with realistic fake data
anonymizer.strategy = "fake"
# "Email kmartin@example.org, call 555-9876"
```
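
The following is a minimal sketch of how the `mask`, `redact`, and `hash` strategies could be applied to regex matches. It is not the skill's actual implementation: `PATTERNS`, `apply_strategy`, and `anonymize_sketch` are hypothetical names, the patterns are deliberately simplified, and the `fake` strategy is left out to keep the example free of third-party dependencies.

```python
import hashlib
import re

# Hypothetical, simplified patterns; the skill's real patterns may differ.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b(?:\d{3}-)?\d{3}-\d{4}\b"),
}


def apply_strategy(value: str, label: str, strategy: str) -> str:
    """Return the replacement text for one detected value."""
    if strategy == "mask":
        return f"[{label}]"                      # type label
    if strategy == "redact":
        return "*" * len(value)                  # same-length asterisks
    if strategy == "hash":
        digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
        return digest[:8]                        # truncated hex digest
    raise ValueError(f"unknown strategy: {strategy}")


def anonymize_sketch(text: str, strategy: str = "mask") -> str:
    """Apply one strategy to every match of every pattern."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(
            lambda m, label=label: apply_strategy(m.group(), label, strategy),
            text,
        )
    return text


sample = "Email jane@example.com, call 555-1234"
print(anonymize_sketch(sample, "mask"))
print(anonymize_sketch(sample, "redact"))
print(anonymize_sketch(sample, "hash"))
```

An unsalted, truncated hash like this one is easy to reverse by brute force for low-entropy values such as phone numbers, so treat it as illustration only.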

### CSV Processing

```python
# Auto-detect PII columns
anonymizer.anonymize_csv("input.csv", "output.csv")

# Specify columns
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    columns=["name", "email", "phone"]
)

# Different strategies per column
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    column_strategies={
        "name": "fake",
        "email": "hash",
        "ssn": "redact"
    }
)
```
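
To illustrate the per-column idea, here is a small sketch using only the standard library `csv` module. The helper names (`redact`, `hash_value`, `STRATEGIES`, `anonymize_rows`) are assumptions made for this example and say nothing about how `anonymize_csv` is implemented internally.

```python
import csv
import hashlib
import io


def redact(value: str) -> str:
    return "*" * len(value)


def hash_value(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]


# Hypothetical strategy registry for this sketch.
STRATEGIES = {"redact": redact, "hash": hash_value}


def anonymize_rows(reader, column_strategies):
    """Yield rows with the chosen strategy applied to each listed column."""
    for row in reader:
        for column, strategy in column_strategies.items():
            if column in row:
                row[column] = STRATEGIES[strategy](row[column])
        yield row


# In-memory CSV so the example runs without touching the filesystem.
source = io.StringIO("name,email,ssn\nJane Doe,jane@example.com,123-45-6789\n")
reader = csv.DictReader(source)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(anonymize_rows(reader, {"email": "hash", "ssn": "redact"}))
print(out.getvalue())
```

Columns not named in the mapping (here `name`) pass through unchanged, which is why explicitly listing critical columns matters.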

### Reversible Anonymization

```python
anonymizer = DataAnonymizer(reversible=True)

# Anonymize with token mapping
result = anonymizer.anonymize("John Smith: john.smith@example.com")
mapping = anonymizer.get_mapping()

# Save mapping securely
anonymizer.save_mapping("mapping.json", encrypt=True, password="secret")

# Later, de-anonymize
anonymizer.load_mapping("mapping.json", password="secret")
original = anonymizer.deanonymize(result)
```
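
Conceptually, reversible mode needs a two-way mapping between tokens and original values. The sketch below shows that idea with a hypothetical `TokenVault` class that handles emails only. It is an assumption about the general technique, not the skill's code, and it omits the encryption step that `save_mapping(..., encrypt=True)` refers to.

```python
import re
import secrets

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class TokenVault:
    """Hypothetical in-memory mapping between tokens and original values."""

    def __init__(self):
        self.token_to_value = {}
        self.value_to_token = {}

    def tokenize(self, value: str, label: str) -> str:
        # Reuse the token if this value was seen before, so output is consistent.
        if value not in self.value_to_token:
            token = f"[{label}_{secrets.token_hex(4)}]"
            self.value_to_token[value] = token
            self.token_to_value[token] = value
        return self.value_to_token[value]

    def anonymize(self, text: str) -> str:
        return EMAIL.sub(lambda m: self.tokenize(m.group(), "EMAIL"), text)

    def deanonymize(self, text: str) -> str:
        for token, value in self.token_to_value.items():
            text = text.replace(token, value)
        return text


text = "Reply to jane@example.com; cc jane@example.com"
vault = TokenVault()
masked = vault.anonymize(text)
print(masked)
print(vault.deanonymize(masked) == text)
```

Because the mapping contains the original PII, it is as sensitive as the source data and should be stored encrypted and access-controlled.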

### Custom Patterns

```python
# Add custom PII pattern
anonymizer.add_pattern(
    name="employee_id",
    pattern=r"EMP-\d{6}",
    label="[EMPLOYEE_ID]"
)
```

## CLI Usage

```bash
# Anonymize text file
python data_anonymizer.py --input document.txt --output document_anon.txt

# Anonymize CSV
python data_anonymizer.py --input customers.csv --output customers_anon.csv

# Specific strategy
python data_anonymizer.py --input data.csv --output anon.csv --strategy fake

# Generate audit report
python data_anonymizer.py --input document.txt --output document_anon.txt --report audit.json

# Specific PII types only
python data_anonymizer.py --input doc.txt --output doc_anon.txt --types email phone ssn
```

### CLI Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--input` | Input file | Required |
| `--output` | Output file | Required |
| `--strategy` | Masking strategy | mask |
| `--types` | PII types to detect | all |
| `--columns` | CSV columns to process | auto |
| `--report` | Generate audit report | - |
| `--reversible` | Enable token mapping | False |

## Supported PII Types

| Type | Examples | Pattern |
|------|----------|---------|
| `name` | John Smith, Mary Johnson | NLP-based |
| `email` | user@example.com | Regex |
| `phone` | 555-123-4567, (555) 123-4567 | Regex |
| `ssn` | 123-45-6789 | Regex |
| `credit_card` | 4111-1111-1111-1111 | Regex + Luhn |
| `address` | 123 Main St, City, ST 12345 | NLP + Regex |
| `date_of_birth` | 01/15/1990, January 15, 1990 | Regex |
| `ip_address` | 192.168.1.1 | Regex |
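
The "Regex + Luhn" entry for `credit_card` suggests that regex candidates are filtered with the Luhn checksum to cut false positives. A standard Luhn check can be sketched as follows; the function name `luhn_valid` is chosen for this example and is not taken from the skill.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    candidate = number.replace("-", "").replace(" ", "")
    if len(candidate) < 2 or not (candidate.isascii() and candidate.isdigit()):
        return False

    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(candidate)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111-1111-1111-1111"))  # True: well-known test number
print(luhn_valid("4111-1111-1111-1112"))  # False: last digit changed
```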

## Examples

### Anonymize Customer Support Logs

```python
anonymizer = DataAnonymizer(strategy="mask")

log = """
Ticket #1234: Customer John Doe (john.doe@example.com) called about
billing issue. SSN on file: 123-45-6789. Callback number: 555-867-5309.
Address: 123 Oak Street, Springfield, IL 62701.
"""

result = anonymizer.anonymize(log)
print(result)
# Ticket #1234: Customer [NAME] ([EMAIL]) called about
# billing issue. SSN on file: [SSN]. Callback number: [PHONE].
# Address: [ADDRESS].
```

### GDPR Compliance for Database Export

```python
anonymizer = DataAnonymizer(strategy="hash")

# Consistent hashing for joins
anonymizer.anonymize_csv(
    "users.csv",
    "users_anon.csv",
    columns=["email", "name", "phone"]
)

anonymizer.anonymize_csv(
    "orders.csv",
    "orders_anon.csv",
    columns=["customer_email"]  # Same hash as users.email
)
```
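
Joins only survive hashing if the same input always yields the same output. The skill does not document whether its `hash` strategy is keyed or salted, so the sketch below is an assumption: it uses a keyed HMAC-SHA256 with light normalization, and `pseudonymize` and `KEY` are hypothetical names.

```python
import hashlib
import hmac

# Placeholder key for the example; in practice load it from a secret store.
KEY = b"demo-key-change-me"


def pseudonymize(value: str, key: bytes) -> str:
    """Deterministic keyed pseudonym: same value and key give the same token."""
    normalized = value.strip().lower()  # so "Jane@Example.com " joins with "jane@example.com"
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


users_email = pseudonymize("jane@example.com", KEY)
orders_email = pseudonymize(" Jane@Example.com ", KEY)
print(users_email == orders_email)  # True: the two tables can still be joined
```

Using a secret key means someone without the key cannot rebuild the pseudonyms by hashing guessed emails, which a plain unsalted hash would allow.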

### Generate Test Data from Production

```python
anonymizer = DataAnonymizer(strategy="fake")

# Replace real PII with realistic fake data
anonymizer.anonymize_csv(
    "production_data.csv",
    "test_data.csv"
)

# Test data has same structure but fake PII
```

## Dependencies

```
pandas>=2.0.0
faker>=18.0.0
```

## Limitations

- Name detection may miss unusual names
- Address detection works best for US formats
- Custom patterns may be needed for domain-specific PII
- Fake data replacement doesn't preserve exact format

Overview

This skill detects and masks personally identifiable information (PII) in free text and CSV files. It supports multiple masking strategies (mask, redact, hash, fake) and optional reversible tokenization for safe re-identification. The tool is designed for GDPR/CCPA use cases, audit reporting, and generating realistic test data from production exports.

How this skill works

The anonymizer scans input text or CSV columns using a mix of regex patterns and NLP-based heuristics to find names, emails, phones, SSNs, addresses, credit cards, IP addresses, and dates. Detected items are replaced according to the chosen strategy (type labels, asterisks, deterministic hash, or realistic fake values). When reversible mode is enabled, the skill keeps a token mapping that can be saved to an encrypted file and later loaded to deanonymize data.
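
The detect-then-replace flow described above can be sketched with plain regexes. This is an illustrative assumption only: `PATTERNS`, `detect`, and `mask` are hypothetical names, just three regex-based types are covered, and overlapping matches are not resolved.

```python
import re

# Simplified, hypothetical patterns for three of the regex-based types.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def detect(text: str) -> list:
    """Return an audit-style list of findings with character offsets."""
    findings = []
    for pii_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({
                "type": pii_type,
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    return sorted(findings, key=lambda f: f["start"])


def mask(text: str, findings: list) -> str:
    """Replace findings with type labels, working right to left so offsets stay valid."""
    for f in sorted(findings, key=lambda f: f["start"], reverse=True):
        text = text[:f["start"]] + f"[{f['type'].upper()}]" + text[f["end"]:]
    return text


sample = "SSN 123-45-6789 from 192.168.1.1, mail jane@example.com"
report = detect(sample)
print(mask(sample, report))
for finding in report:
    print(finding["type"], finding["start"], finding["end"])
```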

When to use it

  • Preparing database or log exports for external sharing or analytics
  • Removing PII from text documents, support tickets, or transcripts
  • Creating realistic test data from production while preserving referential integrity
  • Complying with data protection regulations before data processing or third-party transfers
  • Generating audit reports showing which PII was found and where

Best practices

  • Choose hashing for consistent pseudonyms across datasets to preserve joins while protecting raw identifiers
  • Use reversible mode only when you can securely store and encrypt the mapping file
  • Validate auto-detected CSV columns and specify critical columns explicitly to avoid under- or over-anonymization
  • Combine strategies per column (e.g., fake names, hash emails, redact SSNs) to balance utility and risk
  • Add custom regex patterns for domain-specific identifiers not covered by defaults

Example use cases

  • Anonymize customer support logs to remove names, emails, phones and addresses before sharing with an analytics team
  • Process CSV exports for GDPR requests by masking sensitive columns and producing an audit report
  • Generate non-production test datasets by replacing real PII with realistic fake values while maintaining schema and referential keys
  • Consistently hash email addresses across user and order datasets to enable privacy-preserving joins
  • Run a pre-release scan on documentation to redact credit cards and social security numbers

FAQ

Can I deanonymize data later?

Yes. Enable reversible mode to create a token mapping that can be saved encrypted and used to restore originals.

Does CSV processing detect columns automatically?

Yes. The skill can auto-detect likely PII columns, but you should confirm or explicitly list columns for critical exports.

Which PII types are supported out of the box?

Built-in types include names, emails, phones, SSNs, addresses, credit cards, dates, and IPs; you can add custom patterns.