
dataset-comparer skill

/dataset-comparer

This skill compares two datasets to identify added, removed, and changed rows, with optional key-based matching and reports.

npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill dataset-comparer


SKILL.md
---
name: dataset-comparer
description: Compare two datasets to find added/removed rows and changed values. Use for data validation, ETL verification, or tracking changes.
---

# Dataset Comparer

Compare two CSV/Excel datasets to identify differences, additions, deletions, and value changes.

## Features

- **Row Comparison**: Find added, removed, and matching rows
- **Value Changes**: Detect changed values in matching rows
- **Column Comparison**: Identify schema differences
- **Statistics**: Summary of differences
- **Diff Reports**: HTML, CSV, and JSON output
- **Flexible Matching**: Compare by key columns or row position

## Quick Start

```python
from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("old_data.csv", "new_data.csv")

# Compare by key column
diff = comparer.compare(key_columns=["id"])

print(f"Added rows: {diff['added_count']}")
print(f"Removed rows: {diff['removed_count']}")
print(f"Changed rows: {diff['changed_count']}")

# Generate report
comparer.generate_report("diff_report.html")
```

## CLI Usage

```bash
# Basic comparison
python dataset_comparer.py --old old.csv --new new.csv

# Compare by key column
python dataset_comparer.py --old old.csv --new new.csv --key id

# Multiple key columns
python dataset_comparer.py --old old.csv --new new.csv --key id,date

# Generate HTML report
python dataset_comparer.py --old old.csv --new new.csv --key id --report diff.html

# Export differences to CSV
python dataset_comparer.py --old old.csv --new new.csv --key id --output diff.csv

# JSON output
python dataset_comparer.py --old old.csv --new new.csv --key id --json

# Ignore specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --ignore updated_at,modified_date

# Compare only specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --columns name,email,status
```

## API Reference

### DatasetComparer Class

```python
class DatasetComparer:
    def __init__(self)

    # Data loading
    def load(self, old_path: str, new_path: str) -> 'DatasetComparer'
    def load_dataframes(self, old_df: pd.DataFrame,
                        new_df: pd.DataFrame) -> 'DatasetComparer'

    # Comparison
    def compare(self, key_columns: list | None = None,
                ignore_columns: list | None = None,
                compare_columns: list | None = None) -> dict

    # Detailed results
    def get_added_rows(self) -> pd.DataFrame
    def get_removed_rows(self) -> pd.DataFrame
    def get_changed_rows(self) -> pd.DataFrame
    def get_unchanged_rows(self) -> pd.DataFrame

    # Schema comparison
    def compare_schema(self) -> dict

    # Reports
    def generate_report(self, output: str, format: str = "html") -> str
    def to_dataframe(self) -> pd.DataFrame
    def summary(self) -> str
```

## Comparison Methods

### Key-Based Comparison

Compare rows by matching key columns (like primary keys):

```python
diff = comparer.compare(key_columns=["customer_id"])

# Multiple keys for composite matching
diff = comparer.compare(key_columns=["order_id", "product_id"])
```

### Position-Based Comparison

Compare rows by their position (row number):

```python
diff = comparer.compare()  # No keys = positional comparison
```
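Under the hood, positional matching amounts to aligning the two frames on their row index. A minimal illustration with plain pandas (not this skill's internals), assuming two equally-shaped frames:

```python
import pandas as pd

# Rows are matched purely by position (row 0 vs row 0, row 1 vs row 1, ...)
old = pd.DataFrame({"name": ["Ann", "Bob"], "qty": [1, 2]})
new = pd.DataFrame({"name": ["Ann", "Bob"], "qty": [1, 5]})

# pandas' built-in compare() reports per-cell differences for aligned frames
changed = old.compare(new)  # row 1, column "qty": self=2, other=5
```

Note that positional comparison only works when both frames have the same shape; an inserted row shifts everything after it and shows up as a cascade of changes, which is why key-based matching is preferred when a stable key exists.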

## Output Format

### Comparison Result

```python
{
    "summary": {
        "old_rows": 1000,
        "new_rows": 1050,
        "added_count": 75,
        "removed_count": 25,
        "changed_count": 50,
        "unchanged_count": 900,
        "total_differences": 150
    },
    "schema_changes": {
        "added_columns": ["new_field"],
        "removed_columns": ["old_field"],
        "type_changes": [
            {"column": "amount", "old_type": "int64", "new_type": "float64"}
        ]
    },
    "key_columns": ["id"],
    "compared_columns": ["name", "email", "status"],
    "ignored_columns": ["updated_at"]
}
```

### Changed Row Details

```python
changes = comparer.get_changed_rows()

# Returns DataFrame with columns:
# _key: Key value(s) for the row
# _column: Column that changed
# _old_value: Original value
# _new_value: New value
```
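For illustration, the same long format (one record per changed cell) can be assembled from two key-indexed frames with plain pandas; this is a sketch of the shape of the output, not the skill's internals:

```python
import pandas as pd

old = pd.DataFrame({"id": [1, 2], "email": ["a@x", "b@x"],
                    "status": ["on", "on"]}).set_index("id")
new = pd.DataFrame({"id": [1, 2], "email": ["a@x", "b@y"],
                    "status": ["on", "off"]}).set_index("id")

# One record per changed cell: key, column, old value, new value
records = []
for col in old.columns:
    mask = old[col].ne(new[col])
    for key in old.index[mask]:
        records.append({"_key": key, "_column": col,
                        "_old_value": old.at[key, col],
                        "_new_value": new.at[key, col]})
changes = pd.DataFrame(records)  # two rows: id 2 changed email and status
```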

## Schema Comparison

Compare column structure:

```python
schema = comparer.compare_schema()

# Returns:
{
    "old_columns": ["id", "name", "old_field"],
    "new_columns": ["id", "name", "new_field"],
    "common_columns": ["id", "name"],
    "added_columns": ["new_field"],
    "removed_columns": ["old_field"],
    "type_changes": [
        {"column": "price", "old_type": "int64", "new_type": "float64"}
    ],
    "old_row_count": 1000,
    "new_row_count": 1050
}
```
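The fields above can be derived directly from the two frames' columns and dtypes. A minimal sketch with plain pandas (illustrative, not the skill's code):

```python
import pandas as pd

old = pd.DataFrame({"id": [1], "name": ["a"], "price": [10]})
new = pd.DataFrame({"id": [1], "name": ["a"], "price": [10.5], "sku": ["x"]})

common = [c for c in old.columns if c in new.columns]
schema = {
    "added_columns": [c for c in new.columns if c not in old.columns],
    "removed_columns": [c for c in old.columns if c not in new.columns],
    # A dtype mismatch on a shared column becomes a type_changes entry
    "type_changes": [
        {"column": c, "old_type": str(old[c].dtype), "new_type": str(new[c].dtype)}
        for c in common if old[c].dtype != new[c].dtype
    ],
}
```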

## Filtering Options

### Ignore Columns

Skip certain columns during comparison:

```python
diff = comparer.compare(
    key_columns=["id"],
    ignore_columns=["updated_at", "modified_by", "timestamp"]
)
```

### Compare Specific Columns

Only compare selected columns:

```python
diff = comparer.compare(
    key_columns=["id"],
    compare_columns=["name", "email", "status"]  # Only these columns
)
```

## Report Formats

### HTML Report

```python
comparer.generate_report("diff_report.html", format="html")
```

Features:
- Summary statistics
- Interactive tables
- Color-coded changes (green=added, red=removed, yellow=changed)
- Schema comparison section

### CSV Export

```python
comparer.generate_report("diff_report.csv", format="csv")
```

Includes all differences in tabular format.

### JSON Output

```python
comparer.generate_report("diff_report.json", format="json")
```

Complete diff data in JSON format.

## Example Workflows

### Data Migration Validation

```python
comparer = DatasetComparer()
comparer.load("source_data.csv", "migrated_data.csv")

diff = comparer.compare(key_columns=["id"])

if diff["summary"]["total_differences"] == 0:
    print("Migration successful - no differences!")
else:
    print(f"Found {diff['summary']['total_differences']} differences")
    comparer.generate_report("migration_issues.html")
```

### ETL Pipeline Verification

```python
comparer = DatasetComparer()
comparer.load("yesterday.csv", "today.csv")

diff = comparer.compare(
    key_columns=["transaction_id"],
    ignore_columns=["processing_timestamp"]
)

# Check for unexpected changes
changed = comparer.get_changed_rows()
if len(changed) > 0:
    print("Warning: Historical records changed!")
    print(changed)
```

### Incremental Update Detection

```python
comparer = DatasetComparer()
comparer.load("last_sync.csv", "current.csv")

diff = comparer.compare(key_columns=["customer_id"])

# Get new records for processing
new_records = comparer.get_added_rows()
print(f"New records to process: {len(new_records)}")

# Get deleted records
deleted = comparer.get_removed_rows()
print(f"Records to deactivate: {len(deleted)}")
```

### Schema Change Detection

```python
comparer = DatasetComparer()
comparer.load("v1_export.csv", "v2_export.csv")

schema = comparer.compare_schema()

if schema["added_columns"]:
    print(f"New columns: {schema['added_columns']}")

if schema["removed_columns"]:
    print(f"Removed columns: {schema['removed_columns']}")

if schema["type_changes"]:
    for change in schema["type_changes"]:
        print(f"Type change: {change['column']} "
              f"({change['old_type']} -> {change['new_type']})")
```

## Large Dataset Tips

For very large datasets:

```python
# Load both files (they must fit in memory together)
comparer = DatasetComparer()
comparer.load("large_old.csv", "large_new.csv")

# Use key column for efficient matching
diff = comparer.compare(key_columns=["id"])

# Export only differences (not full data)
comparer.generate_report("diff_only.csv", format="csv")
```
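If the files are too large to load at once, one hypothetical pre-pass (not part of this skill's API) is to stream only the key column with pandas' `chunksize` reader and compute added/removed ids before loading full rows:

```python
import io
import pandas as pd

def key_set(source, key, chunksize=100_000):
    """Stream only the key column so memory use stays flat."""
    keys = set()
    for chunk in pd.read_csv(source, usecols=[key], chunksize=chunksize):
        keys.update(chunk[key])
    return keys

# In-memory stand-ins for large CSV files (illustrative data)
old_csv = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
new_csv = io.StringIO("id,name\n2,b\n3,x\n4,d\n")

old_keys = key_set(old_csv, "id")
new_keys = key_set(new_csv, "id")
added_ids = new_keys - old_keys      # ids only in the new file
removed_ids = old_keys - new_keys    # ids only in the old file
```

With the added/removed ids in hand, you could then load and compare only the matching rows in full.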

## Dependencies

- pandas>=2.0.0
- numpy>=1.24.0

Overview

This skill compares two tabular datasets (CSV/Excel or DataFrame) to surface added, removed, and changed rows, plus schema differences and summary statistics. It produces machine- and human-readable outputs (HTML, CSV, JSON) and supports both key-based and position-based matching for flexible validation. Use it to verify ETL jobs, migrations, and incremental updates quickly and reliably.

How this skill works

Load the old and new datasets or supply DataFrames, then run a comparison using optional key columns, ignored columns, or a selected subset of columns. The tool identifies added and removed rows, detects per-column value changes for matching rows, and compares schemas including column additions, removals, and type changes. Results are exposed as structured summaries, detailed changed-row records, and exportable diff reports.
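The key-based matching step described above can be sketched with a plain pandas outer merge; this illustrates the logic, not the skill's actual implementation:

```python
import pandas as pd

# Toy old/new snapshots keyed by "id" (illustrative data)
old = pd.DataFrame({"id": [1, 2, 3], "status": ["a", "b", "c"]})
new = pd.DataFrame({"id": [2, 3, 4], "status": ["b", "x", "d"]})

# indicator=True labels each row's origin: left_only / right_only / both
merged = old.merge(new, on="id", how="outer",
                   suffixes=("_old", "_new"), indicator=True)

removed = merged[merged["_merge"] == "left_only"]         # id 1
added = merged[merged["_merge"] == "right_only"]          # id 4
both = merged[merged["_merge"] == "both"]
changed = both[both["status_old"] != both["status_new"]]  # id 3: "c" -> "x"
```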

When to use it

  • Validate data migrations and ensure no unintended changes after import or export.
  • Verify ETL pipeline outputs between stages or runs to catch regressions.
  • Detect incremental updates for downstream processing or synchronization tasks.
  • Track schema evolution across versions or data exports.
  • Generate audit-ready reports for data quality reviews or compliance.

Best practices

  • Prefer key-based comparisons using stable primary or composite keys for accurate matching.
  • Ignore volatile timestamp or processing columns to avoid noisy differences.
  • Limit compare_columns when you only care about a subset of fields to reduce noise.
  • Use export formats (CSV/JSON/HTML) to share findings with engineers or stakeholders.
  • For very large datasets, compare by key and export only the differences to save memory and time.

Example use cases

  • Post-migration verification: compare source and migrated tables and produce an HTML report of any mismatches.
  • ETL regression checks: run nightly job comparisons and alert if historical rows changed unexpectedly.
  • Incremental sync: identify newly added records to ingest and deleted records to deactivate.
  • Schema monitoring: detect when a new column appears or an existing column changes type.
  • Data auditing: produce CSV/JSON exports of all differences for investigation or ticketing.

FAQ

Can I compare by multiple key columns?

Yes. Provide a list of key columns for composite matching so rows are matched by the combined key.

How do I ignore certain columns?

Pass an ignore_columns list; those fields are excluded from value comparisons and won't be flagged as changes.

What output formats are available?

HTML for interactive human reports, CSV for tabular exports, and JSON for programmatic consumption.

Does it support position-based comparison?

Yes. If you do not specify key columns, the comparison falls back to row-by-row positional matching.