---
name: dataset-comparer
description: Compare two datasets to find added rows, removed rows, and changed values. Use for data validation, ETL verification, or tracking changes.
---
# Dataset Comparer
Compare two CSV/Excel datasets to identify differences, additions, deletions, and value changes.
## Features
- **Row Comparison**: Find added, removed, and matching rows
- **Value Changes**: Detect changed values in matching rows
- **Column Comparison**: Identify schema differences
- **Statistics**: Summary of differences
- **Diff Reports**: HTML, CSV, and JSON output
- **Flexible Matching**: Compare by key columns or row position
## Quick Start
```python
from dataset_comparer import DatasetComparer
comparer = DatasetComparer()
comparer.load("old_data.csv", "new_data.csv")
# Compare by key column
diff = comparer.compare(key_columns=["id"])
print(f"Added rows: {diff['added_count']}")
print(f"Removed rows: {diff['removed_count']}")
print(f"Changed rows: {diff['changed_count']}")
# Generate report
comparer.generate_report("diff_report.html")
```
## CLI Usage
```bash
# Basic comparison
python dataset_comparer.py --old old.csv --new new.csv
# Compare by key column
python dataset_comparer.py --old old.csv --new new.csv --key id
# Multiple key columns
python dataset_comparer.py --old old.csv --new new.csv --key id,date
# Generate HTML report
python dataset_comparer.py --old old.csv --new new.csv --key id --report diff.html
# Export differences to CSV
python dataset_comparer.py --old old.csv --new new.csv --key id --output diff.csv
# JSON output
python dataset_comparer.py --old old.csv --new new.csv --key id --json
# Ignore specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --ignore updated_at,modified_date
# Compare only specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --columns name,email,status
```
## API Reference
### DatasetComparer Class
```python
class DatasetComparer:
    def __init__(self)

    # Data loading
    def load(self, old_path: str, new_path: str) -> 'DatasetComparer'
    def load_dataframes(self, old_df: pd.DataFrame,
                        new_df: pd.DataFrame) -> 'DatasetComparer'

    # Comparison
    def compare(self, key_columns: list = None,
                ignore_columns: list = None,
                compare_columns: list = None) -> dict

    # Detailed results
    def get_added_rows(self) -> pd.DataFrame
    def get_removed_rows(self) -> pd.DataFrame
    def get_changed_rows(self) -> pd.DataFrame
    def get_unchanged_rows(self) -> pd.DataFrame

    # Schema comparison
    def compare_schema(self) -> dict

    # Reports
    def generate_report(self, output: str, format: str = "html") -> str
    def to_dataframe(self) -> pd.DataFrame
    def summary(self) -> str
```
## Comparison Methods
### Key-Based Comparison
Compare rows by matching key columns (like primary keys):
```python
diff = comparer.compare(key_columns=["customer_id"])
# Multiple keys for composite matching
diff = comparer.compare(key_columns=["order_id", "product_id"])
```
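Conceptually, key-based matching behaves like an outer merge on the key column with an indicator, then a cell-by-cell comparison of the matched rows. A minimal, self-contained pandas sketch with toy data (an illustration of the idea, not the skill's internal implementation):

```python
import pandas as pd

# Two toy snapshots keyed by "id" (hypothetical data for illustration).
old = pd.DataFrame({"id": [1, 2, 3], "status": ["a", "b", "c"]})
new = pd.DataFrame({"id": [2, 3, 4], "status": ["b", "x", "d"]})

# An outer merge with indicator=True classifies each key as
# left_only (removed), right_only (added), or both (matched).
merged = old.merge(new, on="id", how="outer", indicator=True,
                   suffixes=("_old", "_new"))
added = merged[merged["_merge"] == "right_only"]
removed = merged[merged["_merge"] == "left_only"]
both = merged[merged["_merge"] == "both"]

# Matched rows whose non-key values differ are "changed".
changed = both[both["status_old"] != both["status_new"]]

print(len(added), len(removed), len(changed))  # 1 1 1
```

Composite keys work the same way: pass a list of columns to `on=`.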
### Position-Based Comparison
Compare rows by their position (row number):
```python
diff = comparer.compare() # No keys = positional comparison
```
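For the positional case, pandas' built-in `DataFrame.compare` produces a similar cell-level diff of two equal-length frames; a short sketch with toy data:

```python
import pandas as pd

# Two equal-length snapshots compared row-by-row (toy data).
old = pd.DataFrame({"name": ["Ann", "Bob"], "qty": [1, 2]})
new = pd.DataFrame({"name": ["Ann", "Bea"], "qty": [1, 2]})

# Only differing cells are kept, side by side under "old"/"new" labels.
diff = old.compare(new, result_names=("old", "new"))
print(diff)
```

Here only row 1's `name` cell differs, so the result contains a single row with the `name` column's old and new values.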
## Output Format
### Comparison Result
```python
{
    "summary": {
        "old_rows": 1000,
        "new_rows": 1050,
        "added_count": 75,
        "removed_count": 25,
        "changed_count": 50,
        "unchanged_count": 900,
        "total_differences": 150
    },
    "schema_changes": {
        "added_columns": ["new_field"],
        "removed_columns": ["old_field"],
        "type_changes": [
            {"column": "amount", "old_type": "int64", "new_type": "float64"}
        ]
    },
    "key_columns": ["id"],
    "compared_columns": ["name", "email", "status"],
    "ignored_columns": ["updated_at"]
}
```
### Changed Row Details
```python
changes = comparer.get_changed_rows()
# Returns DataFrame with columns:
# _key: Key value(s) for the row
# _column: Column that changed
# _old_value: Original value
# _new_value: New value
```
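Because the changed rows come back in this long (one row per changed cell) format, ordinary pandas operations summarize them. A sketch with hypothetical values in the same column layout:

```python
import pandas as pd

# A toy changes frame in the long format described above (hypothetical values).
changes = pd.DataFrame({
    "_key": [7, 7, 9],
    "_column": ["email", "status", "status"],
    "_old_value": ["a@x.com", "active", "active"],
    "_new_value": ["a@y.com", "paused", "churned"],
})

# How many fields changed per row key.
per_key = changes.groupby("_key").size()

# A wide per-key view of the new values, one column per changed field.
wide = changes.pivot(index="_key", columns="_column", values="_new_value")
print(per_key.to_dict())
```

Key 7 changed two fields and key 9 changed one, so `per_key` maps 7 to 2 and 9 to 1.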
## Schema Comparison
Compare column structure:
```python
schema = comparer.compare_schema()
# Returns:
{
    "old_columns": ["id", "name", "old_field"],
    "new_columns": ["id", "name", "new_field"],
    "common_columns": ["id", "name"],
    "added_columns": ["new_field"],
    "removed_columns": ["old_field"],
    "type_changes": [
        {"column": "price", "old_type": "int64", "new_type": "float64"}
    ],
    "old_row_count": 1000,
    "new_row_count": 1050
}
```
## Filtering Options
### Ignore Columns
Skip certain columns during comparison:
```python
diff = comparer.compare(
    key_columns=["id"],
    ignore_columns=["updated_at", "modified_by", "timestamp"]
)
```
### Compare Specific Columns
Only compare selected columns:
```python
diff = comparer.compare(
    key_columns=["id"],
    compare_columns=["name", "email", "status"]  # Only these columns
)
```
## Report Formats
### HTML Report
```python
comparer.generate_report("diff_report.html", format="html")
```
Features:
- Summary statistics
- Interactive tables
- Color-coded changes (green=added, red=removed, yellow=changed)
- Schema comparison section
### CSV Export
```python
comparer.generate_report("diff_report.csv", format="csv")
```
Includes all differences in tabular format.
### JSON Output
```python
comparer.generate_report("diff_report.json", format="json")
```
Complete diff data in JSON format.
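The JSON report is convenient for gating a pipeline step on the diff without re-running the comparison. A self-contained sketch that writes and consumes a minimal payload mirroring the comparison-result structure shown earlier (the field names are assumptions based on that example):

```python
import json
import os
import tempfile

# A minimal report payload mirroring the comparison-result structure above.
report = {"summary": {"added_count": 75, "removed_count": 25,
                      "changed_count": 50, "total_differences": 150}}

path = os.path.join(tempfile.mkdtemp(), "diff_report.json")
with open(path, "w") as f:
    json.dump(report, f)

# A downstream job reads the summary and decides whether to fail the run.
with open(path) as f:
    summary = json.load(f)["summary"]
fail = summary["total_differences"] > 100
print(fail)  # True
```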
## Example Workflows
### Data Migration Validation
```python
comparer = DatasetComparer()
comparer.load("source_data.csv", "migrated_data.csv")
diff = comparer.compare(key_columns=["id"])
if diff["summary"]["total_differences"] == 0:
    print("Migration successful - no differences!")
else:
    print(f"Found {diff['summary']['total_differences']} differences")
    comparer.generate_report("migration_issues.html")
```
### ETL Pipeline Verification
```python
comparer = DatasetComparer()
comparer.load("yesterday.csv", "today.csv")
diff = comparer.compare(
    key_columns=["transaction_id"],
    ignore_columns=["processing_timestamp"]
)
# Check for unexpected changes
changed = comparer.get_changed_rows()
if len(changed) > 0:
    print("Warning: Historical records changed!")
    print(changed)
```
### Incremental Update Detection
```python
comparer = DatasetComparer()
comparer.load("last_sync.csv", "current.csv")
diff = comparer.compare(key_columns=["customer_id"])
# Get new records for processing
new_records = comparer.get_added_rows()
print(f"New records to process: {len(new_records)}")
# Get deleted records
deleted = comparer.get_removed_rows()
print(f"Records to deactivate: {len(deleted)}")
```
### Schema Change Detection
```python
comparer = DatasetComparer()
comparer.load("v1_export.csv", "v2_export.csv")
schema = comparer.compare_schema()
if schema["added_columns"]:
    print(f"New columns: {schema['added_columns']}")
if schema["removed_columns"]:
    print(f"Removed columns: {schema['removed_columns']}")
if schema["type_changes"]:
    for change in schema["type_changes"]:
        print(f"Type change: {change['column']} "
              f"({change['old_type']} -> {change['new_type']})")
```
## Large Dataset Tips
For very large datasets:
```python
# Load both datasets
comparer = DatasetComparer()
comparer.load("large_old.csv", "large_new.csv")
# Use key column for efficient matching
diff = comparer.compare(key_columns=["id"])
# Export only differences (not full data)
comparer.generate_report("diff_only.csv", format="csv")
```
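When both files are too large to hold in memory at once, one workable pattern is to stream each CSV and collect only the key column first; added and removed keys then fall out of set differences, and only the affected rows need a full read afterwards. A sketch with toy files (the paths and helper are illustrative, not part of the skill's API):

```python
import os
import tempfile
import pandas as pd

# Toy files standing in for large CSVs (paths are illustrative).
d = tempfile.mkdtemp()
old_path = os.path.join(d, "old.csv")
new_path = os.path.join(d, "new.csv")
pd.DataFrame({"id": range(0, 6)}).to_csv(old_path, index=False)
pd.DataFrame({"id": range(3, 9)}).to_csv(new_path, index=False)

def key_set(path, key, chunksize=2):
    """Stream a CSV and collect its key values without loading it whole."""
    keys = set()
    for chunk in pd.read_csv(path, usecols=[key], chunksize=chunksize):
        keys.update(chunk[key].tolist())
    return keys

old_keys = key_set(old_path, "id")
new_keys = key_set(new_path, "id")
print(sorted(new_keys - old_keys))  # added keys: [6, 7, 8]
print(sorted(old_keys - new_keys))  # removed keys: [0, 1, 2]
```

In production the `chunksize` would be far larger (for example 100_000 rows); reading just the key column keeps peak memory proportional to the number of keys, not the full row width.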
## Dependencies
- pandas>=2.0.0
- numpy>=1.24.0
## FAQ
**Can I compare by multiple key columns?**
Yes. Provide a list of key columns for composite matching so rows are matched by the combined key.
**How do I ignore certain columns?**
Pass an `ignore_columns` list so those fields are excluded from value comparisons and won't trigger changes.
**What output formats are available?**
HTML for interactive human reports, CSV for tabular exports, and JSON for programmatic consumption.
**Does it support position-based comparison?**
Yes. If you do not specify key columns, the comparison falls back to row-by-row positional matching.