
data-quality-auditor skill

/data-quality-auditor

This skill assesses data quality across completeness, uniqueness, validity, and consistency to speed up ETL validation and support trustworthy reporting.

npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill data-quality-auditor

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
---
name: data-quality-auditor
description: Assess data quality with checks for missing values, duplicates, type issues, and inconsistencies. Use for data validation, ETL pipelines, or dataset documentation.
---

# Data Quality Auditor

Comprehensive data quality assessment for CSV files and pandas DataFrames.

## Features

- **Completeness**: Missing values analysis
- **Uniqueness**: Duplicate detection
- **Validity**: Type validation and constraints
- **Consistency**: Pattern and format checks
- **Quality Score**: Overall data quality metric
- **Reports**: Detailed HTML/JSON reports

## Quick Start

```python
from data_quality_auditor import DataQualityAuditor

auditor = DataQualityAuditor()
auditor.load_csv("customers.csv")

# Run full audit
report = auditor.audit()
print(f"Quality Score: {report['quality_score']}/100")

# Check specific issues
missing = auditor.check_missing()
duplicates = auditor.check_duplicates()
```

## CLI Usage

```bash
# Full audit
python data_quality_auditor.py --input data.csv

# Generate HTML report
python data_quality_auditor.py --input data.csv --report report.html

# Check specific aspects
python data_quality_auditor.py --input data.csv --missing
python data_quality_auditor.py --input data.csv --duplicates
python data_quality_auditor.py --input data.csv --types

# JSON output
python data_quality_auditor.py --input data.csv --json

# Validate against rules
python data_quality_auditor.py --input data.csv --rules rules.json
```
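For reference, a wrapper exposing this interface could be wired up roughly as shown below. This is a hypothetical sketch covering only the `--input`, `--report`, and `--json` flags, not the shipped script.

```python
import argparse
import json

from data_quality_auditor import DataQualityAuditor

parser = argparse.ArgumentParser(description="Audit a CSV file for data quality issues")
parser.add_argument("--input", required=True, help="CSV file to audit")
parser.add_argument("--report", help="write an HTML report to this path")
parser.add_argument("--json", action="store_true", help="print the full audit as JSON")
args = parser.parse_args()

auditor = DataQualityAuditor()
auditor.load_csv(args.input)

if args.report:
    auditor.generate_report(args.report, format="html")

if args.json:
    # default=str keeps non-JSON-native values (e.g. numpy numbers) serializable
    print(json.dumps(auditor.audit(), indent=2, default=str))
else:
    print(auditor.summary())
```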

## API Reference

### DataQualityAuditor Class

```python
class DataQualityAuditor:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, **kwargs) -> 'DataQualityAuditor'
    def load_dataframe(self, df: pd.DataFrame) -> 'DataQualityAuditor'

    # Full audit
    def audit(self) -> dict
    def quality_score(self) -> float

    # Individual checks
    def check_missing(self) -> dict
    def check_duplicates(self, subset: list = None) -> dict
    def check_types(self) -> dict
    def check_uniqueness(self) -> dict
    def check_patterns(self, column: str, pattern: str) -> dict

    # Validation
    def validate_column(self, column: str, rules: dict) -> dict
    def validate_dataset(self, rules: dict) -> dict

    # Reports
    def generate_report(self, output: str, format: str = "html") -> str
    def summary(self) -> str
```
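Because the loaders return the auditor instance (see the signatures above), calls can be chained. A short sketch of running targeted checks with that pattern:

```python
from data_quality_auditor import DataQualityAuditor

# load_csv returns the auditor, so loading and checking can be chained
missing = DataQualityAuditor().load_csv("customers.csv").check_missing()

# run a targeted duplicate check on a single key column
auditor = DataQualityAuditor().load_csv("customers.csv")
dups = auditor.check_duplicates(subset=["email"])

print(missing["missing_percent"], dups["duplicate_percent"])
```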

## Quality Checks

### Missing Values
```python
missing = auditor.check_missing()
# Returns:
{
    "total_cells": 10000,
    "missing_cells": 150,
    "missing_percent": 1.5,
    "by_column": {
        "email": {"count": 50, "percent": 5.0},
        "phone": {"count": 100, "percent": 10.0}
    },
    "rows_with_missing": 120
}
```
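The per-column breakdown makes it straightforward to flag fields that fall below a completeness target. A small sketch using the structure above; the 20% threshold is an arbitrary example:

```python
missing = auditor.check_missing()

# collect columns whose missing-value rate exceeds the example threshold
too_sparse = [
    col for col, stats in missing["by_column"].items()
    if stats["percent"] > 20.0
]
if too_sparse:
    print("Columns above missing-value threshold:", ", ".join(too_sparse))
```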

### Duplicates
```python
dups = auditor.check_duplicates()
# Returns:
{
    "total_rows": 1000,
    "duplicate_rows": 25,
    "duplicate_percent": 2.5,
    "duplicate_groups": [...],
    "by_columns": {
        "email": {"duplicates": 15},
        "phone": {"duplicates": 20}
    }
}
```
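To treat rows as duplicates only when specific key columns match, pass a `subset`. A sketch assuming the same return structure as above; the column names are illustrative:

```python
# only consider the business key when looking for duplicates
dups = auditor.check_duplicates(subset=["email", "phone"])

if dups["duplicate_rows"] > 0:
    print(f"{dups['duplicate_rows']} of {dups['total_rows']} rows "
          f"share the same email/phone ({dups['duplicate_percent']}%)")
```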

### Type Validation
```python
types = auditor.check_types()
# Returns:
{
    "columns": {
        "age": {
            "detected_type": "int64",
            "unique_values": 75,
            "sample_values": [25, 30, 45],
            "issues": []
        },
        "date": {
            "detected_type": "object",
            "unique_values": 365,
            "sample_values": ["2023-01-01", "invalid"],
            "issues": ["Mixed date formats detected"]
        }
    }
}
```
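When mixed date formats are flagged, a common remediation is to normalize the column with plain pandas before re-auditing. This sketch uses pandas directly and is not part of the auditor's API:

```python
import pandas as pd

from data_quality_auditor import DataQualityAuditor

df = pd.read_csv("data.csv")

# parse each value independently (pandas >= 2.0) and coerce unparseable entries to NaT
df["date"] = pd.to_datetime(df["date"], format="mixed", errors="coerce")

# re-run the type check on the cleaned frame
auditor = DataQualityAuditor()
auditor.load_dataframe(df)
print(auditor.check_types()["columns"]["date"]["issues"])
```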

## Validation Rules

Define custom validation rules:

```json
{
    "columns": {
        "email": {
            "required": true,
            "unique": true,
            "pattern": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
        },
        "age": {
            "type": "integer",
            "min": 0,
            "max": 120
        },
        "status": {
            "allowed_values": ["active", "inactive", "pending"]
        },
        "created_at": {
            "type": "date",
            "format": "%Y-%m-%d"
        }
    }
}
```

```python
import json

rules = json.load(open("rules.json"))  # the rule file shown above
results = auditor.validate_dataset(rules)
```

## Quality Score

The quality score (0-100) is calculated from:
- **Completeness** (30%): Missing value ratio
- **Uniqueness** (25%): Duplicate row ratio
- **Validity** (25%): Type and constraint compliance
- **Consistency** (20%): Format and pattern adherence

```python
score = auditor.quality_score()
# 85.5
```
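Expressed as arithmetic, the score is a weighted sum of the four sub-scores. A sketch of that calculation with illustrative sub-scores (the auditor's internal rounding and scaling may differ):

```python
# illustrative sub-scores on a 0-100 scale
completeness, uniqueness, validity, consistency = 95.0, 90.0, 80.0, 70.0

quality_score = (
    0.30 * completeness   # missing-value ratio
    + 0.25 * uniqueness   # duplicate-row ratio
    + 0.25 * validity     # type and constraint compliance
    + 0.20 * consistency  # format and pattern adherence
)
print(round(quality_score, 1))  # 85.0
```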

## Output Formats

### Audit Report
```python
{
    "file": "data.csv",
    "rows": 1000,
    "columns": 15,
    "quality_score": 85.5,
    "completeness": {
        "score": 92.0,
        "missing_cells": 800,
        "details": {...}
    },
    "uniqueness": {
        "score": 97.5,
        "duplicate_rows": 25,
        "details": {...}
    },
    "validity": {
        "score": 78.0,
        "type_issues": [...],
        "details": {...}
    },
    "consistency": {
        "score": 80.0,
        "pattern_issues": [...],
        "details": {...}
    },
    "recommendations": [
        "Column 'phone' has 10% missing values",
        "25 duplicate rows detected",
        "Column 'date' has inconsistent formats"
    ]
}
```
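For auditability (see Best practices below), the same structure can be written to disk. A minimal sketch, assuming the `audit()` return value is JSON-serializable apart from stray numpy values handled by `default=str`:

```python
import json
from datetime import date

report = auditor.audit()

# keep a dated copy of the raw audit for later comparison or compliance review
with open(f"audit_{date.today().isoformat()}.json", "w") as f:
    json.dump(report, f, indent=2, default=str)
```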

## Example Workflows

### Pre-Import Validation
```python
auditor = DataQualityAuditor()
auditor.load_csv("import_data.csv")

report = auditor.audit()
if report['quality_score'] < 80:
    print("Data quality below threshold!")
    for rec in report['recommendations']:
        print(f"  - {rec}")
    raise SystemExit(1)
```

### ETL Pipeline Check
```python
auditor = DataQualityAuditor()
auditor.load_dataframe(transformed_df)

# Check critical columns
email_check = auditor.validate_column("email", {
    "required": True,
    "unique": True,
    "pattern": r"^[\w.+-]+@[\w-]+\.[\w.-]+$"
})

if email_check['issues']:
    raise ValueError(f"Email validation failed: {email_check['issues']}")
```

### Generate Documentation
```python
auditor = DataQualityAuditor()
auditor.load_csv("dataset.csv")

# Generate comprehensive report
auditor.generate_report("quality_report.html", format="html")

# Or get summary text
print(auditor.summary())
```

## Dependencies

- pandas>=2.0.0
- numpy>=1.24.0

Overview

This skill provides a comprehensive data quality auditor for CSV files and pandas DataFrames. It inspects completeness, uniqueness, validity, and consistency and produces a numeric quality score plus detailed HTML/JSON reports. Use it to validate data before import, in ETL pipelines, or to generate dataset documentation.

How this skill works

The auditor loads data from CSV or a pandas DataFrame and runs modular checks: missing values, duplicate detection, type validation, pattern/format consistency, and custom rule validation. Each check returns structured results and contributes to a weighted quality score (completeness 30%, uniqueness 25%, validity 25%, consistency 20%). Reports and summaries can be generated as JSON or HTML for downstream automation or human review.

When to use it

  • Pre-import validation to block poor-quality feeds
  • Automated ETL pipelines to gate transformations
  • Data profiling and documentation for new datasets
  • Monitoring data health in scheduled jobs
  • Validating user-submitted or third-party data files

Best practices

  • Define and store validation rules for critical columns (types, patterns, allowed values)
  • Treat quality score thresholds as gating criteria in CI/ETL workflows
  • Run targeted checks (e.g., check_duplicates or check_patterns) during fast feedback loops
  • Persist JSON reports for auditability and use HTML for stakeholder-friendly summaries
  • Combine column-level rules with global policies (required, unique) for robust validation

Example use cases

  • Pre-import checklist: run audit() and abort imports when quality_score < threshold
  • ETL checkpoint: validate transformed DataFrame with validate_column before loading to warehouse
  • Data onboarding: generate an HTML report for new datasets to highlight issues and remediation steps
  • Automated monitoring: schedule audit runs and alert on falling completeness or rising duplicates
  • Documentation: produce JSON summaries for dataset catalogs and data contracts

FAQ

What input formats are supported?

Load CSV files with load_csv, or pass a pandas DataFrame directly with load_dataframe.

How is the quality score calculated?

A weighted combination of completeness (30%), uniqueness (25%), validity (25%), and consistency (20%), each derived from the respective checks and issue counts.

Can I enforce custom validation rules?

Yes. Define rules for columns (required, unique, type, pattern, allowed_values, min/max, date format) and run validate_dataset or validate_column to get detailed rule compliance.