csv-data-auditor skill

/examples/deps/skills/csv-data-auditor

This skill audits CSV data for quality, consistency, and completeness, helping you identify missing values, duplicates, and format issues efficiently.

npx playbooks add skill aig787/agpm --skill csv-data-auditor

Review the files below or copy the command above to add this skill to your agents.

Files (5)
SKILL.md
5.7 KB
---
name: csv-data-auditor
description: Validate and audit CSV data for quality, consistency, and completeness. Use when you need to check CSV files for data issues, missing values, or format inconsistencies.
---

# CSV Data Auditor

## Instructions

When auditing CSV data, perform these validation checks systematically:

### 1. File Structure Validation
- Verify the file exists and is readable
- Check if the file is a valid CSV format
- Ensure consistent delimiter usage
- Verify proper quoting and escaping
- Check for balanced quotes
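
A minimal sketch of these structural checks, assuming a file small enough to sample in memory; `sniff_structure` and the sample size are illustrative choices, not part of the skill's API:

```python
import csv
from pathlib import Path

def sniff_structure(file_path, sample_chars=65536):
    """Check that the file exists, sniff the delimiter, and look for unbalanced quotes."""
    path = Path(file_path)
    if not path.is_file():
        return {"error": "file not found or not readable"}

    sample = path.read_text(encoding="utf-8", errors="replace")[:sample_chars]
    try:
        dialect = csv.Sniffer().sniff(sample)
    except csv.Error:
        return {"error": "could not detect a consistent delimiter"}

    # An odd number of quote characters usually means an unterminated quoted field
    balanced = sample.count(dialect.quotechar) % 2 == 0
    return {"delimiter": dialect.delimiter,
            "quotechar": dialect.quotechar,
            "balanced_quotes": balanced}
```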

### 2. Header Analysis
- Confirm headers exist (unless headerless CSV)
- Check for duplicate header names
- Validate header naming conventions
- Ensure headers are descriptive and consistent
- Check for special characters in headers
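
A sketch of the header checks; the snake_case naming convention enforced here is an assumption to adapt to your own standards:

```python
import csv

def audit_headers(file_path, delimiter=","):
    """Flag empty, duplicate, or oddly named headers in the first row."""
    with open(file_path, newline="", encoding="utf-8") as f:
        headers = next(csv.reader(f, delimiter=delimiter), [])

    issues, seen = [], set()
    for header in headers:
        name = header.strip()
        if not name:
            issues.append("empty header")
        elif name in seen:
            issues.append(f"duplicate header: {name}")
        elif not name.replace("_", "").isalnum():
            issues.append(f"special characters in header: {name}")
        seen.add(name)
    return issues
```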

### 3. Row Consistency
- Count total rows and columns
- Verify all rows have the same number of columns
- Check for empty rows or rows with only whitespace
- Identify truncated rows
- Detect corrupted or malformed rows
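
A possible row-consistency pass, treating the most common column count as the expected width (an assumption made when no schema is available):

```python
import csv
from collections import Counter

def check_row_consistency(file_path, delimiter=","):
    """Count rows, find the dominant column width, and flag ragged or empty rows."""
    widths, empty_rows = Counter(), []
    with open(file_path, newline="", encoding="utf-8") as f:
        for line_no, row in enumerate(csv.reader(f, delimiter=delimiter), start=1):
            if not any(cell.strip() for cell in row):
                empty_rows.append(line_no)
            widths[len(row)] += 1

    expected = widths.most_common(1)[0][0] if widths else 0
    ragged = {width: count for width, count in widths.items() if width != expected}
    return {"rows": sum(widths.values()), "expected_columns": expected,
            "ragged_widths": ragged, "empty_rows": empty_rows}
```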

### 4. Data Type Validation
For each column, validate the expected data type:

#### Numeric Fields
- Check for non-numeric values
- Validate decimal point usage
- Look for negative values where inappropriate
- Check for extremely large/small values

#### Date/Time Fields
- Verify date format consistency (ISO 8601, US, EU, etc.)
- Check for invalid dates (e.g., February 30th)
- Validate time zone handling
- Look for future dates where inappropriate
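
One way to check date format consistency is to count how many values parse under each candidate format; the format list below is an assumption and should mirror the formats your sources actually use:

```python
import pandas as pd

CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]  # ISO, US, "15-Jan-2023"

def detect_date_formats(series):
    """Return how many non-null values parse under each candidate format."""
    values = series.dropna().astype(str)
    return {fmt: int(pd.to_datetime(values, format=fmt, errors="coerce").notna().sum())
            for fmt in CANDIDATE_FORMATS}
```

A column whose values split across more than one format is a candidate for the mixed-format warning shown in the report example later in this document.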

#### Text Fields
- Check for encoding issues
- Look for unexpected special characters
- Verify string length constraints
- Check for leading/trailing whitespace

#### Boolean Fields
- Validate true/false representations
- Check for 1/0, Y/N, Yes/No consistency
- Look for ambiguous values
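
A sketch of the boolean check, assuming the token sets below are the accepted representations (adjust them to your own conventions):

```python
import pandas as pd

TRUE_TOKENS = {"true", "t", "yes", "y", "1"}
FALSE_TOKENS = {"false", "f", "no", "n", "0"}

def audit_boolean_column(series):
    """Report distinct values and any that are neither a true nor a false token."""
    values = series.dropna().astype(str).str.strip().str.lower()
    distinct = set(values)
    ambiguous = sorted(distinct - TRUE_TOKENS - FALSE_TOKENS)
    return {"distinct_values": sorted(distinct), "ambiguous": ambiguous}
```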

### 5. Data Quality Checks

#### Missing Values
- Count NULL/empty values per column
- Calculate missing value percentages
- Identify patterns in missing data
- Check for placeholders like "N/A", "null", "-"
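
A possible missing-value pass that treats both true NaNs and common placeholder strings as missing; the token set is an assumption to adapt:

```python
import pandas as pd

MISSING_TOKENS = {"", "n/a", "na", "null", "none", "-"}

def missing_value_report(df):
    """Per-column count and percentage of NaNs plus placeholder strings."""
    report = {}
    total = max(len(df), 1)
    for col in df.columns:
        as_text = df[col].astype(str).str.strip().str.lower()
        placeholders = (as_text.isin(MISSING_TOKENS) & df[col].notna()).sum()
        missing = int(df[col].isna().sum() + placeholders)
        report[col] = {"missing": missing,
                       "percentage": round(100 * missing / total, 2)}
    return report
```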

#### Duplicates
- Find exact duplicate rows
- Check for duplicate IDs or keys
- Identify near-duplicates (typos, variations)
- Validate uniqueness constraints
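
A sketch of the duplicate checks; `key_columns` is a hypothetical parameter naming the columns that are supposed to be unique:

```python
import pandas as pd

def duplicate_report(df, key_columns=None):
    """Count exact duplicate rows and, optionally, rows sharing a supposedly unique key."""
    report = {"duplicate_rows": int(df.duplicated().sum())}
    if key_columns:
        shared_keys = df[df.duplicated(subset=key_columns, keep=False)]
        report["rows_with_duplicate_keys"] = int(len(shared_keys))
    return report
```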

#### Outliers
- Identify statistical outliers in numeric columns
- Check for values outside expected ranges
- Look for anomalies in categorical data
- Validate business logic constraints
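
For numeric columns, the interquartile-range fence is one workable outlier heuristic; the 1.5 factor below is the conventional default, not a requirement:

```python
import pandas as pd

def iqr_outliers(series, factor=1.5):
    """Flag values outside the interquartile-range fences."""
    values = pd.to_numeric(series, errors="coerce").dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    low, high = q1 - factor * (q3 - q1), q3 + factor * (q3 - q1)
    outliers = values[(values < low) | (values > high)]
    return {"low_fence": float(low), "high_fence": float(high),
            "outlier_count": int(len(outliers))}
```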

#### Consistency Checks
- Cross-validate related columns
- Check referential integrity
- Validate business rules
- Look for logical contradictions
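
Cross-column rules are inherently schema-specific; the sketch below assumes hypothetical `start_date`, `end_date`, and `quantity` columns purely to show the pattern:

```python
import pandas as pd

def check_business_rules(df):
    """Count rows violating each rule; each rule is a boolean mask over the frame."""
    rules = {
        "end_date before start_date": df["end_date"] < df["start_date"],
        "negative quantity": df["quantity"] < 0,
    }
    return {name: int(mask.sum()) for name, mask in rules.items()}
```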

### 6. Validation Script (Python)

Create a Python script to perform automated checks:

```python
import pandas as pd

def audit_csv(file_path, delimiter=','):
    """Perform comprehensive CSV audit"""

    # Load CSV
    try:
        df = pd.read_csv(file_path, delimiter=delimiter)
    except Exception as e:
        return {"error": f"Failed to load CSV: {str(e)}"}

    audit_report = {
        "file_info": {
            "rows": len(df),
            "columns": len(df.columns),
            "headers": list(df.columns)
        },
        "issues": []
    }

    # Check for missing values
    missing_counts = df.isnull().sum()
    for col, count in missing_counts.items():
        if count > 0:
            percentage = (count / len(df)) * 100
            audit_report["issues"].append({
                "type": "missing_values",
                "column": col,
                "count": count,
                "percentage": round(percentage, 2)
            })

    # Check for duplicates
    duplicate_rows = df.duplicated().sum()
    if duplicate_rows > 0:
        audit_report["issues"].append({
            "type": "duplicates",
            "count": duplicate_rows
        })

    # Data type validation: flag object columns that mix numeric-looking
    # and non-numeric values, a common sign of typing problems
    for col in df.columns:
        if df[col].dtype == 'object':
            sample_values = df[col].dropna().head(100).astype(str)
            numeric_like = pd.to_numeric(sample_values, errors='coerce').notna()
            if 0 < numeric_like.sum() < len(sample_values):
                audit_report["issues"].append({
                    "type": "mixed_types",
                    "column": col
                })

    return audit_report
```

### 7. Report Format

Generate a structured audit report:

```markdown
# CSV Audit Report

## File Information
- File: data.csv
- Size: 15.2 MB
- Rows: 50,000
- Columns: 12
- Headers: id, name, email, age, join_date, salary, department, ...

## Issues Found

### Critical Issues
1. **Missing Values**:
   - `email`: 2,500 missing (5.0%)
   - `salary`: 150 missing (0.3%)

### Warnings
1. **Inconsistent Date Format**:
   - `join_date`: Mix of ISO and US formats detected
   - Examples: 2023-01-15, 01/15/2023, 15-Jan-2023

2. **Potential Outliers**:
   - `age`: Values 0 and 150 detected
   - `salary`: Extremely high values > $1M

### Recommendations
1. Clean up email field - contact data source
2. Standardize date format to ISO 8601
3. Validate age and salary ranges
4. Remove or investigate duplicate rows

## Summary
- Overall Quality: Good
- Issues to Fix: 3 critical, 5 warnings
- Estimated Fix Time: 2-3 hours
```
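
A minimal sketch that renders the `audit_report` dictionary from the script above into this kind of markdown summary (the layout is a simplified subset of the full format):

```python
def render_report(audit_report):
    """Render the audit dictionary as a short markdown report."""
    info = audit_report["file_info"]
    lines = [
        "# CSV Audit Report", "",
        "## File Information",
        f"- Rows: {info['rows']:,}",
        f"- Columns: {info['columns']}",
        f"- Headers: {', '.join(info['headers'])}", "",
        "## Issues Found",
    ]
    for issue in audit_report["issues"]:
        details = ", ".join(f"{k}={v}" for k, v in issue.items() if k != "type")
        lines.append(f"- **{issue['type']}**: {details}")
    return "\n".join(lines)
```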

### 8. Common Issues and Solutions

#### Encoding Problems
- Try different encodings (UTF-8, Latin-1, Windows-1252)
- Use `chardet` library to detect encoding
- Handle BOM (Byte Order Mark) if present
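
A sketch of encoding detection with `chardet` (a third-party package, installed separately); the sample size is an arbitrary choice:

```python
import chardet  # third-party: pip install chardet

def detect_encoding(file_path, sample_bytes=65536):
    """Guess the encoding from a sample of raw bytes and check for a UTF-8 BOM."""
    with open(file_path, "rb") as f:
        raw = f.read(sample_bytes)
    guess = chardet.detect(raw)  # e.g. {"encoding": "utf-8", "confidence": 0.99, ...}
    return {"encoding": guess["encoding"],
            "confidence": guess["confidence"],
            "utf8_bom": raw.startswith(b"\xef\xbb\xbf")}
```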

#### Large Files
- Process in chunks for memory efficiency
- Use Dask for out-of-core processing
- Sample data for quick validation
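
For files too large to load at once, pandas can stream the data in chunks; the sketch below accumulates missing-value counts without holding the whole file in memory (the chunk size is an arbitrary choice):

```python
import pandas as pd

def count_missing_in_chunks(file_path, chunksize=100_000):
    """Sum per-column missing-value counts across chunks."""
    totals = None
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        counts = chunk.isnull().sum()
        totals = counts if totals is None else totals.add(counts, fill_value=0)
    return totals
```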

#### Complex CSVs
- Handle quoted fields with embedded delimiters
- Process multi-line records carefully
- Validate escape character usage

## Usage

1. Provide the CSV file path
2. Specify delimiter if not comma
3. Configure validation rules based on your data
4. Review the generated audit report
5. Address issues systematically

## Tips

- Always keep a backup of original data
- Document any data transformations
- Create validation rules based on business requirements
- Consider using schema validation tools like Great Expectations
- Automate regular audits for production data

Overview

This skill validates and audits CSV data for quality, consistency, and completeness. It performs structural checks, header and row consistency analysis, data type validation, and data-quality diagnostics to produce a structured audit report. The output highlights critical issues, warnings, and actionable recommendations. Use it to quickly assess CSV readiness for analysis or ingestion pipelines.

How this skill works

The auditor inspects file-level characteristics (existence, encoding, delimiter, quoting) and verifies header integrity and row consistency. It samples and analyzes each column for its expected data type (numeric, date/time, text, boolean), computes missing-value statistics, and checks for duplicates, outliers, and cross-column inconsistencies. It can run in memory or process large files in chunks, and it emits a structured report with counts, examples, and recommended fixes. A simple Python validation script can be used or adapted to automate the checks and integrate them into pipelines.

When to use it

  • Before importing CSVs into databases or analytics systems
  • As part of an ETL/ingestion validation step for production pipelines
  • When preparing datasets for machine learning or reporting
  • To investigate data quality issues reported by consumers
  • When standardizing formats across multiple CSV sources

Best practices

  • Always detect and specify the encoding; try UTF-8 or Latin-1, or use a detector like chardet
  • Validate and normalize headers: remove duplicates, enforce naming conventions, and strip special characters
  • Process large files in chunks or with Dask to avoid memory exhaustion
  • Standardize date/time formats (prefer ISO 8601) and normalize boolean representations
  • Document transformations, keep originals, and automate periodic audits

Example use cases

  • Audit a monthly export from a CRM to find missing contact emails and duplicate IDs
  • Validate product feed CSVs for an e-commerce ETL before loading into inventory systems
  • Scan survey response files for inconsistent date formats and truncated rows
  • Preflight a large dataset for a machine-learning job to detect outliers and null-heavy columns
  • Integrate into CI pipelines to reject malformed CSV uploads

FAQ

How does the auditor handle very large CSV files?

It processes the file in configurable chunks or uses out-of-core libraries like Dask; it samples early rows for quick checks and runs full scans only on targeted columns.

Can it detect encoding problems automatically?

It attempts common encodings, can use an encoding detector such as chardet, and checks for a BOM; supplying the encoding explicitly is recommended for reliability.