
data-cleaning skill

/skills/data-cleaning

This skill helps you clean, validate, and preprocess datasets to improve analytics quality through missing value handling, outlier treatment, and duplicate removal.

npx playbooks add skill pluginagentmarketplace/custom-plugin-data-analyst --skill data-cleaning

Review the files below or copy the command above to add this skill to your agents.

Files (6)
SKILL.md
1.6 KB
---
name: data-cleaning
description: Data cleaning, preprocessing, and quality assurance techniques
version: "2.0.0"
sasmp_version: "2.0.0"
bonded_agent: 05-programming-expert
bond_type: SECONDARY_BOND

# Skill Configuration
config:
  atomic: true
  retry_enabled: true
  max_retries: 3
  backoff_strategy: exponential

# Parameter Validation
parameters:
  tool_preference:
    type: string
    required: true
    enum: [python, r, excel, sql]
    default: python
  data_size:
    type: string
    required: false
    enum: [small, medium, large]
    default: medium

# Observability
observability:
  logging_level: info
  metrics: [rows_cleaned, missing_handled, duplicates_removed]
---

# Data Cleaning Skill

## Overview
Master data cleaning and preprocessing techniques essential for reliable analytics.

## Topics Covered
- Missing value handling (imputation, deletion)
- Outlier detection and treatment
- Data type conversion and validation
- Duplicate identification and removal
- String cleaning and normalization
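
A minimal pandas sketch touching several of these topics at once (the columns and values are hypothetical):

```python
import pandas as pd

# Hypothetical messy dataset
df = pd.DataFrame({
    "name": [" Alice ", "BOB", "alice", None],
    "age": ["34", "29", "34", "n/a"],
    "signup": ["2024-01-05", "2024-02-10", "2024-01-05", ""],
})

# String cleaning and normalization
df["name"] = df["name"].str.strip().str.lower()

# Type conversion and validation: unparseable values become NaN/NaT
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")

# Missing value handling: median imputation for age
df["age"] = df["age"].fillna(df["age"].median())

# Duplicate identification and removal
df = df.drop_duplicates(subset=["name", "signup"])
```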

## Learning Outcomes
- Clean messy datasets
- Handle missing data appropriately
- Detect and treat outliers
- Ensure data quality

## Error Handling

| Error Type | Cause | Recovery |
|------------|-------|----------|
| Memory error | Dataset too large | Use chunking or sampling |
| Type conversion failed | Invalid data format | Apply preprocessing first |
| Encoding issues | Wrong character encoding | Detect and specify encoding |
| Validation failure | Data doesn't meet schema | Review and adjust validation rules |
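
As a rough sketch of the memory-error and encoding recoveries, assuming a CSV input (the file name and chunk size are illustrative):

```python
import pandas as pd

# Process the file in fixed-size chunks instead of loading it all at
# once, and specify the encoding explicitly once it is known.
cleaned_chunks = []
for chunk in pd.read_csv("large_input.csv", chunksize=100_000,
                         encoding="utf-8"):
    cleaned_chunks.append(chunk.drop_duplicates())

# A final pass catches duplicates that span chunk boundaries
df = pd.concat(cleaned_chunks, ignore_index=True).drop_duplicates()
```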

## Related Skills
- programming (for automation)
- foundations (for data quality concepts)
- databases-sql (for SQL-based cleaning)

Overview

This skill teaches practical data cleaning, preprocessing, and quality assurance techniques for analytics workflows. It focuses on preparing reliable datasets by addressing missing values, outliers, type issues, duplicates, and string normalization. The goal is to make data analysis and modeling safer and faster by ensuring data quality at the source.

How this skill works

The skill inspects datasets to detect common problems: missing or malformed values, inconsistent types, duplicate records, outliers, and encoding or string issues. It provides recovery methods such as imputation, deletion, chunked processing for large files, type conversion strategies, and validation against expected schemas, and it suggests error-handling steps for memory limits and encoding mismatches.
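
As a rough illustration of that inspection step, a quick pandas profile might look like this (a minimal sketch, assuming the dataset is already loaded into a DataFrame):

```python
import pandas as pd

def profile(df: pd.DataFrame) -> None:
    """Print a quick data-quality summary before any cleaning."""
    print("Shape:", df.shape)
    print("Column types:\n", df.dtypes)
    print("Missing values per column:\n", df.isna().sum())
    print("Duplicate rows:", df.duplicated().sum())
```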

When to use it

  • Before exploratory data analysis or modeling to ensure input quality
  • When you receive messy or unknown-format datasets from external sources
  • During ETL pipelines to enforce consistent downstream data
  • When encountering frequent type conversion, encoding, or memory errors
  • Before producing reports or BI dashboards, to avoid misleading summaries

Best practices

  • Profile data first to quantify missingness, distributions, and duplicates
  • Prefer reproducible, programmatic cleaning steps over manual edits
  • Use sampling or chunking for very large datasets to avoid memory errors
  • Validate cleaned data against a schema and log assumptions and imputation choices (see the sketch after this list)
  • Treat string normalization and encoding early to prevent downstream bugs
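
One way to implement the validation practice above, sketched with plain pandas assertions rather than a dedicated schema library (the column names and rules are hypothetical):

```python
import pandas as pd

def validate(df: pd.DataFrame) -> None:
    """Raise if the cleaned data violates the expected schema."""
    assert df["age"].between(0, 120).all(), "age out of range"
    assert df["customer_id"].notna().all(), "missing customer ids"
    assert not df.duplicated(subset=["customer_id"]).any(), "duplicate ids"
```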

Example use cases

  • Imputing missing demographic fields before building a customer churn model
  • Removing duplicate rows and normalizing categorical labels for reporting
  • Detecting and capping outliers in transaction amounts to stabilize modeling
  • Converting and validating date and numeric formats from CSV imports
  • Chunked processing of multi-gigabyte logs to handle memory constraints
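
For instance, the outlier-capping case above might be handled by winsorizing at quantiles (a sketch assuming a DataFrame `df` with an `amount` column; the 1%/99% thresholds are illustrative, not a rule):

```python
import pandas as pd

# Cap transaction amounts at the 1st and 99th percentiles to limit
# the influence of extreme outliers on downstream models
lower = df["amount"].quantile(0.01)
upper = df["amount"].quantile(0.99)
df["amount"] = df["amount"].clip(lower=lower, upper=upper)
```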

FAQ

How do I choose between imputation and deletion for missing values?

Decide based on the missingness mechanism and its impact: delete when few values are missing and the missingness looks random; impute when preserving sample size matters or the missingness is systematic. Document the choice and test its downstream effects.
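
A minimal sketch of both options, assuming a pandas DataFrame `df` with a hypothetical `income` column:

```python
import pandas as pd

# Deletion: drop rows missing the field; reasonable when few rows
# are affected and the missingness appears random.
df_deleted = df.dropna(subset=["income"])

# Imputation: fill with the median to preserve sample size; record
# the imputed value so the assumption stays auditable.
median_income = df["income"].median()
df_imputed = df.assign(income=df["income"].fillna(median_income))
```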

What if type conversion fails at scale?

First profile offending rows to identify patterns, then apply targeted preprocessing (e.g., strip non-numeric characters, fix encodings). For large files, run conversions on sampled chunks before full-scale processing.
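
A sketch of that workflow, assuming a hypothetical `price` column being converted to numeric:

```python
import pandas as pd

# Profile the offending rows first to spot patterns
as_numeric = pd.to_numeric(df["price"], errors="coerce")
offending = df.loc[as_numeric.isna() & df["price"].notna(), "price"]
print(offending.value_counts().head(10))

# Targeted preprocessing: strip currency symbols and separators,
# then convert; remaining failures become NaN for later review
stripped = df["price"].astype(str).str.replace(r"[^0-9.\-]", "", regex=True)
df["price"] = pd.to_numeric(stripped, errors="coerce")
```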