To add this skill to your agents:

```shell
npx playbooks add skill pluginagentmarketplace/custom-plugin-data-analyst --skill data-cleaning
```
---
name: data-cleaning
description: Data cleaning, preprocessing, and quality assurance techniques
version: "2.0.0"
sasmp_version: "2.0.0"
bonded_agent: 05-programming-expert
bond_type: SECONDARY_BOND

# Skill Configuration
config:
  atomic: true
  retry_enabled: true
  max_retries: 3
  backoff_strategy: exponential

# Parameter Validation
parameters:
  tool_preference:
    type: string
    required: true
    enum: [python, r, excel, sql]
    default: python
  data_size:
    type: string
    required: false
    enum: [small, medium, large]
    default: medium

# Observability
observability:
  logging_level: info
  metrics: [rows_cleaned, missing_handled, duplicates_removed]
---
# Data Cleaning Skill
## Overview
Master data cleaning and preprocessing techniques essential for reliable analytics.
## Topics Covered
- Missing value handling (imputation, deletion)
- Outlier detection and treatment
- Data type conversion and validation
- Duplicate identification and removal
- String cleaning and normalization
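The topics above can be sketched together in a short pandas pass; the column names and values here are hypothetical, purely for illustration:

```python
import pandas as pd

# A small messy dataset (hypothetical columns for illustration).
df = pd.DataFrame({
    "name": [" Alice ", "bob", "Alice", None],
    "age": ["34", "29", "34", "41"],
    "city": ["NYC", "nyc", "NYC", "LA"],
})

# String cleaning and normalization: trim whitespace, unify case.
df["name"] = df["name"].str.strip().str.title()
df["city"] = df["city"].str.upper()

# Type conversion with validation: coerce bad values to NaN instead of raising.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Missing value handling: here we drop rows missing a key field.
df = df.dropna(subset=["name"])

# Duplicate identification and removal.
df = df.drop_duplicates()
```

Each step is independent, so the order can be adjusted; normalizing strings *before* deduplication matters, since `" Alice "` and `"Alice"` only match after cleaning.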
## Learning Outcomes
- Clean messy datasets
- Handle missing data appropriately
- Detect and treat outliers
- Ensure data quality
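For outlier detection specifically, one common approach (not the only one) is the IQR rule, sketched here on a made-up series:

```python
import pandas as pd

# Hypothetical numeric series with one extreme value.
s = pd.Series([10, 12, 11, 13, 12, 11, 300])

# IQR rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]

# One treatment option: winsorize (clip) rather than delete.
clipped = s.clip(lower=lower, upper=upper)
```

Clipping preserves row count, which matters when rows carry other valid fields; deletion is safer when the outlier suggests a corrupted record.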
## Error Handling
| Error Type | Cause | Recovery |
|------------|-------|----------|
| Memory error | Dataset too large | Use chunking or sampling |
| Type conversion failed | Invalid data format | Apply preprocessing first |
| Encoding issues | Wrong character encoding | Detect and specify encoding |
| Validation failure | Data doesn't meet schema | Review and adjust validation rules |
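The chunking recovery for memory errors can be sketched as below; an in-memory buffer stands in for a large file, but the same loop works on a real path:

```python
import io
import pandas as pd

# Simulate a large CSV with an in-memory buffer (real code would pass a file path).
csv = io.StringIO("value\n" + "\n".join(str(i) for i in range(10_000)))

# Process in fixed-size chunks so only one chunk is in memory at a time.
total, count = 0, 0
for chunk in pd.read_csv(csv, chunksize=1_000):
    total += chunk["value"].sum()
    count += len(chunk)

mean = total / count
```

Aggregations that decompose over chunks (sums, counts, min/max) work directly; order statistics like medians need sampling or approximate algorithms instead.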
## Related Skills
- programming (for automation)
- foundations (for data quality concepts)
- databases-sql (for SQL-based cleaning)
This skill teaches practical data cleaning, preprocessing, and quality assurance techniques for analytics workflows. It focuses on preparing reliable datasets by addressing missing values, outliers, type issues, duplicates, and string normalization. The goal is to make data analysis and modeling safer and faster by ensuring data quality at the source.
The skill inspects datasets to detect common problems: missing or malformed values, inconsistent types, duplicate records, outliers, and encoding or string issues. It provides methods for recovery such as imputation, deletion, chunked processing for large files, type conversion strategies, and validation against expected schemas. It also suggests error handling steps for memory limits and encoding mismatches.
**How do I choose between imputation and deletion for missing values?**

Decide based on the missingness mechanism and its impact: delete when missingness is small and random; impute when preserving sample size matters or the missingness is systematic. Document the choice and test its downstream effects.
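Both options side by side, on a hypothetical column (median imputation shown because it is robust to skew):

```python
import pandas as pd

# Hypothetical column with a few missing values.
df = pd.DataFrame({"income": [52_000, 61_000, None, 58_000, None, 49_000]})

# Option 1: deletion -- acceptable when missingness is small and random.
dropped = df.dropna(subset=["income"])

# Option 2: median imputation -- preserves sample size, robust to skew.
imputed = df.assign(income=df["income"].fillna(df["income"].median()))
```
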
**What if type conversion fails at scale?**

First profile the offending rows to identify patterns, then apply targeted preprocessing (e.g., strip non-numeric characters, fix encodings). For large files, run conversions on sampled chunks before full-scale processing.
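The profiling step can be sketched as follows: coerce instead of raising, then inspect the rows that failed (the raw values here are invented examples):

```python
import pandas as pd

# Hypothetical raw column with mixed formats.
raw = pd.Series(["1,200", "850", "N/A", "2,400", "oops"])

# Targeted preprocessing, then coerce: invalid entries become NaN instead of raising.
converted = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# Profile the offending rows to find patterns before a full-scale run.
bad = raw[converted.isna()]
```

Inspecting `bad` reveals which patterns (sentinel strings, stray characters, locale formats) still need handling before committing to a full conversion pass.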