home / skills / willsigmon / sigstack / knack-data-cleaner

knack-data-cleaner skill

/archive/knack-data-cleaner

This skill ensures HTI compliance and dashboard accuracy by validating, deduplicating, normalizing, and auditing data across records.

npx playbooks add skill willsigmon/sigstack --skill knack-data-cleaner

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
11.9 KB
---
name: Knack Data Cleaner
description: Ensures accuracy for HTI compliance and performance dashboards through data validation, deduplication, normalization, and integrity checks. Critica...
allowed-tools: Read, Edit
---

# Knack Data Cleaner

## Purpose
Ensures accuracy for HTI compliance and performance dashboards through data validation, deduplication, normalization, and integrity checks. Critical for NCDIT reporting accuracy.

## Core Functions

### deduplicate_records
**Purpose**: Identify and remove/merge duplicate device or donor records

**Detection Strategies**:
- Exact match: Same serial number, donor email, etc.
- Fuzzy match: Similar names, close dates
- Bulk import duplicates: Same acquisition batch

**Example**:
```javascript
const cleaned = await deduplicate_records({
  object_key: "object_1",
  match_fields: ["serial_number", "acquisition_date"],
  strategy: "keep_latest" // or "merge", "keep_first"
});

// Output:
// {
//   original_count: 3520,
//   duplicates_found: 37,
//   final_count: 3483,
//   removed: [
//     { id: "rec_123", reason: "Duplicate serial: ABC123" },
//     // ...
//   ]
// }
```

**Deduplication Strategies**:
```javascript
const strategies = {
  keep_latest: "Retain most recently modified record",
  keep_first: "Keep oldest record (by created_date)",
  merge: "Combine data from all duplicates",
  manual_review: "Flag for human verification"
};
```

### normalize_fields
**Purpose**: Standardize data formats for consistency

**Normalization Rules**:

#### County Names
```javascript
const county_normalizer = {
  "wake": "Wake",
  "WAKE": "Wake",
  "Wake County": "Wake",
  "wake co": "Wake",
  "wake co.": "Wake"
};
```

#### Phone Numbers
```javascript
// (919) 555-1234 → 9195551234
// 919-555-1234 → 9195551234
// 919.555.1234 → 9195551234
const normalizePhone = (phone) => phone.replace(/\D/g, '');
```

#### Emails
```javascript
// Convert to lowercase, trim whitespace
const normalizeEmail = (email) => email.trim().toLowerCase();
```

#### Dates
```javascript
// Standardize to ISO 8601: YYYY-MM-DD
const normalizeDate = (date) => {
  const d = new Date(date);
  return d.toISOString().split('T')[0];
};
```

#### Device Status
```javascript
const status_normalizer = {
  "acquired": "Acquired",
  "ACQUIRED": "Acquired",
  "received": "Acquired",
  "in progress": "In Progress",
  "ready": "Ready for Donation",
  "delivered": "Presented"
};
```

**Example**:
```javascript
const normalized = await normalize_fields({
  dataset: await fetch_all_pages("object_1"),
  rules: {
    county: county_normalizer,
    phone: normalizePhone,
    email: normalizeEmail,
    date: normalizeDate,
    status: status_normalizer
  }
});
```

### validate_integrity
**Purpose**: Check for logical inconsistencies and data quality issues

**Validation Checks**:

#### Date Logic
```javascript
const date_validations = [
  "acquisition_date < conversion_date",
  "conversion_date < presentation_date",
  "all dates within grant period (2024-2026)",
  "no future dates"
];
```

#### Required Fields
```javascript
const required_by_status = {
  "Acquired": ["donor", "acquisition_date", "quantity"],
  "Converted": ["converted_date", "os_installed", "tested"],
  "Presented": ["recipient", "county", "presentation_date"],
  "Discarded": ["discard_reason", "discard_date"]
};
```

#### Value Ranges
```javascript
const range_validations = {
  quantity: { min: 1, max: 500 },
  training_duration: { min: 0.5, max: 8 },
  participants: { min: 1, max: 100 },
  donation_value: { min: 0, max: 50000 }
};
```

#### Geographic Validation
```javascript
const valid_counties = [
  "Wake", "Durham", "Vance", "Franklin", "Granville",
  "Halifax", "Wilson", "Edgecombe", "Martin", "Hertford",
  "Greene", "Warren", "Northampton", "Person", "Nash"
];

const county_validation = (county) => {
  return valid_counties.includes(county);
};
```

**Example**:
```javascript
const validation_report = await validate_integrity({
  dataset: await fetch_all_pages("object_1"),
  rules: {
    dates: date_validations,
    required: required_by_status,
    ranges: range_validations,
    counties: valid_counties
  }
});

// Output:
// {
//   total_records: 3483,
//   valid: 3401,
//   errors: [
//     { record_id: "rec_123", issue: "Conversion date before acquisition" },
//     { record_id: "rec_456", issue: "Invalid county: 'Raleigh'" },
//     { record_id: "rec_789", issue: "Missing required field: donor" }
//   ],
//   warnings: [
//     { record_id: "rec_234", issue: "Unusually high quantity: 450" }
//   ]
// }
```

### fix_common_issues
**Purpose**: Auto-correct known data entry mistakes

**Auto-Fix Rules**:

#### Trim Whitespace
```javascript
const fix_whitespace = (record) => {
  Object.keys(record).forEach(key => {
    if (typeof record[key] === 'string') {
      record[key] = record[key].trim();
    }
  });
  return record;
};
```

#### Fix Capitalization
```javascript
const fix_capitalization = (record) => {
  if (record.donor_name) {
    record.donor_name = toTitleCase(record.donor_name);
  }
  if (record.county) {
    record.county = county_normalizer[record.county.toLowerCase()] || record.county;
  }
  return record;
};
```

#### Infer Missing Data
```javascript
const infer_missing = (record) => {
  // If county missing but zip code present, infer county
  if (!record.county && record.zip_code) {
    record.county = inferCountyFromZip(record.zip_code);
  }

  // If discard reason missing but status is Discarded, set default
  if (record.status === "Discarded" && !record.discard_reason) {
    record.discard_reason = "Not specified";
  }

  return record;
};
```

**Example**:
```javascript
const fixed = await fix_common_issues({
  dataset: await fetch_all_pages("object_1"),
  auto_fixes: [
    "trim_whitespace",
    "fix_capitalization",
    "infer_missing_counties",
    "standardize_phone_numbers"
  ]
});

// Output:
// {
//   processed: 3483,
//   fixed: 287,
//   fixes_by_type: {
//     whitespace: 145,
//     capitalization: 67,
//     inferred_county: 42,
//     phone_format: 33
//   }
// }
```

### remove_test_records
**Purpose**: Exclude dummy/test data from production reports

**Detection**:
```javascript
const is_test_record = (record) => {
  const test_indicators = [
    record.donor_name?.toLowerCase().includes("test"),
    record.donor_email?.includes("test@"),
    record.donor_email?.includes("example.com"),
    record.serial_number?.startsWith("TEST"),
    record.notes?.toLowerCase().includes("do not count")
  ];

  return test_indicators.some(indicator => indicator === true);
};
```

**Example**:
```javascript
const production_only = await remove_test_records({
  dataset: await fetch_all_pages("object_1")
});

// Output:
// {
//   original: 3483,
//   test_records_removed: 8,
//   production_records: 3475
// }
```

### enrich_records
**Purpose**: Add calculated or derived fields

**Enrichment Operations**:

#### Add Calculated Fields
```javascript
const enrich = (record) => {
  // Add age of device
  record.days_since_acquisition = daysBetween(record.acquisition_date, new Date());

  // Add conversion time
  if (record.converted_date && record.acquisition_date) {
    record.days_to_convert = daysBetween(record.acquisition_date, record.converted_date);
  }

  // Add delivery time
  if (record.presentation_date && record.converted_date) {
    record.days_to_deliver = daysBetween(record.converted_date, record.presentation_date);
  }

  // Flag slow conversions
  record.conversion_delayed = record.days_to_convert > 30;

  return record;
};
```

#### Add Geographic Context
```javascript
const add_region = (record) => {
  const regions = {
    "Triangle": ["Wake", "Durham", "Franklin"],
    "Northeast": ["Vance", "Granville", "Warren", "Halifax"],
    "Eastern": ["Wilson", "Edgecombe", "Martin", "Greene"]
  };

  record.region = Object.keys(regions).find(region =>
    regions[region].includes(record.county)
  ) || "Other";

  return record;
};
```

#### Add Donor Tier
```javascript
const add_donor_tier = (record) => {
  if (record.donation_value > 1000) {
    record.donor_tier = "Major";
  } else if (record.donation_value > 500) {
    record.donor_tier = "Significant";
  } else {
    record.donor_tier = "Standard";
  }
  return record;
};
```

## HTI-Specific Cleaning Workflows

### Pre-Quarterly Report Cleaning
```javascript
async function clean_for_quarterly_report(quarter, year) {
  // 1. Fetch quarter data
  const raw_data = await fetch_all_pages("object_1", {
    filters: quarterly_filter(quarter, year)
  });

  // 2. Remove test records
  const production = await remove_test_records({ dataset: raw_data });

  // 3. Deduplicate
  const unique = await deduplicate_records({
    dataset: production.production_records,
    match_fields: ["serial_number"]
  });

  // 4. Normalize
  const normalized = await normalize_fields({
    dataset: unique.records,
    rules: all_normalizers
  });

  // 5. Validate
  const validated = await validate_integrity({
    dataset: normalized,
    rules: compliance_rules
  });

  // 6. Fix auto-fixable issues
  const fixed = await fix_common_issues({
    dataset: validated.valid_records
  });

  return {
    clean_data: fixed,
    cleaning_summary: {
      original: raw_data.length,
      test_removed: production.test_records_removed,
      duplicates: unique.duplicates_found,
      validation_errors: validated.errors.length,
      auto_fixed: fixed.fixed
    }
  };
}
```

### Pre-Dashboard Data Cleaning
```javascript
async function clean_for_dashboard() {
  const all_data = await fetch_all_pages("object_1");

  // Quick cleaning for real-time dashboard
  const cleaned = all_data
    .filter(r => !is_test_record(r))
    .map(fix_whitespace)
    .map(fix_capitalization)
    .map(enrich);

  return cleaned;
}
```

## Data Quality Monitoring

### Generate Data Quality Report
```javascript
async function data_quality_report() {
  const all_records = await fetch_all_pages("object_1");

  const report = {
    total_records: all_records.length,
    completeness: {
      with_donor: all_records.filter(r => r.donor).length,
      with_county: all_records.filter(r => r.county).length,
      with_serial: all_records.filter(r => r.serial_number).length
    },
    accuracy: {
      valid_counties: all_records.filter(r => valid_counties.includes(r.county)).length,
      logical_dates: all_records.filter(r => validate_dates(r)).length
    },
    duplicates: await find_duplicates(all_records),
    test_records: all_records.filter(is_test_record).length
  };

  return report;
}
```

### Automated Quality Checks
```javascript
// Run daily at 2 AM
cron.schedule('0 2 * * *', async () => {
  const report = await data_quality_report();

  if (report.duplicates.length > 10) {
    await alert_admin("High duplicate count detected");
  }

  if (report.accuracy.valid_counties / report.total_records < 0.95) {
    await alert_admin("County data quality below threshold");
  }
});
```

## Integration Points

- **knack_reader**: Fetch data for cleaning
- **knack_pagination**: Process large datasets
- **knack_reporting_sync**: Clean before report generation
- **knack_dashboard_ai**: Ensure metrics accuracy
- **knack_goal_tracker**: Validate progress data
- **knack_filter_sort**: Filter out invalid records

## Best Practices

1. **Clean before every report**: Don't trust raw data
2. **Log all cleaning operations**: Audit trail for compliance
3. **Manual review for ambiguous cases**: Don't auto-fix everything
4. **Run quality checks daily**: Catch issues early
5. **Version cleaned datasets**: Track data lineage

## Grant Compliance

### NCDIT Data Standards
- No test/dummy records in official reports
- All dates must be within grant period
- County assignments must be accurate (15-county requirement)
- Device lifecycle dates must be logical
- Training hours must have supporting documentation

### Audit Trail
```javascript
const cleaning_log = {
  timestamp: new Date(),
  operation: "quarterly_report_cleaning",
  quarter: "Q1",
  year: "2025",
  records_processed: 875,
  issues_found: 23,
  auto_fixed: 18,
  manual_review_needed: 5,
  performed_by: "automated_system"
};
```

Overview

This skill enforces HTI compliance and improves dashboard accuracy by validating, deduplicating, normalizing, and enriching Knack datasets. It targets NCDIT reporting requirements and routine performance dashboards to ensure reliable, auditable data for reporting and operations.

How this skill works

The skill inspects datasets fetched from Knack, then runs staged cleaning workflows: test-record removal, deduplication (exact and fuzzy), field normalization (phones, emails, dates, statuses), integrity validation (date logic, required fields, ranges, counties), and auto-fixes for common entry mistakes. It can also enrich records with derived fields and generate ongoing quality reports and alerts.

When to use it

  • Before generating quarterly or grant compliance reports
  • Prior to populating public or internal performance dashboards
  • When importing bulk data or running mass updates
  • As part of nightly automated data quality checks
  • When preparing datasets for audits or versioned releases

Best practices

  • Run the full cleaning workflow before every official report to avoid manual corrections later
  • Log every cleaning operation and preserve a versioned copy of cleaned datasets for audit trails
  • Use manual review for ambiguous deduplication or inferred values rather than auto-merging
  • Tune validation rules (date ranges, required fields, county lists) to match grant period and policy changes
  • Schedule daily automated quality checks and alert on falling thresholds (e.g., county accuracy < 95%)

Example use cases

  • Pre-quarterly-report cleaning: remove tests, deduplicate by serial_number, normalize fields, validate, and auto-fix where safe
  • Real-time dashboard feed: quick removal of test records, whitespace/capitalization fixes, and enrichment for metrics
  • Daily monitoring: run automated quality checks at 2 AM and alert admins on spikes in duplicates or invalid counties
  • Bulk import hygiene: detect and merge duplicate donor/device entries from a recent acquisition batch
  • Audit preparation: produce a cleaning_log and versioned dataset showing issues found, auto-fixed counts, and manual review items

FAQ

Can the skill auto-merge duplicates without human review?

Yes — it supports strategies like keep_latest, keep_first, and merge, but ambiguous matches should be flagged for manual review.

How are counties and dates validated for grant compliance?

County names are normalized against an approved list of 15 counties and dates are checked for logical order and confinement within the grant period (e.g., 2024–2026).