home / skills / openclaw / skills / crm-data-cleaner

crm-data-cleaner skill

/skills/luigi08001/crm-data-cleaner

This skill cleans CRM data by deduplicating, normalizing, and enriching contacts and companies across major CRMs.

npx playbooks add skill openclaw/skills --skill crm-data-cleaner

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
28.6 KB
---
name: crm-data-cleaner
description: "Deduplicate, normalize, and enrich CRM contacts and companies. Use when a user needs to clean CRM data, find duplicate contacts, standardize phone numbers or emails, merge duplicate records, audit data quality, or enrich contacts with external sources like Clearbit or Apollo. Works with HubSpot, Salesforce, Pipedrive, or any CRM with CSV export. Instruction-only skill — no scripts or code execution. All operations are performed via CRM platform APIs or CSV export/import workflows."
metadata:
  openclaw:
    requires:
      env:
        - HUBSPOT_ACCESS_TOKEN
    primaryCredential: HUBSPOT_ACCESS_TOKEN
    credentialNotes: "Required for HubSpot API access. For Clearbit/Apollo enrichment, set CLEARBIT_API_KEY or APOLLO_API_KEY as needed. For Salesforce, set SALESFORCE_ACCESS_TOKEN. Only the credentials for your specific CRM and enrichment provider are needed."
---

# CRM Data Cleaner — Dedup, Normalize & Enrich Contacts

## Overview

Clean, accurate CRM data is the foundation of effective sales and marketing operations. Poor data quality costs businesses an average of $3.1 million annually through wasted time, missed opportunities, and ineffective campaigns. This skill provides comprehensive frameworks, tools, and automation strategies to maintain pristine contact and company data across all major CRM platforms.

This guide covers the three pillars of CRM data hygiene: **Deduplication** (removing duplicate records), **Normalization** (standardizing data formats), and **Enrichment** (filling missing information with reliable external sources).

## Table of Contents

1. [Understanding Data Quality Issues](#understanding-data-quality-issues)
2. [Deduplication Strategy](#deduplication-strategy)
3. [Data Normalization](#data-normalization)
4. [Data Enrichment](#data-enrichment)
5. [Platform-Specific Implementation](#platform-specific-implementation)
6. [Automation and Monitoring](#automation-and-monitoring)
7. [Maintenance and Governance](#maintenance-and-governance)
8. [Advanced Techniques](#advanced-techniques)

## Understanding Data Quality Issues

### Common CRM Data Problems

**Duplicate Records (30-40% of databases)**
- Multiple entries for same person/company
- Slight variations in name spelling
- Different email addresses for same contact
- Incomplete vs. complete records

**Inconsistent Formatting**
- Phone numbers: (555) 123-4567 vs 555-123-4567 vs +1.555.123.4567
- Company names: "IBM Corp" vs "International Business Machines Corporation"
- Addresses: "St" vs "Street", "CA" vs "California"
- Job titles: "VP Marketing" vs "Vice President of Marketing"

**Missing Information**
- 60% of B2B contacts missing phone numbers
- 40% missing company information
- 35% missing job titles
- 25% missing complete addresses

**Outdated Information**
- Contact changes jobs (20% annually)
- Email addresses become invalid (25% every 2 years)
- Phone numbers change
- Companies reorganize, merge, or rebrand

### Impact on Business Operations

**Sales Productivity Loss**
- 27% of sales time spent on data entry and management
- Missed follow-ups due to duplicate records
- Confusion over primary contact information
- Difficulty identifying decision makers

**Marketing Campaign Inefficiency**
- Increased email bounce rates
- Multiple messages to same recipient
- Poor segmentation due to incomplete data
- Inaccurate reporting and attribution

**Customer Experience Issues**
- Multiple sales reps contacting same prospect
- Conflicting information across touchpoints
- Poor personalization due to incomplete profiles
- Frustration from repeated information requests

### Data Quality Assessment Framework

**Completeness Score**
- Required fields populated: Target 95%
- Optional fields populated: Target 70%
- Critical fields (email, company): Target 98%

**Accuracy Score**
- Valid email format: Target 99%
- Valid phone format: Target 95%
- Verified addresses: Target 90%

**Consistency Score**
- Standardized formatting: Target 95%
- Consistent naming conventions: Target 90%
- Aligned data across systems: Target 95%

**Uniqueness Score**
- Duplicate contact rate: Target <2%
- Duplicate company rate: Target <1%
- Clean merge history: Target 100%

## Deduplication Strategy

### Types of Duplicates

**Exact Duplicates**
- Identical records with same key fields
- Usually caused by import errors
- Easy to identify and merge automatically

**Near Duplicates**
- Similar but not identical records
- Name variations: "Bob Smith" vs "Robert Smith"
- Email variations: personal vs business emails
- Require fuzzy matching algorithms

**Company Duplicates**
- Same company with different names
- "Apple Inc" vs "Apple Computer" vs "Apple"
- Subsidiary vs parent company confusion
- Domain-based matching challenges

**Household/Account Duplicates**
- Multiple contacts at same company
- Family members at same address
- Different roles but same organization

### Duplicate Detection Methods

#### Primary Key Matching
**Email-Based Matching** (Most Reliable)
```
Match Criteria:
- Exact email match = 100% duplicate probability
- Domain + similar names = 85% probability
- Multiple emails for same person = merge candidates
```

**Phone-Based Matching** (Secondary)
```
Match Criteria:
- Exact phone match + similar name = 90% probability
- Same phone, different names = investigate
- Multiple formats of same number = normalize first
```

**Name + Company Matching** (Fuzzy)
```
Match Criteria:
- Exact name + exact company = 95% probability
- Similar name + exact company = 80% probability
- Exact name + similar company = 70% probability
```

#### Advanced Matching Algorithms

**Levenshtein Distance**
- Measures character differences between strings
- Useful for typos and variations
- Example: "Smith" vs "Smyth" = distance of 1

**Soundex Matching**
- Phonetic matching algorithm
- Groups similar-sounding names
- Example: "Smith", "Smyth", "Smithe" = same soundex

**Token Matching**
- Breaks names into components
- Matches individual parts
- Example: "John Michael Smith" matches "J.M. Smith"

### Deduplication Workflow

#### Phase 1: Automated Detection

**High-Confidence Matches (90%+ probability)**
- Exact email matches
- Identical phone + similar names
- Same LinkedIn profile URLs
- Automatic flagging for review

**Medium-Confidence Matches (60-89% probability)**
- Similar names + same company
- Name variations + same domain
- Fuzzy phone number matches
- Queue for manual review

**Low-Confidence Matches (40-59% probability)**
- Loose name similarities
- Possible company matches
- Require detailed investigation

#### Phase 2: Manual Review Process

**Review Queue Prioritization**
1. High-value accounts (enterprise clients)
2. Active opportunities
3. Recent activity (last 30 days)
4. Marketing qualified leads
5. Bulk import suspects

**Review Criteria Checklist**
- [ ] Same person confirmation
- [ ] Most complete record identification
- [ ] Activity history preservation
- [ ] Integration considerations
- [ ] Sales team notifications needed

#### Phase 3: Merge Execution

**Pre-Merge Validation**
- Backup critical data
- Identify master record
- Map fields to preserve
- Note dependencies (campaigns, workflows)

**Field Merge Rules**
- Primary email: Business > Personal > Most recent
- Phone: Mobile > Direct line > Main number
- Address: Most complete > Most recent
- Job title: Most senior > Most recent
- Company: Most complete > Most recent

**Post-Merge Cleanup**
- Update related records
- Refresh reports and lists
- Notify affected team members
- Document merge decisions

### Platform-Specific Deduplication

#### HubSpot Deduplication

**Native Duplicate Management**
- Automatic duplicate detection
- Merge suggestions in contacts view
- Bulk merge capabilities
- Activity history preservation

**Custom Duplicate Rules**
```
Email + Company Domain matching
Name similarity + Phone matching
LinkedIn URL exact matching
Custom property combinations
```

**API-Based Deduplication**
```python
# Example HubSpot duplicate detection
import requests

def find_hubspot_duplicates(api_key, batch_size=100):
    url = f"https://api.hubapi.com/contacts/v1/lists/all/contacts/all"
    params = {
        'hapikey': api_key,
        'count': batch_size,
        'property': ['email', 'firstname', 'lastname', 'company']
    }
    # Implementation details in scripts/
```

#### Salesforce Deduplication

**Duplicate Rules Setup**
- Standard duplicate rules (Lead/Contact)
- Custom matching rules
- Automatic alerts vs blocking
- Duplicate job monitoring

**Third-Party Tools**
- Duplicate Check by CRM Science
- Cloudingo duplicate management
- DemandTools by Validity
- RingLead data management

#### Pipedrive Deduplication

**Manual Duplicate Detection**
- Smart Contact Data feature
- Bulk operations for merging
- Organization-level deduplication
- Custom field mapping

## Data Normalization

### Phone Number Standardization

#### Global Phone Format Standards

**North American Numbers**
```
Input Variations:
- (555) 123-4567
- 555-123-4567
- 555.123.4567
- +1 555 123 4567
- 5551234567

Standardized Output:
- Display: +1 (555) 123-4567
- Storage: +15551234567
- Search: 15551234567
```

**International Numbers**
```
Input Variations:
- +44 20 7946 0958 (UK)
- 020 7946 0958 (UK local)
- +49 30 12345678 (Germany)
- 030-12345678 (Germany local)

Standardized Output:
- Display: +44 20 7946 0958
- Storage: +442079460958
```

#### Phone Validation Rules

**Format Validation**
- Length checks by country
- Area code validation
- Mobile vs landline identification
- Do Not Call registry checking

**Quality Indicators**
- Valid: Properly formatted, verified number
- Invalid: Wrong format, disconnected
- Mobile: Cell phone identified
- International: Non-domestic number
- Suspicious: Pattern matching fake numbers

### Email Address Normalization

#### Email Format Standardization

**Case Normalization**
```
Input: [email protected]
Output: [email protected]
```

**Domain Standardization**
```
Common Variations:
- gmail.com vs googlemail.com → gmail.com
- hotmail.com vs live.com vs outlook.com → outlook.com
- yahoo.com vs ymail.com → yahoo.com
```

**Plus Addressing Removal**
```
Input: [email protected]
Output: [email protected]
```

**Dot Normalization (Gmail)**
```
Input: [email protected]
Output: [email protected]
```

#### Email Validation Levels

**Syntax Validation** (Level 1)
- RFC 5322 compliance
- Valid character checking
- Proper format structure

**Domain Validation** (Level 2)
- MX record verification
- Domain existence checking
- Subdomain validation

**Mailbox Validation** (Level 3)
- SMTP connection testing
- Mailbox existence verification
- Deliverability scoring

### Name Standardization

#### Personal Name Formatting

**Name Case Normalization**
```
Input Variations:
- JOHN SMITH
- john smith
- John SMITH
- jOHN sMITH

Standardized Output:
- John Smith
```

**Name Component Parsing**
```
Input: "Dr. John Michael Smith Jr."
Parsed Components:
- Title: Dr.
- First Name: John
- Middle Name: Michael
- Last Name: Smith
- Suffix: Jr.
```

**Cultural Name Considerations**
- Eastern vs Western name orders
- Hyphenated names handling
- Multiple surname traditions
- Title and honorific preservation

#### Company Name Standardization

**Legal Entity Normalization**
```
Input Variations:
- Apple Inc.
- Apple Incorporated
- Apple, Inc
- Apple Computer Inc.

Standardized Output:
- Apple Inc.
```

**Common Abbreviations**
```
Standard Mappings:
- Corp → Corporation
- Co → Company
- Ltd → Limited
- LLC → Limited Liability Company
- LP → Limited Partnership
```

**DBA (Doing Business As) Handling**
```
Primary: Microsoft Corporation
DBA: Microsoft, MSFT
Subsidiaries: GitHub, LinkedIn
```

### Address Normalization

#### Address Component Standardization

**Street Address Formatting**
```
Input Variations:
- 123 Main St.
- 123 Main Street
- 123 MAIN ST
- 123 main st

Standardized Output:
- 123 Main Street
```

**State/Province Normalization**
```
US States:
- California → CA
- New York → NY
- Texas → TX

Canadian Provinces:
- Ontario → ON
- British Columbia → BC
- Quebec → QC
```

**Postal Code Formatting**
```
US ZIP Codes:
- 12345 → 12345
- 12345-6789 → 12345-6789
- 123456789 → 12345-6789

Canadian Postal Codes:
- k1a0a6 → K1A 0A6
- K1A0A6 → K1A 0A6
```

#### International Address Standards

**United Kingdom Addresses**
```
Standard Format:
[Building Number] [Street Name]
[District/Area]
[Town/City]
[County] [Postcode]
[Country]
```

**European Address Formats**
- German addresses: Street first, house number after
- French addresses: Special character handling
- Nordic countries: Unique postal systems

### Job Title Normalization

#### Title Standardization Rules

**Seniority Level Mapping**
```
C-Level Titles:
- CEO, Chief Executive Officer
- CTO, Chief Technology Officer
- CMO, Chief Marketing Officer
- CFO, Chief Financial Officer

VP Level Titles:
- VP, Vice President
- SVP, Senior Vice President
- EVP, Executive Vice President

Director Level Titles:
- Director, Dir
- Senior Director, Sr. Director
- Executive Director, Exec Director
```

**Functional Area Mapping**
```
Marketing Titles:
- Marketing Manager → Marketing
- Brand Manager → Marketing
- Content Manager → Marketing
- Digital Marketing Specialist → Marketing

Sales Titles:
- Sales Representative → Sales
- Account Manager → Sales
- Business Development → Sales
- Sales Engineer → Sales
```

**Industry-Specific Normalization**
- Healthcare: MD, RN, PharmD standardization
- Legal: JD, Esq., Partner titles
- Academia: PhD, Professor, Dean titles
- Government: GS levels, military ranks

## Data Enrichment

### Enrichment Data Sources

#### Free Data Sources

**Social Media Platforms**
- LinkedIn: Job titles, company info, connections
- Twitter: Engagement data, interests
- Facebook: Personal interests (B2C)
- GitHub: Developer profiles, technologies

**Public Databases**
- Government business registrations
- SEC filings for public companies
- Patent databases
- Professional licensing boards

**Web Scraping Sources**
- Company websites: Team pages, about sections
- Industry directories
- Conference speaker lists
- Press release databases

#### Paid Enrichment Services

**Comprehensive B2B Platforms**

**ZoomInfo** (Premium)
- Contact: $14,995/year for 10,000 credits
- Coverage: 100M+ contacts, 14M+ companies
- Data Types: Direct phone, email, technographics
- Accuracy: 95% for contact info
- API: RESTful with real-time lookups

**Apollo** (Mid-Range)
- Contact: $49-149/month per user
- Coverage: 275M+ contacts, 73M+ companies
- Data Types: Email, phone, intent signals
- Accuracy: 85-90% email accuracy
- API: Generous rate limits, bulk operations

**Clearbit** (Developer-Focused)
- Contact: $99-999/month based on volume
- Coverage: 85M+ contacts, 12M+ companies
- Data Types: Firmographics, technographics
- Accuracy: 85% contact accuracy
- API: Real-time enrichment, webhooks

**Hunter** (Email-Focused)
- Contact: $49-399/month
- Coverage: Email finder and verification
- Data Types: Email addresses, domain search
- Accuracy: 95% email verification
- API: Bulk processing, domain search

#### Specialized Data Providers

**Technographic Data**
- BuiltWith: Website technology stacks
- Datanyze: Technology adoption data
- 6sense: Intent and technology data
- Bombora: Intent signal data

**Financial Data**
- Dun & Bradstreet: Credit and financial data
- Crunchbase: Funding and investor data
- PitchBook: Private market data
- FactSet: Public company financials

**Industry-Specific Data**
- Healthcare: NPI database, medical licenses
- Legal: Bar association directories
- Real Estate: MLS data, property records
- Education: Institution directories

### Enrichment Workflow

#### Data Assessment Phase

**Missing Data Analysis**
```sql
-- Example missing data analysis
SELECT 
    COUNT(*) as total_contacts,
    COUNT(phone) as has_phone,
    COUNT(company) as has_company,
    COUNT(job_title) as has_title,
    (COUNT(*) - COUNT(phone)) as missing_phone,
    (COUNT(*) - COUNT(company)) as missing_company
FROM contacts;
```

**Enrichment Priority Matrix**
- High Value + High Confidence = Immediate enrichment
- High Value + Low Confidence = Manual review
- Low Value + High Confidence = Batch processing
- Low Value + Low Confidence = Skip

#### Batch Enrichment Process

**Data Preparation**
1. Export contact list with unique identifiers
2. Identify enrichment keys (email, domain, name+company)
3. Remove duplicates to avoid duplicate charges
4. Validate existing data quality

**Enrichment Execution**
1. Email-based enrichment (highest accuracy)
2. Domain-based company enrichment
3. Name + Company fuzzy matching
4. Social profile matching
5. Phone number verification

**Data Validation**
1. Cross-reference multiple sources
2. Confidence scoring per data point
3. Flag conflicting information
4. Preserve data provenance

**Integration Back to CRM**
1. Map enriched fields to CRM properties
2. Update existing records without overwriting good data
3. Track enrichment timestamps
4. Log enrichment sources

### Real-Time Enrichment

#### Form Submission Enrichment
```javascript
// Example real-time enrichment on form submit
document.getElementById('leadForm').addEventListener('submit', async function(e) {
    const email = document.getElementById('email').value;
    const company = document.getElementById('company').value;
    
    // Enrich contact data
    const enrichedData = await enrichContact(email, company);
    
    // Update hidden form fields
    updateFormFields(enrichedData);
});
```

#### CRM Integration Triggers
- New contact creation
- Email address updates
- Company field changes
- Lead scoring threshold crossing

#### Progressive Profiling
- Gradual data collection over multiple interactions
- Smart form field suggestions
- Prefilling forms with known data
- A/B testing optimal field combinations

### Data Quality Monitoring

#### Enrichment Accuracy Tracking

**Verification Metrics**
- Email deliverability rates
- Phone connection success rates
- LinkedIn profile match accuracy
- Company information consistency

**Data Decay Monitoring**
- Email bounce rates over time
- Phone number disconnect rates
- Job title change frequency
- Company merger/acquisition impact

**Source Performance Comparison**
- Accuracy by data provider
- Cost per successful enrichment
- Update frequency
- Coverage by industry/region

## Platform-Specific Implementation

### HubSpot Data Cleaning

#### Native HubSpot Tools

**Data Quality Command Center**
- Duplicate detection and management
- Property formatting rules
- Workflow-based data validation
- Automated data hygiene tasks

**Property Settings for Data Quality**
- Field validation rules
- Required field enforcement
- Format standardization
- Default value management

**Workflow Automation**
```
Trigger: Contact is created or updated
Condition: Email domain contains common typos
Action: Flag for manual review + normalize email
```

#### HubSpot Integrations

**Third-Party Apps**
- Insycle: Advanced deduplication and data management
- PieSync: Data synchronization across platforms
- Zapier: Custom data cleaning automations

**Custom Development**
```javascript
// HubSpot API example for bulk data cleaning
const hubspot = require('@hubspot/api-client');

async function cleanContactData(contacts) {
    const hubspotClient = new hubspot.Client({ apiKey: API_KEY });
    
    const cleanedContacts = contacts.map(contact => ({
        id: contact.id,
        properties: {
            phone: normalizePhone(contact.properties.phone),
            email: normalizeEmail(contact.properties.email),
            company: normalizeCompanyName(contact.properties.company)
        }
    }));
    
    return await hubspotClient.crm.contacts.batchApi.update({
        inputs: cleanedContacts
    });
}
```

### Salesforce Data Cleaning

#### Native Salesforce Features

**Duplicate Management**
- Standard duplicate rules
- Custom matching rules
- Duplicate alerts and blocking
- Merge wizard functionality

**Data Validation Rules**
```apex
// Example validation rule for phone format
REGEX(Phone, "^\\+?1?[2-9]\\d{2}[2-9]\\d{2}\\d{4}$")
```

**Flow-Based Automation**
- Screen flows for data entry validation
- Record-triggered flows for cleaning
- Scheduled flows for batch processing

#### Salesforce Apps and Tools

**Paid Solutions**
- Cloudingo: Comprehensive data management
- DemandTools: Advanced deduplication
- RingLead: Data cleaning and enrichment

**Custom Apex Solutions**
```apex
// Custom Apex for email normalization
public class EmailNormalizer {
    public static String normalizeEmail(String email) {
        if (String.isBlank(email)) return email;
        return email.toLowerCase().trim();
    }
}
```

### Pipedrive Data Cleaning

#### Native Pipedrive Features

**Smart Contact Data**
- Automatic duplicate detection
- Merge suggestions
- Data enrichment from public sources

**Custom Fields and Validation**
- Required field settings
- Field type restrictions
- Custom property management

**Automation Features**
- Workflow automation for data tasks
- Email sync and normalization
- Activity-based data updates

## Automation and Monitoring

### Automated Data Quality Workflows

#### Continuous Data Validation

**Real-Time Validation**
- Form submission validation
- Email syntax checking
- Phone format verification
- Required field enforcement

**Scheduled Batch Processing**
- Daily duplicate detection runs
- Weekly enrichment batches
- Monthly data quality reports
- Quarterly complete audits

**Event-Triggered Cleaning**
- New record creation
- Data import completion
- Email bounce notifications
- Contact inactivity alerts

#### Quality Score Automation

**Contact Quality Scoring**
```python
def calculate_contact_quality_score(contact):
    score = 0
    
    # Completeness (40 points)
    if contact.email: score += 15
    if contact.phone: score += 10
    if contact.company: score += 10
    if contact.job_title: score += 5
    
    # Accuracy (40 points)
    if is_valid_email(contact.email): score += 20
    if is_valid_phone(contact.phone): score += 20
    
    # Freshness (20 points)
    days_since_update = (datetime.now() - contact.last_modified).days
    if days_since_update < 30: score += 20
    elif days_since_update < 90: score += 10
    
    return min(score, 100)
```

**Company Quality Scoring**
- Industry classification accuracy
- Company size verification
- Website and domain validation
- Social media presence verification

### Monitoring and Alerting

#### Key Performance Indicators

**Data Quality Metrics**
- Overall completeness percentage
- Duplicate contact percentage
- Email deliverability rate
- Phone number accuracy rate

**Trend Analysis**
- Data quality improvement over time
- Source performance comparison
- Seasonal data decay patterns
- Team adoption metrics

**Alert Thresholds**
- Duplicate detection: >5% increase
- Email bounces: >10% for campaign
- Missing data: >20% for key fields
- Enrichment failures: >30% error rate

#### Reporting Dashboard

**Executive Summary Dashboard**
- Total records and quality score
- Data completeness by key fields
- Enrichment ROI analysis
- Team productivity impact

**Operational Dashboard**
- Daily processing statistics
- Error logs and resolution status
- Data source performance metrics
- Automation workflow status

**Detailed Analysis Reports**
- Field-by-field completion rates
- Source-by-source quality analysis
- Historical trend analysis
- Predictive quality forecasting

### Integration Architecture

#### API-Based Data Flows

**Inbound Data Processing**
```
External Source → Validation → Normalization → Deduplication → Enrichment → CRM
```

**Outbound Data Synchronization**
```
CRM → Clean Data → External Systems (Email, Analytics, etc.)
```

**Real-Time vs Batch Processing**
- Real-time: Form submissions, high-value contacts
- Batch: Bulk imports, scheduled maintenance
- Hybrid: Priority-based processing queues

## Maintenance and Governance

### Data Governance Framework

#### Data Stewardship Roles

**Data Owner (Executive Level)**
- Define data quality standards
- Approve data policies
- Budget allocation for tools
- Strategic oversight

**Data Steward (Operational Level)**
- Daily quality monitoring
- Process execution
- Issue escalation
- Training coordination

**Data Users (Sales/Marketing Teams)**
- Data entry compliance
- Quality feedback
- Process adherence
- Issue reporting

#### Data Quality Policies

**Data Entry Standards**
```
Contact Creation Requirements:
- Email address (validated)
- Company name (standardized)
- Job title (normalized)
- Phone number (formatted)
- Source attribution
```

**Update Procedures**
- Regular data refresh cycles
- Change approval workflows
- Bulk update protocols
- Emergency correction procedures

**Retention and Archival**
- Active record criteria
- Archival trigger conditions
- Data deletion policies
- Compliance requirements

### Change Management

#### Team Training Programs

**Basic Data Hygiene Training**
- Importance of data quality
- Common data entry mistakes
- Platform-specific best practices
- Quality monitoring tools

**Advanced Training Topics**
- Duplicate detection techniques
- Enrichment strategy optimization
- Automation workflow design
- Reporting and analysis

**Ongoing Education**
- Monthly quality scorecards
- Best practice sharing sessions
- Platform update training
- Industry trend analysis

#### Process Documentation

**Standard Operating Procedures**
- Daily maintenance tasks
- Weekly quality reviews
- Monthly deep cleaning
- Quarterly audits

**Troubleshooting Guides**
- Common error resolution
- Escalation procedures
- Recovery protocols
- Emergency contacts

### Compliance and Security

#### Data Privacy Compliance

**GDPR Considerations**
- Consent management
- Data processing justification
- Right to be forgotten
- Data portability requirements

**CCPA Requirements**
- Consumer rights notifications
- Opt-out mechanisms
- Data sale disclosures
- Processing transparency

#### Data Security

**Access Controls**
- Role-based permissions
- Audit logging
- Change tracking
- Approval workflows

**Data Protection**
- Encryption standards
- Backup procedures
- Recovery protocols
- Breach notification

## Advanced Techniques

### Machine Learning Applications

#### Predictive Data Quality

**Quality Score Prediction**
- Predict record quality degradation
- Identify enrichment opportunities
- Forecast maintenance needs
- Optimize cleaning schedules

**Duplicate Detection ML**
- Neural network matching
- Similarity scoring algorithms
- Clustering for bulk identification
- Continuous learning from feedback

#### Natural Language Processing

**Company Name Matching**
- Fuzzy string matching
- Alias recognition
- Subsidiary relationship mapping
- M&A event detection

**Job Title Standardization**
- Role classification
- Seniority level prediction
- Function area mapping
- Industry-specific normalization

### Advanced Automation

#### Intelligent Data Routing

**Smart Assignment Rules**
```python
def assign_data_cleaning_task(record, quality_issues):
    if record.value_tier == 'enterprise':
        return 'manual_review_queue'
    elif len(quality_issues) > 3:
        return 'bulk_processing_queue'
    elif 'duplicate' in quality_issues:
        return 'dedup_automation_queue'
    else:
        return 'standard_cleaning_queue'
```

**Priority-Based Processing**
- Value-based prioritization
- Urgency classification
- Resource allocation optimization
- SLA management

#### Custom Data Pipelines

**Real-Time Processing**
- Stream processing for immediate validation
- Event-driven cleaning triggers
- Microservice architecture
- API-first design

**Batch Processing**
- Distributed processing systems
- Scheduled job management
- Error handling and retry logic
- Progress monitoring

### Integration Ecosystem

#### Multi-Platform Synchronization

**Bidirectional Sync**
- CRM ↔ Marketing Automation
- CRM ↔ Sales Engagement
- CRM ↔ Customer Support
- CRM ↔ Analytics Platforms

**Conflict Resolution**
- Master data management
- Field-level precedence rules
- Timestamp-based updates
- Manual override capabilities

#### API-First Architecture

**RESTful API Design**
```python
# Example API endpoint for data cleaning
@app.route('/api/v1/contacts/clean', methods=['POST'])
def clean_contact_data():
    data = request.get_json()
    
    # Validate input
    if not validate_input(data):
        return {'error': 'Invalid input'}, 400
    
    # Process cleaning
    cleaned_data = {
        'email': normalize_email(data.get('email')),
        'phone': normalize_phone(data.get('phone')),
        'company': normalize_company(data.get('company'))
    }
    
    return {'cleaned_data': cleaned_data}, 200
```

This comprehensive CRM data cleaning skill provides the foundation for maintaining high-quality customer and prospect data across all major platforms. Implementation of these strategies will dramatically improve sales productivity, marketing effectiveness, and overall customer experience while reducing operational overhead and compliance risk.

Overview

This skill cleans, deduplicates, normalizes, and enriches CRM contacts and companies to restore reliable, actionable data. It supports CSV exports or direct APIs for HubSpot, Salesforce, Pipedrive and similar systems, and integrates enrichment sources like Clearbit and Apollo. Use it to reduce duplicate records, standardize formats, and add missing firmographic or contact details for better sales and marketing outcomes.

How this skill works

The skill inspects contact and company records for exact and fuzzy duplicates using email, phone, name, and domain heuristics plus advanced algorithms like Levenshtein and Soundex. It normalizes phone numbers, emails, names, addresses, and job titles to consistent storage and display formats. When enabled, it calls enrichment providers to fill missing fields and scores data quality, then produces merge recommendations, review queues, and safe merge/backup procedures.

When to use it

  • Before importing large contact lists into a CRM
  • To prepare outreach lists and reduce bounce rates
  • When migrating between CRMs or restoring archived data
  • During regular data governance and quarterly audits
  • To enrich incompletely profiled leads and accounts

Best practices

  • Always export and back up raw CSV or CRM snapshots before merges
  • Run automated high-confidence deduplication first, then queue medium-confidence items for manual review
  • Normalize phone and email fields before duplicate detection to improve match accuracy
  • Prioritize merges for active opportunities and high-value accounts
  • Document merge rules and preserve activity history and integrations

Example use cases

  • Deduplicate a 100k-contact HubSpot export and produce a prioritized merge queue
  • Normalize international phone numbers and store in E.164 for consistent dialing
  • Standardize company names and legal entity suffixes across Salesforce for accurate reporting
  • Enrich a demand generation list with Clearbit or Apollo to add job titles and firmographics
  • Audit CRM data quality and generate completeness/accuracy/uniqueness scores for governance

FAQ

Which CRMs are supported?

Works with any CRM that can export CSV files; direct integrations are available for HubSpot, Salesforce, and Pipedrive via their APIs.

How are duplicates detected safely?

Duplicates are classified by confidence: exact-match rules auto-flag high-confidence duplicates, fuzzy algorithms produce medium/low-confidence candidates which can be reviewed before merge.

Can I preserve activity history and integrations when merging?

Yes. The workflow includes pre-merge validation, master-record selection, field mapping rules, and steps to preserve activity history and downstream integrations.