
data-sourcing skill


This skill helps you optimize provider selection, routing, and credit usage across 150+ enrichment sources for high-quality, cost-efficient data.

npx playbooks add skill microck/ordinary-claude-skills --skill data-sourcing

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: data-sourcing
description: Optimize provider selection, routing, and credit usage across 150+ enrichment sources for company/contact intelligence.
---

# Data Sourcing & Provider Optimization Skill

## When to Use

- Selecting provider stacks for email, phone, company, or intent enrichment
- Building or tuning waterfall sequences to improve success rates
- Auditing credit consumption or provider performance
- Designing enrichment logic for GTM ops, RevOps, or data engineering teams

## Framework

You are an expert at selecting and optimizing data providers from 150+ available options to maximize data quality while minimizing credit costs. Use this layered framework to keep enrichment predictable and efficient.

### Core Principles

1. **Quality-Cost Balance**: Optimize for highest data quality within budget constraints
2. **Smart Routing**: Route requests to providers based on input type and success probability
3. **Waterfall Logic**: Try providers in sequence and stop at the first successful result
4. **Caching Strategy**: Leverage cached data to reduce redundant API calls
5. **Bulk Optimization**: Process similar requests together for volume discounts

### Provider Selection Matrix

#### For Email Discovery

**Best Input Scenarios:**
- **Have LinkedIn URL**: ContactOut → RocketReach → Apollo
- **Have Name + Company**: Apollo → Hunter → RocketReach → FindyMail
- **Have Domain Only**: Hunter → Apollo → Clearbit
- **Have Email (need validation)**: ZeroBounce → NeverBounce → Debounce
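
The input scenarios above can be expressed as a lookup table plus a small router. This is a minimal sketch: the provider orders come from the list, while the record field names (`linkedin_url`, `name`, `company`, `domain`, `email`) and the precedence between input types are assumptions for illustration.

```python
from typing import Dict, List

# Provider order per input type, copied from the scenarios above.
EMAIL_ROUTES: Dict[str, List[str]] = {
    "linkedin_url": ["ContactOut", "RocketReach", "Apollo"],
    "name_company": ["Apollo", "Hunter", "RocketReach", "FindyMail"],
    "domain": ["Hunter", "Apollo", "Clearbit"],
    "email": ["ZeroBounce", "NeverBounce", "Debounce"],  # validation only
}


def route_email_request(record: dict) -> List[str]:
    """Return the provider sequence for the most specific input available.

    Field names and the precedence order are hypothetical.
    """
    if record.get("email"):
        return EMAIL_ROUTES["email"]
    if record.get("linkedin_url"):
        return EMAIL_ROUTES["linkedin_url"]
    if record.get("name") and record.get("company"):
        return EMAIL_ROUTES["name_company"]
    if record.get("domain"):
        return EMAIL_ROUTES["domain"]
    return []  # nothing to route on
```

Keeping the routes in data rather than in branches makes it easier to reorder providers after an audit without touching the routing logic.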

**Quality Tiers:**
- **Premium** (90%+ success): ZoomInfo, BetterContact waterfall
- **Standard** (75%+ success): Apollo, Hunter, RocketReach
- **Budget** (60%+ success): Snov.io, Prospeo, ContactOut

#### For Company Intelligence

**Data Type Priority:**
- **Basic Firmographics**: Clearbit (fastest) → Ocean.io → Apollo
- **Financial Data**: Crunchbase → PitchBook → Dealroom
- **Technology Stack**: BuiltWith → HG Insights → Clearbit
- **Intent Signals**: B2D AI → ZoomInfo Intent → 6sense
- **News & Social**: Google News → Social platforms → Owler

**Industry Specialization:**
- **Startups**: Crunchbase, Dealroom, AngelList
- **Enterprise**: ZoomInfo, D&B, HG Insights
- **E-commerce**: Store Leads, BuiltWith, Shopify data
- **Healthcare**: Definitive Healthcare + compliance providers
- **Financial Services**: PitchBook, S&P Capital IQ

### Credit Optimization Strategies

#### Cost Tiers
```
Tier 0 (Free): Native operations, cached data, manual inputs
Tier 1 (0.5 credits): Validation, verification, basic lookups
Tier 2 (1-2 credits): Standard enrichments (Apollo, Hunter, Clearbit)
Tier 3 (2-3 credits): Premium data (ZoomInfo, technographics, intent)
Tier 4 (3-5 credits): Enterprise intelligence (PitchBook, custom AI)
Tier 5 (5-10 credits): Specialized services (video generation, deep AI research)
```

#### Optimization Tactics

**1. Cache Everything**
- Email: 30-day cache
- Company: 90-day cache
- Intent: 7-day cache
- Static data: Indefinite cache
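
One way to apply these TTLs is a small cache keyed by data type. The sketch below is in-memory only and purely illustrative: `EnrichmentCache` and its methods are hypothetical names, and a production setup would more likely sit on a shared store.

```python
import time

# TTLs in seconds, taken from the list above; None means never expire.
DAY = 24 * 60 * 60
CACHE_TTL = {
    "email": 30 * DAY,
    "company": 90 * DAY,
    "intent": 7 * DAY,
    "static": None,
}


class EnrichmentCache:
    """In-memory cache keyed by (data_type, key). Hypothetical helper."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so expiry can be tested
        self._store = {}

    def set(self, data_type, key, value):
        self._store[(data_type, key)] = (value, self._clock())

    def get(self, data_type, key):
        entry = self._store.get((data_type, key))
        if entry is None:
            return None
        value, stored_at = entry
        ttl = CACHE_TTL[data_type]
        if ttl is not None and self._clock() - stored_at > ttl:
            del self._store[(data_type, key)]  # expired, drop it
            return None
        return value
```

A `get` that returns `None` means the caller should fall through to a paid provider and then `set` the fresh result.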

**2. Batch Processing**
```python
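# Choose a processing mode by batch size to capture volume discounts.
# The use_* helpers are placeholders for your own pipeline functions.
def process_batch(record_count):
    if record_count > 1000:
        return use_provider("apollo_bulk")  # 10-30% discount
    elif record_count > 100:
        return use_parallel_processing()
    return use_standard_processing()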
```

**3. Smart Waterfalls**
```python
waterfall_sequence = [
    {"provider": "cache", "credits": 0},
    {"provider": "apollo", "credits": 1.5, "stop_if_success": True},
    {"provider": "hunter", "credits": 1.2, "stop_if_success": True},
    {"provider": "bettercontact", "credits": 3, "stop_if_success": True},
    {"provider": "ai_research", "credits": 5, "last_resort": True}
]
```
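
A sequence like this needs a runner that walks it in order. The following is a minimal sketch under stated assumptions: `lookup` is a hypothetical callable returning a dict on a hit and `None` on a miss, credits are charged per attempt (some providers bill only on success), and any hit is treated as terminal rather than reading the `stop_if_success` and `last_resort` flags.

```python
def run_waterfall(sequence, lookup, record):
    """Try each step in order; return (result, credits_spent, provider).

    `lookup(provider, record)` is a hypothetical callable that returns a
    dict on success or None on a miss.
    """
    spent = 0.0
    for step in sequence:
        spent += step["credits"]  # assumption: charged per attempt
        result = lookup(step["provider"], record)
        if result:
            return result, spent, step["provider"]
    return None, spent, None


# Usage with a fake lookup in which only Hunter finds the contact.
sequence = [
    {"provider": "cache", "credits": 0},
    {"provider": "apollo", "credits": 1.5},
    {"provider": "hunter", "credits": 1.2},
]
fake_hits = {"hunter": {"email": "jane@example.com"}}
result, spent, provider = run_waterfall(
    sequence, lambda name, rec: fake_hits.get(name), {"name": "Jane"}
)
```

Here the cache and Apollo both miss, so the run ends at Hunter after spending roughly 2.7 credits. Returning the spend alongside the result makes per-record cost auditing straightforward.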

### Provider-Specific Optimizations

#### Apollo.io
- **Strengths**: US B2B, LinkedIn data, phone numbers
- **Weaknesses**: International coverage, personal emails
- **Tips**: Use bulk API for 10%+ discount, batch similar companies

#### ZoomInfo
- **Strengths**: Enterprise data, org charts, intent signals
- **Weaknesses**: Expensive, SMB coverage
- **Tips**: Reserve for high-value accounts, negotiate enterprise deals

#### Hunter
- **Strengths**: Domain searches, email patterns, API reliability
- **Weaknesses**: Phone numbers, detailed contact info
- **Tips**: Best for initial domain exploration, use pattern detection

#### Clearbit
- **Strengths**: Real-time API, company data, speed
- **Weaknesses**: Email discovery rates, phone numbers
- **Tips**: Great for instant enrichment, combine with others for contacts

#### BuiltWith
- **Strengths**: Technology detection, historical data, e-commerce
- **Weaknesses**: Contact information, company financials
- **Tips**: Filter accounts by technology before enrichment

### Waterfall Strategies

#### Maximum Success Waterfall
```yaml
Priority: Success rate over cost
Sequence:
  1. BetterContact (aggregates 10+ sources)
  2. ZoomInfo (if enterprise)
  3. Apollo + Hunter + RocketReach
  4. AI web research
Expected Success: 95%+
Average Cost: 8-12 credits
```

#### Balanced Waterfall
```yaml
Priority: Good success with reasonable cost
Sequence:
  1. Apollo.io
  2. Hunter (if domain match)
  3. RocketReach (if name match)
  4. Stop or continue based on confidence
Expected Success: 80%
Average Cost: 3-5 credits
```

#### Budget Waterfall
```yaml
Priority: Minimize cost
Sequence:
  1. Cache check
  2. Hunter (domain only)
  3. Free sources (Google, LinkedIn public)
  4. Stop at first result
Expected Success: 60%
Average Cost: 1-2 credits
```
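
Choosing between the three waterfalls can be reduced to a budget check. The sketch below uses only the success rates and cost ranges quoted above. `pick_waterfall` and the rule of comparing the upper end of each cost range against the budget are assumptions for illustration.

```python
from typing import Optional

# Expected success and average cost range (credits), from the blocks above.
WATERFALLS = {
    "maximum_success": {"expected_success": 0.95, "avg_cost": (8, 12)},
    "balanced": {"expected_success": 0.80, "avg_cost": (3, 5)},
    "budget": {"expected_success": 0.60, "avg_cost": (1, 2)},
}


def pick_waterfall(max_credits_per_record: float) -> Optional[str]:
    """Return the highest-success waterfall whose upper average cost
    fits the per-record budget, or None if none fits."""
    affordable = [
        (spec["expected_success"], name)
        for name, spec in WATERFALLS.items()
        if spec["avg_cost"][1] <= max_credits_per_record
    ]
    return max(affordable)[1] if affordable else None
```

With a budget of 5 credits per record this selects the balanced waterfall. Below 2 credits nothing fits, so the caller would fall back to cache-only operation.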

### Quality Scoring Framework

```python
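def calculate_data_quality_score(data, sources):
    """Score a record from 0 to 100 across four weighted components."""
    score = 0

    # Multi-source validation (30 points): 10 per source, awarded only
    # when more than one source returned the record
    if len(sources) > 1:
        score += min(len(sources) * 10, 30)

    # Data completeness (30 points): 7.5 per populated field, so the four
    # fields sum to the stated 30 and the total cannot exceed 100
    required_fields = ["email", "phone", "title", "company"]
    score += sum(7.5 for field in required_fields if data.get(field))

    # Verification status (20 points)
    if data.get("email_verified"):
        score += 10
    if data.get("phone_verified"):
        score += 10

    # Recency (20 points); get_data_age is assumed to be defined elsewhere
    # and to return the record's age in days
    days_old = get_data_age(data)
    if days_old < 30:
        score += 20
    elif days_old < 90:
        score += 10

    return score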
```

### Industry-Specific Provider Selection

#### SaaS/Technology
- Primary: Apollo, Clearbit, BuiltWith
- Secondary: ZoomInfo, HG Insights
- Intent: G2, TrustRadius, 6sense

#### Financial Services
- Primary: PitchBook, ZoomInfo
- Compliance: LexisNexis, D&B
- News: Bloomberg, Reuters

#### Healthcare
- Primary: Definitive Healthcare
- Compliance: NPPES, state boards
- Standard: ZoomInfo with healthcare filters

#### E-commerce
- Primary: Store Leads, BuiltWith
- Platform-specific: Shopify, Amazon seller data
- Standard: Clearbit with e-commerce signals

### Troubleshooting Common Issues

#### Low Email Discovery Rate
- Check email patterns with Hunter
- Try personal email providers
- Use AI research for executives
- Consider LinkedIn outreach instead

#### High Credit Usage
- Audit waterfall sequences
- Increase cache TTL
- Negotiate volume deals
- Use native operations first

#### Poor Data Quality
- Add verification steps
- Cross-reference multiple sources
- Set minimum confidence thresholds
- Implement human review for critical data
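
Minimum thresholds and human review can be combined into a single triage step. This is a sketch only: the function name and the default cut-offs of 80 and 50 are hypothetical and should be tuned against your own data.

```python
def triage_record(score: float, sync_at: float = 80, review_at: float = 50) -> str:
    """Map a 0-100 quality score to an action.

    Threshold defaults are hypothetical placeholders.
    """
    if score >= sync_at:
        return "sync"  # good enough to write to the CRM
    if score >= review_at:
        return "human_review"  # borderline, queue for a person
    return "reject"  # too weak to use
```

Raising `sync_at` for critical fields pushes more records into review and fewer straight into the CRM.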

### Advanced Techniques

#### Hybrid Enrichment
```python
# Combine AI and traditional providers
def hybrid_enrichment(company):
    # Fast, cheap base data (fall back to an empty record on a miss)
    base = clearbit_lookup(company) or {}
    
    # AI for missing pieces
    if not base.get("description"):
        base["description"] = ai_generate_description(company)
    
    # Premium for high-value
    if is_enterprise_account(base):
        base.update(zoominfo_enrich(company))
    
    return base
```

#### Progressive Enrichment
```python
# Enrich in stages based on engagement
def progressive_enrichment(lead):
    # Stage 1: Basic (on import)
    if lead.stage == "new":
        return basic_enrichment(lead)  # 1-2 credits
    
    # Stage 2: Engaged (opened email)
    elif lead.stage == "engaged":
        return standard_enrichment(lead)  # 3-5 credits
    
    # Stage 3: Qualified (booked meeting)
    elif lead.stage == "qualified":
        return comprehensive_enrichment(lead)  # 10+ credits

    # Any other stage: return the lead unchanged without spending credits
    return lead
```

## Templates
- **Provider Cheat Sheet**: See `references/provider_cheat_sheet.md` for provider selection.
- **Cost Calculator**: See `scripts/cost_calculator.py` for estimating credit usage.
- **Integration Code Templates**:
```javascript
// JavaScript/Node.js template
const enrichContact = async (name, company) => {
  // Check cache first
  const cached = await checkCache(name, company);
  if (cached) return cached;
  
  // Try providers in sequence
  const providers = ['apollo', 'hunter', 'rocketreach'];
  
  for (const provider of providers) {
    try {
      const result = await callProvider(provider, {name, company});
      // Optional chaining guards against a null or undefined result
      if (result?.email) {
        await saveToCache(result);
        return result;
      }
    } catch (error) {
      console.warn(`${provider} failed (${error.message}), trying next...`);
    }
  }
  
  // Fallback to AI research
  return await aiResearch(name, company);
};
```

---

## Tips

- **Pre-build waterfalls per motion** so GTM teams can call a single orchestration command rather than juggling providers.
- **Instrument cache hit rates**; alert RevOps when the hit rate drops below target to avoid a spike in credit usage.
- **Rotate premium providers** each quarter to negotiate better volume discounts and to cover gaps in any single provider's data.
- **Pair enrichment with QA hooks** (e.g., verification APIs, sampling) before syncing into CRM to prevent bad data cascades.
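
For the cache instrumentation tip, a counter with an alert check is often enough to start. The sketch below is illustrative: `CacheHitMonitor`, the 0.6 default target, and the minimum sample size are all hypothetical.

```python
class CacheHitMonitor:
    """Tracks cache lookups and flags a hit rate below target."""

    def __init__(self, target_rate: float = 0.6, min_samples: int = 100):
        self.target_rate = target_rate  # hypothetical default
        self.min_samples = min_samples  # avoid alerting on tiny samples
        self.hits = 0
        self.lookups = 0

    def record(self, hit: bool) -> None:
        self.lookups += 1
        if hit:
            self.hits += 1

    @property
    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0

    def should_alert(self) -> bool:
        enough_data = self.lookups >= self.min_samples
        return enough_data and self.hit_rate < self.target_rate
```

Call `record` on every cache lookup and poll `should_alert` from whatever scheduler or dashboard already notifies RevOps.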

---

*Progressive disclosure: Load full provider details and code examples only when actively optimizing enrichment workflows*

Overview

This skill optimizes provider selection, routing, and credit usage across 150+ enrichment sources to deliver company and contact intelligence efficiently. It provides a layered framework for balancing data quality and cost, with ready-made waterfalls, caching rules, and provider-specific tips. Use it to make enrichment predictable, reduce wasted credits, and improve match rates for GTM and data teams.

How this skill works

The skill inspects input types (LinkedIn URL, name+company, domain, or raw email) and routes requests through prioritized provider sequences (waterfalls) that trade off success rate and credit cost. It applies caching, batch processing, and progressive enrichment stages, and scores results with a data quality formula that weights multi-source validation, completeness, verification, and recency. There are templates and provider matrices for email discovery, company firmographics, technographics, and intent signals.

When to use it

  • Selecting or auditing provider stacks for email, phone, company, or intent enrichment
  • Designing or tuning waterfall sequences to improve hit rates while controlling credits
  • Reducing credit consumption via caching, batching, and provider volume discounts
  • Progressive enrichment based on lead engagement stage (new, engaged, qualified)
  • Choosing industry-specific providers for SaaS, healthcare, finance, or e-commerce

Best practices

  • Cache aggressively by data type (email 30d, company 90d, intent 7d, static indefinite)
  • Batch similar records to unlock bulk API discounts and parallel processing
  • Start with low-cost or cached sources, escalate to premium only for failures or high-value accounts
  • Score and require multi-source validation for critical syncs into CRM
  • Instrument cache hit rates and alert when effectiveness drops

Example use cases

  • Build a balanced waterfall for SDR outreach: Apollo → Hunter → RocketReach, stop on success
  • Audit credit usage across pipelines and reserve expensive providers for account-based plays
  • Progressive enrichment: basic on import, standard after engagement, comprehensive when qualified
  • Filter accounts by technographic signal using BuiltWith before running expensive enrichments
  • Hybrid enrichment combining fast API lookups with targeted AI research for executive bios

FAQ

How do I choose a waterfall for budget vs. success?

Use the budget waterfall (cache → Hunter → public sources) for low cost and the balanced or maximum success waterfalls when you need higher match rates; reserve premium providers for high-value targets.

What caching TTLs should I use?

Common defaults: email 30 days, company 90 days, intent 7 days, static data indefinite; adjust based on your data freshness needs and hit rates.