home / skills / microck / ordinary-claude-skills / data-sourcing
This skill helps you optimize provider selection, routing, and credit usage across 150+ enrichment sources for high-quality, cost-efficient data.
npx playbooks add skill microck/ordinary-claude-skills --skill data-sourcingReview the files below or copy the command above to add this skill to your agents.
---
name: data-sourcing
description: Optimize provider selection, routing, and credit usage across 150+ enrichment sources for company/contact intelligence.
---
# Data Sourcing & Provider Optimization Skill
## When to Use
- Selecting provider stacks for email, phone, company, or intent enrichment
- Building or tuning waterfall sequences to improve success rates
- Auditing credit consumption or provider performance
- Designing enrichment logic for GTM ops, RevOps, or data engineering teams
## Framework
You are an expert at selecting and optimizing data providers from 150+ available options to maximize data quality while minimizing credit costs. Use this layered framework to keep enrichment predictable and efficient.
### Core Principles
1. **Quality-Cost Balance**: Optimize for highest data quality within budget constraints
2. **Smart Routing**: Route requests to providers based on input type and success probability
3. **Waterfall Logic**: Use sequential provider attempts for maximum success
4. **Caching Strategy**: Leverage cached data to reduce redundant API calls
5. **Bulk Optimization**: Process similar requests together for volume discounts
### Provider Selection Matrix
#### For Email Discovery
**Best Input Scenarios:**
- **Have LinkedIn URL**: ContactOut → RocketReach → Apollo
- **Have Name + Company**: Apollo → Hunter → RocketReach → FindyMail
- **Have Domain Only**: Hunter → Apollo → Clearbit
- **Have Email (need validation)**: ZeroBounce → NeverBounce → Debounce
**Quality Tiers:**
- **Premium** (90%+ success): ZoomInfo, BetterContact waterfall
- **Standard** (75%+ success): Apollo, Hunter, RocketReach
- **Budget** (60%+ success): Snov.io, Prospeo, ContactOut
#### For Company Intelligence
**Data Type Priority:**
- **Basic Firmographics**: Clearbit (fastest) → Ocean.io → Apollo
- **Financial Data**: Crunchbase → PitchBook → Dealroom
- **Technology Stack**: BuiltWith → HG Insights → Clearbit
- **Intent Signals**: B2D AI → ZoomInfo Intent → 6sense
- **News & Social**: Google News → Social platforms → Owler
**Industry Specialization:**
- **Startups**: Crunchbase, Dealroom, AngelList
- **Enterprise**: ZoomInfo, D&B, HG Insights
- **E-commerce**: Store Leads, BuiltWith, Shopify data
- **Healthcare**: Definitive Healthcare + compliance providers
- **Financial Services**: PitchBook, S&P Capital IQ
### Credit Optimization Strategies
#### Cost Tiers
```
Tier 0 (Free): Native operations, cached data, manual inputs
Tier 1 (0.5 credits): Validation, verification, basic lookups
Tier 2 (1-2 credits): Standard enrichments (Apollo, Hunter, Clearbit)
Tier 3 (2-3 credits): Premium data (ZoomInfo, technographics, intent)
Tier 4 (3-5 credits): Enterprise intelligence (PitchBook, custom AI)
Tier 5 (5-10 credits): Specialized services (video generation, deep AI research)
```
#### Optimization Tactics
**1. Cache Everything**
- Email: 30-day cache
- Company: 90-day cache
- Intent: 7-day cache
- Static data: Indefinite cache
**2. Batch Processing**
```python
# Process in batches for volume discounts
if record_count > 1000:
use_provider("apollo_bulk") # 10-30% discount
elif record_count > 100:
use_parallel_processing()
else:
use_standard_processing()
```
**3. Smart Waterfalls**
```python
waterfall_sequence = [
{"provider": "cache", "credits": 0},
{"provider": "apollo", "credits": 1.5, "stop_if_success": True},
{"provider": "hunter", "credits": 1.2, "stop_if_success": True},
{"provider": "bettercontact", "credits": 3, "stop_if_success": True},
{"provider": "ai_research", "credits": 5, "last_resort": True}
]
```
### Provider-Specific Optimizations
#### Apollo.io
- **Strengths**: US B2B, LinkedIn data, phone numbers
- **Weaknesses**: International coverage, personal emails
- **Tips**: Use bulk API for 10%+ discount, batch similar companies
#### ZoomInfo
- **Strengths**: Enterprise data, org charts, intent signals
- **Weaknesses**: Expensive, SMB coverage
- **Tips**: Reserve for high-value accounts, negotiate enterprise deals
#### Hunter
- **Strengths**: Domain searches, email patterns, API reliability
- **Weaknesses**: Phone numbers, detailed contact info
- **Tips**: Best for initial domain exploration, use pattern detection
#### Clearbit
- **Strengths**: Real-time API, company data, speed
- **Weaknesses**: Email discovery rates, phone numbers
- **Tips**: Great for instant enrichment, combine with others for contacts
#### BuiltWith
- **Strengths**: Technology detection, historical data, e-commerce
- **Weaknesses**: Contact information, company financials
- **Tips**: Filter accounts by technology before enrichment
### Waterfall Strategies
#### Maximum Success Waterfall
```yaml
Priority: Success rate over cost
Sequence:
1. BetterContact (aggregates 10+ sources)
2. ZoomInfo (if enterprise)
3. Apollo + Hunter + RocketReach
4. AI web research
Expected Success: 95%+
Average Cost: 8-12 credits
```
#### Balanced Waterfall
```yaml
Priority: Good success with reasonable cost
Sequence:
1. Apollo.io
2. Hunter (if domain match)
3. RocketReach (if name match)
4. Stop or continue based on confidence
Expected Success: 80%
Average Cost: 3-5 credits
```
#### Budget Waterfall
```yaml
Priority: Minimize cost
Sequence:
1. Cache check
2. Hunter (domain only)
3. Free sources (Google, LinkedIn public)
4. Stop at first result
Expected Success: 60%
Average Cost: 1-2 credits
```
### Quality Scoring Framework
```python
def calculate_data_quality_score(data, sources):
score = 0
# Multi-source validation (30 points)
if len(sources) > 1:
score += min(len(sources) * 10, 30)
# Data completeness (30 points)
required_fields = ["email", "phone", "title", "company"]
score += sum(10 for field in required_fields if data.get(field))
# Verification status (20 points)
if data.get("email_verified"):
score += 10
if data.get("phone_verified"):
score += 10
# Recency (20 points)
days_old = get_data_age(data)
if days_old < 30:
score += 20
elif days_old < 90:
score += 10
return score
```
### Industry-Specific Provider Selection
#### SaaS/Technology
- Primary: Apollo, Clearbit, BuiltWith
- Secondary: ZoomInfo, HG Insights
- Intent: G2, TrustRadius, 6sense
#### Financial Services
- Primary: PitchBook, ZoomInfo
- Compliance: LexisNexis, D&B
- News: Bloomberg, Reuters
#### Healthcare
- Primary: Definitive Healthcare
- Compliance: NPPES, state boards
- Standard: ZoomInfo with healthcare filters
#### E-commerce
- Primary: Store Leads, BuiltWith
- Platform-specific: Shopify, Amazon seller data
- Standard: Clearbit with e-commerce signals
### Troubleshooting Common Issues
#### Low Email Discovery Rate
- Check email patterns with Hunter
- Try personal email providers
- Use AI research for executives
- Consider LinkedIn outreach instead
#### High Credit Usage
- Audit waterfall sequences
- Increase cache TTL
- Negotiate volume deals
- Use native operations first
#### Poor Data Quality
- Add verification steps
- Cross-reference multiple sources
- Set minimum confidence thresholds
- Implement human review for critical data
### Advanced Techniques
#### Hybrid Enrichment
```python
# Combine AI and traditional providers
def hybrid_enrichment(company):
# Fast, cheap base data
base = clearbit_lookup(company)
# AI for missing pieces
if not base.get("description"):
base["description"] = ai_generate_description(company)
# Premium for high-value
if is_enterprise_account(base):
base.update(zoominfo_enrich(company))
return base
```
#### Progressive Enrichment
```python
# Enrich in stages based on engagement
def progressive_enrichment(lead):
# Stage 1: Basic (on import)
if lead.stage == "new":
return basic_enrichment(lead) # 1-2 credits
# Stage 2: Engaged (opened email)
elif lead.stage == "engaged":
return standard_enrichment(lead) # 3-5 credits
# Stage 3: Qualified (booked meeting)
elif lead.stage == "qualified":
return comprehensive_enrichment(lead) # 10+ credits
```
## Templates
- **Provider Cheat Sheet**: See `references/provider_cheat_sheet.md` for provider selection.
- **Cost Calculator**: See `scripts/cost_calculator.py` for estimating credit usage.
- **Integration Code Templates**:
```javascript
// JavaScript/Node.js template
const enrichContact = async (name, company) => {
// Check cache first
const cached = await checkCache(name, company);
if (cached) return cached;
// Try providers in sequence
const providers = ['apollo', 'hunter', 'rocketreach'];
for (const provider of providers) {
try {
const result = await callProvider(provider, {name, company});
if (result.email) {
await saveToCache(result);
return result;
}
} catch (error) {
console.log(`${provider} failed, trying next...`);
}
}
// Fallback to AI research
return await aiResearch(name, company);
};
```
---
## Tips
- **Pre-build waterfalls per motion** so GTM teams can call a single orchestration command rather than juggling providers.
- **Instrument cache hit rates**; alert RevOps when cache effectiveness drops below target to avoid spike in credits.
- **Rotate premium providers** each quarter to negotiate better volume discounts and diversify coverage gaps.
- **Pair enrichment with QA hooks** (e.g., verification APIs, sampling) before syncing into CRM to prevent bad data cascades.
---
*Progressive disclosure: Load full provider details and code examples only when actively optimizing enrichment workflows*
This skill optimizes provider selection, routing, and credit usage across 150+ enrichment sources to deliver company and contact intelligence efficiently. It provides a layered framework for balancing data quality and cost, with ready-made waterfalls, caching rules, and provider-specific tips. Use it to make enrichment predictable, reduce wasted credits, and improve match rates for GTM and data teams.
The skill inspects input types (LinkedIn URL, name+company, domain, or raw email) and routes requests through prioritized provider sequences (waterfalls) that trade off success rate and credit cost. It applies caching, batch processing, and progressive enrichment stages, and scores results with a data quality formula that weights multi-source validation, completeness, verification, and recency. There are templates and provider matrices for email discovery, company firmographics, technographics, and intent signals.
How do I choose a waterfall for budget vs. success?
Use the budget waterfall (cache → Hunter → public sources) for low cost and the balanced or maximum success waterfalls when you need higher match rates; reserve premium providers for high-value targets.
What caching TTLs should I use?
Common defaults: email 30 days, company 90 days, intent 7 days, static data indefinite; adjust based on your data freshness needs and hit rates.