---
plugin: seo
updated: 2026-01-20
name: data-extraction-patterns
description: Common patterns for extracting analytics data from GA4 and GSC with API handling
---

# Data Extraction Patterns

## When to Use

- Setting up analytics data pipelines
- Combining data from multiple sources
- Handling API rate limits and errors
- Caching frequently accessed data
- Building data collection workflows

## API Reference

### Google Analytics 4 (GA4)

**MCP Server**: `mcp-server-google-analytics`

**Key Operations**:
```
get_report({
  propertyId: "properties/123456789",
  dateRange: { startDate: "30daysAgo", endDate: "today" },
  dimensions: ["pagePath", "date"],
  metrics: ["screenPageViews", "averageSessionDuration", "bounceRate"]
})
```

**Useful Metrics**:
| Metric | Description | Use Case |
|--------|-------------|----------|
| screenPageViews | Total page views | Traffic volume |
| sessions | User sessions | Visitor count |
| averageSessionDuration | Avg time in session | Engagement |
| bounceRate | Non-engaged sessions (%) | Content quality |
| engagementRate | Engaged sessions % | True engagement |
| scrolledUsers | Users who scrolled | Content consumption |

**Useful Dimensions**:
| Dimension | Description |
|-----------|-------------|
| pagePath | URL path |
| date | Date (for trending) |
| sessionSource | Traffic source |
| deviceCategory | Desktop/mobile/tablet |

### Google Search Console (GSC)

**MCP Server**: `mcp-server-gsc`

**Key Operations**:
```
search_analytics({
  siteUrl: "https://example.com",
  startDate: "2025-11-27",
  endDate: "2025-12-27",
  dimensions: ["query", "page"],
  rowLimit: 1000
})

get_url_inspection({
  siteUrl: "https://example.com",
  inspectionUrl: "https://example.com/page"
})
```

**Available Metrics**:
| Metric | Description | Use Case |
|--------|-------------|----------|
| clicks | Total clicks from search | Traffic from Google |
| impressions | Times shown in results | Visibility |
| ctr | Click-through rate | Snippet effectiveness |
| position | Average ranking | SEO success |

**Dimensions**:
| Dimension | Description |
|-----------|-------------|
| query | Search query |
| page | Landing page URL |
| country | User country |
| device | Desktop/mobile/tablet |
| date | Date (for trending) |

## Parallel Execution Pattern

### Optimal Data Fetch (All Sources)

```markdown
## Parallel Data Fetch Pattern

When fetching from multiple sources, issue all requests in a SINGLE message
for parallel execution:

┌─────────────────────────────────────────────────────────────────┐
│  MESSAGE 1: Parallel Data Requests                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  [MCP Call 1]: google-analytics.get_report(...)                 │
│  [MCP Call 2]: google-search-console.search_analytics(...)      │
│                                                                  │
│  → All execute simultaneously                                    │
│  → Results return when all complete                              │
│  → ~2x faster than sequential                                    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```
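
For script-based pipelines outside MCP, the same idea translates to shell background jobs. A minimal sketch, assuming hypothetical `fetch_ga4` and `fetch_gsc` wrappers around the calls above:

```bash
# Run both fetches as background jobs; fetch_ga4 and fetch_gsc are
# hypothetical wrappers around the MCP calls shown above
fetch_ga4 > /tmp/ga4.json &
GA4_PID=$!
fetch_gsc > /tmp/gsc.json &
GSC_PID=$!

# Wait on each PID separately so one failure doesn't mask the other
wait "$GA4_PID"; GA4_STATUS=$?
wait "$GSC_PID"; GSC_STATUS=$?

[ "$GA4_STATUS" -eq 0 ] || echo "GA4 fetch failed" >&2
[ "$GSC_STATUS" -eq 0 ] || echo "GSC fetch failed" >&2
```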

### Sequential (When Needed)

Some operations require sequential execution:

```markdown
## Sequential Pattern (Dependencies)

When one request depends on another's result:

┌─────────────────────────────────────────────────────────────────┐
│  MESSAGE 1: Get list of pages                                   │
│  → Returns: ["/page1", "/page2", "/page3"]                      │
├─────────────────────────────────────────────────────────────────┤
│  MESSAGE 2: Get details for each page                           │
│  → Uses page list from Message 1                                │
│  → Can parallelize within this message                          │
└─────────────────────────────────────────────────────────────────┘
```
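
In shell terms, the dependency forces one round trip before the fan-out, but the per-page fetches inside step 2 can still run concurrently. A sketch, assuming hypothetical `fetch_page_list` and `fetch_page_details` helpers:

```bash
# Step 1 must complete first: the page list drives step 2.
# fetch_page_list / fetch_page_details are hypothetical helpers.
PAGES=$(fetch_page_list)

# Step 2: fan out per-page fetches in parallel, then wait for all of them
while IFS= read -r page; do
    fetch_page_details "$page" > "/tmp/details_$(echo "$page" | tr '/' '_').json" &
done <<< "$PAGES"
wait
```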

## Rate Limiting

### API Rate Limits

| API | Limit | Strategy |
|-----|-------|----------|
| GA4 | 10 QPS per property | Batch dimensions |
| GSC | 1,200 requests/min | Paginate large exports |
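
For exports larger than one response, loop with an increasing row offset until the API returns fewer rows than requested. A sketch assuming a hypothetical `gsc_query` wrapper; the underlying Search Analytics API pages via `startRow` and `rowLimit`, though the MCP tool's parameter names may differ:

```bash
# Export a large GSC result set in fixed-size pages.
# gsc_query is a hypothetical wrapper whose --start-row/--row-limit flags
# mirror the Search Analytics API's startRow/rowLimit paging parameters.
ROW_LIMIT=1000
START_ROW=0
: > /tmp/gsc_export.ndjson

while :; do
    BATCH=$(gsc_query --start-row "$START_ROW" --row-limit "$ROW_LIMIT")
    COUNT=$(echo "$BATCH" | jq '.rows | length')

    echo "$BATCH" | jq -c '.rows[]?' >> /tmp/gsc_export.ndjson

    # A short page means the result set is exhausted
    [ "$COUNT" -lt "$ROW_LIMIT" ] && break
    START_ROW=$((START_ROW + ROW_LIMIT))
done
```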

### Retry Pattern

```bash
#!/bin/bash
# Retry with exponential backoff

MAX_RETRIES=3
RETRY_DELAY=5

fetch_with_retry() {
    local url="$1"
    local attempt=1
    local delay=$RETRY_DELAY  # local copy so repeated calls start fresh

    while [ $attempt -le $MAX_RETRIES ]; do
        # Body goes to the temp file; -w prints only the HTTP status to stdout
        response=$(curl -s -w "%{http_code}" -o /tmp/response.json "$url")
        http_code="${response: -3}"

        if [ "$http_code" = "200" ]; then
            cat /tmp/response.json
            return 0
        elif [ "$http_code" = "429" ]; then
            echo "Rate limited, waiting ${delay}s..." >&2
            sleep "$delay"
            delay=$((delay * 2))  # exponential backoff
        else
            echo "Error: HTTP $http_code" >&2
            return 1
        fi

        attempt=$((attempt + 1))
    done

    echo "Max retries exceeded" >&2
    return 1
}
```
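
Usage is a single call: the function prints the response body on success, so a failure can route into the fallback logic described under Graceful Degradation. The URL here is illustrative:

```bash
# Example: fall through to degraded mode if all retries fail
if ! REPORT=$(fetch_with_retry "https://api.example.com/report"); then
    echo "Proceeding without this source (see Graceful Degradation)" >&2
fi
```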

## Caching Pattern

### Session-Based Cache

```bash
# Cache structure
SESSION_PATH="/tmp/seo-performance-20251227-143000-example"
CACHE_DIR="${SESSION_PATH}/cache"
CACHE_TTL=3600  # 1 hour in seconds

mkdir -p "$CACHE_DIR"

# Cache key generation
cache_key() {
    echo "$1" | md5sum | cut -d' ' -f1
}

# Check cache
get_cached() {
    local key=$(cache_key "$1")
    local cache_file="${CACHE_DIR}/${key}.json"

    if [ -f "$cache_file" ]; then
        local age=$(($(date +%s) - $(stat -f%m "$cache_file" 2>/dev/null || stat -c%Y "$cache_file")))
        if [ $age -lt $CACHE_TTL ]; then
            cat "$cache_file"
            return 0
        fi
    fi
    return 1
}

# Save to cache
save_cache() {
    local key=$(cache_key "$1")
    local cache_file="${CACHE_DIR}/${key}.json"
    cat > "$cache_file"
}

# Usage
CACHE_KEY="ga4_${URL}_${DATE_RANGE}"
if ! RESULT=$(get_cached "$CACHE_KEY"); then
    RESULT=$(fetch_from_api)
    echo "$RESULT" | save_cache "$CACHE_KEY"
fi
```

## Date Range Standardization

### Common Date Ranges

```bash
# Standard date range calculations
TODAY=$(date +%Y-%m-%d)

case "$RANGE" in
    "7d")
        START_DATE=$(date -v-7d +%Y-%m-%d 2>/dev/null || date -d "7 days ago" +%Y-%m-%d)
        ;;
    "30d")
        START_DATE=$(date -v-30d +%Y-%m-%d 2>/dev/null || date -d "30 days ago" +%Y-%m-%d)
        ;;
    "90d")
        START_DATE=$(date -v-90d +%Y-%m-%d 2>/dev/null || date -d "90 days ago" +%Y-%m-%d)
        ;;
    "mtd")
        START_DATE=$(date +%Y-%m-01)
        ;;
    "ytd")
        START_DATE=$(date +%Y-01-01)
        ;;
esac

END_DATE="$TODAY"
```

### API-Specific Formats

| API | Format | Example |
|-----|--------|---------|
| GA4 | Relative or ISO | "30daysAgo", "2025-12-01" |
| GSC | ISO 8601 | "2025-12-01" |

## Graceful Degradation

### Data Source Fallback

```markdown
## Fallback Strategy

When a data source is unavailable:

┌─────────────────────────────────────────────────────────────────┐
│  PRIMARY SOURCE      │  FALLBACK           │  LAST RESORT       │
├──────────────────────┼─────────────────────┼────────────────────┤
│  GA4 traffic data    │  GSC clicks         │  Estimate from GSC │
│  GSC search perf     │  Manual SERP check  │  WebSearch SERP    │
│  CWV (CrUX)          │  PageSpeed API      │  Lighthouse CLI    │
└──────────────────────┴─────────────────────┴────────────────────┘
```
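
The first row of that table, sketched in shell form; `fetch_ga4_traffic` and `fetch_gsc_clicks` are hypothetical helpers, and each step runs only if the previous source failed:

```bash
# Try sources in priority order; record which one supplied the data.
# fetch_ga4_traffic / fetch_gsc_clicks are hypothetical helpers.
if TRAFFIC=$(fetch_ga4_traffic "$URL"); then
    SOURCE="ga4"
elif TRAFFIC=$(fetch_gsc_clicks "$URL"); then
    SOURCE="gsc"    # fallback: GSC clicks approximate traffic
else
    SOURCE="none"   # last resort: downstream marks the report as partial
fi
echo "Traffic source: $SOURCE" >&2
```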

### Partial Data Output

```markdown
## Analysis Report (Partial Data)

### Data Availability

| Source | Status | Impact |
|--------|--------|--------|
| GA4 | NOT CONFIGURED | Missing engagement metrics |
| GSC | AVAILABLE | Full search data |

### Analysis Notes

This analysis is based on limited data sources:
- Search performance metrics are complete (GSC)
- Engagement metrics unavailable (no GA4)

**Recommendation**: Configure GA4 for complete analysis.
Run `/setup-analytics` to add Google Analytics.
```

## Unified Data Model

### Combined Output Structure

```json
{
  "metadata": {
    "url": "https://example.com/page",
    "fetchedAt": "2025-12-27T14:30:00Z",
    "dateRange": {
      "start": "2025-11-27",
      "end": "2025-12-27"
    }
  },
  "sources": {
    "ga4": {
      "available": true,
      "metrics": {
        "pageViews": 2450,
        "avgTimeOnPage": 222,
        "bounceRate": 38.2,
        "engagementRate": 64.5
      }
    },
    "gsc": {
      "available": true,
      "metrics": {
        "impressions": 15200,
        "clicks": 428,
        "ctr": 2.82,
        "avgPosition": 4.2
      },
      "topQueries": [
        {"query": "seo guide", "clicks": 156, "position": 4}
      ]
    }
  },
  "computed": {
    "healthScore": 72,
    "status": "GOOD"
  }
}
```
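
One way to assemble this structure in a shell pipeline is with `jq`. A minimal sketch, assuming hypothetical `/tmp/ga4_metrics.json` and `/tmp/gsc_metrics.json` files that hold just the per-source metrics objects, and `$URL` set by the caller:

```bash
# Assemble the unified model from pre-extracted metric fragments
jq -n \
  --arg url "$URL" \
  --arg fetched "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --slurpfile ga4 /tmp/ga4_metrics.json \
  --slurpfile gsc /tmp/gsc_metrics.json \
  '{
     metadata: { url: $url, fetchedAt: $fetched },
     sources: {
       ga4: { available: true, metrics: $ga4[0] },
       gsc: { available: true, metrics: $gsc[0] }
     }
   }'
```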

## Error Handling

### Common Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| 401 Unauthorized | Invalid/expired credentials | Re-run /setup-analytics |
| 403 Forbidden | Missing permissions | Check API access in console |
| 429 Too Many Requests | Rate limit | Wait and retry with backoff |
| 404 Not Found | Invalid property/site | Verify IDs in configuration |
| 500 Server Error | API issue | Retry later, check status page |
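
In a script, these resolutions map naturally onto a case statement. A sketch that distinguishes retryable statuses from those needing operator action:

```bash
# Map HTTP status codes to the resolutions above.
# Return 2 for retryable errors, 1 for errors needing operator action.
handle_api_error() {
    case "$1" in
        401) echo "Credentials invalid or expired; re-run /setup-analytics" >&2 ;;
        403) echo "Missing permissions; check API access in the console" >&2 ;;
        404) echo "Invalid property or site; verify configured IDs" >&2 ;;
        429|5??) return 2 ;;   # retryable: back off and try again
        *)   echo "Unhandled HTTP status: $1" >&2 ;;
    esac
    return 1
}
```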

### Error Output Pattern

```markdown
## Data Fetch Error

**Source**: Google Analytics 4
**Error**: 403 Forbidden
**Message**: "User does not have sufficient permissions for this property"

### Troubleshooting Steps

1. Verify Service Account email in GA4 Admin
2. Ensure "Viewer" role is granted
3. Check Analytics Data API is enabled
4. Wait 5 minutes for permission propagation

### Workaround

Proceeding with available data sources (GSC).
GA4 engagement metrics will not be included in this analysis.
```

## Overview

This skill documents common patterns for extracting analytics data from Google Analytics 4 (GA4) and Google Search Console (GSC) with pragmatic API handling. It focuses on reliable pipelines: parallel and sequential fetching, rate-limit handling, caching, date standardization, and graceful degradation. The material targets engineers building analytics connectors and data collection workflows around MCP servers and shell tooling.

## How This Skill Works

It describes concrete API call shapes for GA4 and GSC, strategies for issuing multiple MCP server requests in parallel, and patterns for sequential flows when results are dependent. It provides reusable patterns for retries with exponential backoff, session-based caching, standardized date ranges, and a unified output model that merges source metrics into a consistent JSON structure. Error handling and fallback behavior are included to keep pipelines resilient.

## When to Use It

- Building analytics pipelines that combine GA4 event metrics and GSC search performance
- Fetching from multiple APIs simultaneously to reduce overall latency
- Implementing retry, backoff, and rate-limit strategies for stable data ingestion
- Caching results for short-lived dashboards or repeated requests to avoid quota spikes
- Creating a unified data model for downstream reporting or ML features

## Best Practices

- Issue independent API requests in a single parallel message to reduce wall-clock time
- Use exponential backoff on 429 responses and cap retries to prevent cascading failures
- Apply short-lived session caching keyed by request parameters to reduce duplicate calls
- Standardize date ranges (7d, 30d, mtd, ytd) and convert them to each API's format
- Design a graceful fallback order (GA4 → GSC → estimates) and emit partial results with status

## Example Use Cases

- Daily ETL job that combines GA4 page views and GSC clicks into a single report
- On-demand page-level health check that pulls GA4 engagement and GSC position in parallel
- Bulk export with pagination and rate-limit batching for historical analysis
- Interactive dashboard that caches recent queries for one hour to stay within quotas
- Automated monitoring that falls back to GSC when GA4 is missing and marks outputs as partial

## FAQ

**How do I choose parallel vs. sequential execution?**

Run independent requests in parallel for speed; use sequential flows when later calls depend on earlier results, then parallelize within each dependent batch.

**What rate limits should I plan for?**

Assume GA4 allows roughly 10 QPS per property and GSC roughly 1,200 requests/min; batch metrics, paginate exports, and back off on 429 responses to stay within quota.

**How does caching integrate with retries and freshness?**

Use short TTLs (e.g., one hour) and key caches by URL and date range. Bypass the cache on authentication errors, and refresh stale entries after a successful retry.