home / skills / transilienceai / communitytools / web_archive_analysis

This skill analyzes Wayback snapshots to detect technology migrations over time, revealing historical stack changes and migration patterns for a domain.

npx playbooks add skill transilienceai/communitytools --skill web_archive_analysis

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
7.3 KB
---
name: web-archive-analysis
description: Uses Wayback Machine to detect technology migrations over time
tools: Bash, WebFetch
model: inherit
hooks:
  PreToolUse:
    - matcher: "WebFetch"
      hooks:
        - type: command
          command: "../../../hooks/skills/pre_rate_limit_hook.sh"
  PostToolUse:
    - matcher: "WebFetch"
      hooks:
        - type: command
          command: "../../../hooks/skills/post_skill_logging_hook.sh"
---

# Web Archive Analysis Skill

## Purpose

Query the Wayback Machine to discover historical technology usage and detect technology migrations over time.

## Operations

### 1. query_cdx_api

Get historical snapshots from the Wayback Machine CDX API.

**Endpoint:**
```
GET http://web.archive.org/cdx/search/cdx
```

**Parameters:**
```
url: {domain}
output: json
filter: statuscode:200
collapse: timestamp:6  # Group by month (YYYYMM)
limit: 100
from: {start_year}
to: {end_year}
```

**Example Request:**
```bash
curl "http://web.archive.org/cdx/search/cdx?url=example.com&output=json&filter=statuscode:200&collapse=timestamp:6&limit=100"
```

**Response Format:**
```json
[
  ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
  ["com,example)/", "20240115120000", "https://example.com/", "text/html", "200", "ABC123...", "45678"]
]
```

### 2. select_snapshots

Choose representative snapshots for analysis.

**Selection Strategy:**
```python
def select_snapshots(all_snapshots):
    # Get snapshots at regular intervals
    intervals = [
        "6 months ago",
        "1 year ago",
        "2 years ago",
        "3 years ago",
        "5 years ago"
    ]

    selected = []
    for interval in intervals:
        target_date = calculate_date(interval)
        closest = find_closest_snapshot(all_snapshots, target_date)
        if closest:
            selected.append(closest)

    return selected
```

**Snapshot Priority:**
1. Recent (baseline for comparison)
2. 1 year ago (detect recent changes)
3. 2-3 years ago (medium-term evolution)
4. 5+ years ago (historical context)

### 3. fetch_archived_content

Retrieve archived pages for analysis.

**Wayback URL Format:**
```
https://web.archive.org/web/{timestamp}/{original_url}
```

**Example:**
```
https://web.archive.org/web/20230115120000/https://example.com/
```

**Headers to Request:**
```
Accept: text/html
User-Agent: TechStackAgent/1.0 (OSINT research)
```

### 4. compare_snapshots

Detect technology changes between snapshots.

**Comparison Points:**
```json
{
  "headers_to_compare": [
    "Server",
    "X-Powered-By",
    "Set-Cookie"
  ],
  "html_elements": [
    "meta[name=generator]",
    "script[src]",
    "link[href]"
  ],
  "patterns_to_track": [
    "/wp-content/",
    "/_next/",
    "/_nuxt/",
    "/static/js/"
  ]
}
```

**Change Detection:**
```python
def detect_changes(old_snapshot, new_snapshot):
    changes = []

    # Compare technologies
    old_tech = extract_technologies(old_snapshot)
    new_tech = extract_technologies(new_snapshot)

    added = new_tech - old_tech
    removed = old_tech - new_tech

    for tech in added:
        changes.append({
            "type": "technology_added",
            "technology": tech,
            "first_seen": new_snapshot.timestamp
        })

    for tech in removed:
        changes.append({
            "type": "technology_removed",
            "technology": tech,
            "last_seen": old_snapshot.timestamp
        })

    return changes
```

### 5. detect_migrations

Identify framework/platform migrations.

**Common Migration Patterns:**
```json
{
  "WordPress → Custom/React": {
    "indicators": [
      "/wp-content/ disappears",
      "React globals appear",
      "/_next/ or /static/js/ paths"
    ],
    "typical_timeline": "6-18 months"
  },
  "AngularJS → Angular": {
    "indicators": [
      "ng-app disappears",
      "ng-version appears",
      "Angular 2+ patterns"
    ],
    "typical_timeline": "12-24 months"
  },
  "jQuery → React/Vue": {
    "indicators": [
      "jQuery CDN removed",
      "Modern framework globals",
      "SPA patterns"
    ],
    "typical_timeline": "6-12 months"
  },
  "On-prem → Cloud": {
    "indicators": [
      "CloudFront/Cloudflare headers appear",
      "AWS/GCP/Azure signatures",
      "CDN usage"
    ],
    "typical_timeline": "3-12 months"
  }
}
```

### 6. extract_historical_tech

Parse archived HTML for technology signals.

**Process:**
1. Fetch archived page
2. Apply same analysis as html_content_analysis skill
3. Record technologies with timestamp
4. Build timeline of technology usage

## Output

```json
{
  "skill": "web_archive_analysis",
  "domain": "string",
  "results": {
    "archive_coverage": {
      "oldest_snapshot": "2015-03-15",
      "newest_snapshot": "2024-01-10",
      "total_snapshots": 450,
      "snapshots_analyzed": 5
    },
    "snapshots_analyzed": [
      {
        "timestamp": "2024-01-10",
        "url": "https://web.archive.org/web/20240110/...",
        "technologies_detected": ["Next.js", "React", "Vercel"]
      },
      {
        "timestamp": "2022-06-15",
        "url": "https://web.archive.org/web/20220615/...",
        "technologies_detected": ["React", "Create React App", "Heroku"]
      },
      {
        "timestamp": "2020-01-20",
        "url": "https://web.archive.org/web/20200120/...",
        "technologies_detected": ["WordPress", "PHP"]
      }
    ],
    "technology_timeline": [
      {
        "technology": "WordPress",
        "first_seen": "2015-03-15",
        "last_seen": "2020-06-01",
        "status": "removed"
      },
      {
        "technology": "React",
        "first_seen": "2020-03-01",
        "last_seen": "present",
        "status": "current"
      },
      {
        "technology": "Next.js",
        "first_seen": "2023-01-15",
        "last_seen": "present",
        "status": "current"
      }
    ],
    "migrations_detected": [
      {
        "type": "CMS → Modern Framework",
        "from": "WordPress",
        "to": "React/Next.js",
        "approximate_date": "2020-Q1 to 2020-Q2",
        "confidence": 85
      },
      {
        "type": "Hosting Migration",
        "from": "Heroku",
        "to": "Vercel",
        "approximate_date": "2023-Q1",
        "confidence": 80
      }
    ],
    "current_vs_historical": {
      "current_stack": ["Next.js", "React", "Vercel"],
      "historical_stack": ["WordPress", "PHP", "Heroku"],
      "major_changes": 2
    }
  },
  "evidence": [
    {
      "type": "archived_snapshot",
      "timestamp": "string",
      "archive_url": "string",
      "technologies": ["array"],
      "analysis_timestamp": "ISO-8601"
    }
  ]
}
```

## Rate Limiting

- Wayback CDX API: 15 requests/minute
- Archived page fetches: 10/minute
- Cache CDX results to avoid repeated queries

## Error Handling

- 404: Domain not archived
- 503: Wayback Machine overloaded - retry with backoff
- Timeout: Increase timeout for archived pages (can be slow)
- Continue with available snapshots on partial failures

## Security Considerations

- Only access public archives
- Respect Wayback Machine rate limits
- Do not store archived content beyond analysis
- Note that archived content may contain outdated security vulnerabilities
- Log all queries for audit

## Confidence Notes

Historical data provides **contextual signals**:
- Confirms technology transitions
- Validates current technology choices
- Lower weight than current direct evidence
- Base confidence: 60-75%

Overview

This skill uses the Wayback Machine to reconstruct a domain's historical technology footprint and detect migrations over time. It queries the CDX API for archived snapshots, fetches representative pages, and builds a timeline of detected frameworks, hosting changes, and frontend stacks. Outputs include snapshot coverage, per-snapshot technology lists, detected migrations, and confidence estimates.

How this skill works

The skill queries the Wayback CDX API with filters (status 200, monthly collapse) to collect available snapshots between specified years. It selects representative snapshots at regular intervals, fetches archived HTML pages, and extracts signals from headers, HTML elements, script/link paths, and common platform patterns. Changes are detected by comparing extracted technologies across snapshots and mapping indicator patterns to known migration types.

When to use it

  • Investigating historical platform or CMS migrations for a target domain
  • Validating when a site moved from on-prem to cloud/CDN or changed hosting provider
  • Building timelines for security research, bug bounty reconnaissance, or pentest planning
  • Confirming disappearance or introduction of specific libraries, frameworks, or CDN footprints
  • Correlating technology transitions with known incidents or upgrade windows

Best practices

  • Cache CDX responses and fetched pages to respect rate limits and reduce repeat queries
  • Analyze a blend of recent, 1-3 year, and 5+ year snapshots to capture short- and long-term trends
  • Request archived pages with a clear User-Agent and Accept headers to avoid accidental blocks
  • Treat archive signals as contextual — corroborate findings with live analysis for high-confidence claims
  • Gracefully handle API errors (503, timeouts) with exponential backoff and continue with available data

Example use cases

  • Detecting a migration from WordPress to a React/Next.js stack and estimating when it occurred
  • Finding when a site adopted CloudFront/Cloudflare headers to infer a cloud/CDN migration timeline
  • Tracing removal of jQuery CDN references and appearance of SPA assets to confirm front-end rewrites
  • Producing a technology timeline for a target to prioritize vulnerability checks on legacy components
  • Correlating hosting changes (Heroku → Vercel) with observed configuration or header differences

FAQ

How accurate are migration dates from archived snapshots?

Dates are approximate and tied to snapshot availability; they provide a contextual window (confidence typically 60–85%) rather than exact cutover times.

What rate limits should I respect when using the Wayback APIs?

Keep CDX queries under ~15 requests/minute and archived page fetches under ~10/minute; cache results to avoid repeated requests.