fork-intelligence skill

/plugins/gh-tools/skills/fork-intelligence

This skill uncovers valuable GitHub forks by analyzing branch divergence and upstream activity to reveal meaningful, non-starred work.

npx playbooks add skill terrylica/cc-skills --skill fork-intelligence

---
name: fork-intelligence
description: Discover valuable GitHub fork divergence beyond stars. TRIGGERS - fork analysis, fork intelligence, find forks, valuable forks, fork divergence, fork discovery, upstream forks.
allowed-tools: Read, Bash, Grep, Glob
---

# Fork Intelligence

Systematic methodology for discovering valuable work in GitHub fork ecosystems. Stars-only filtering misses 60-100% of substantive forks — this skill uses branch-level divergence analysis, upstream PR cross-referencing, and domain-specific heuristics to find what matters.

Validated empirically across 10 repositories spanning Python, Rust, TypeScript, C++/Python, and Node.js (tensortrade, backtesting.py, kokoro, pymoo, firecrawl, barter-rs, pueue, dukascopy-node, ArcticDB, flowsurface).

## FIRST — TodoWrite Task Templates

**MANDATORY**: Select and load the appropriate template before any fork analysis.

### Template A — Full Analysis (new repository)

```
1. Get upstream baseline (stars, forks, default branch, last push)
2. List all forks with pagination, note timestamp clusters
3. Filter to unique-timestamp forks (skip bulk mirrors)
4. Check default branch divergence (ahead_by/behind_by)
5. Check non-default branches for all forks with recent push or >1 branch
6. Evaluate commit content, author emails, tags/releases
7. Cross-reference upstream PR history from fork owners
8. Tier ranking and cross-fork convergence analysis
9. Produce report with actionable recommendations
```

### Template B — Quick Scan (triage only)

```
1. Get upstream baseline
2. List forks, filter by timestamp clustering
3. Check default branch divergence only
4. Report forks with ahead_by > 0
```

### Template C — Targeted Fork Evaluation (specific fork)

```
1. Compare fork vs upstream on all branches
2. Examine commit messages and changed files
3. Check for tags/releases, open issues, PRs
4. Assess cherry-pick viability
```

---

## Signal Priority Order

Ranked by empirical reliability across 10 repositories. See [signal-priority.md](./references/signal-priority.md) for details.

| Rank | Signal                          | Reliability | What It Catches                                      |
| ---- | ------------------------------- | ----------- | ---------------------------------------------------- |
| 1    | **Branch-level divergence**     | Highest     | Work on feature branches (50%+ of substantive forks) |
| 2    | **Upstream PR cross-reference** | High        | Rebased/force-pushed work invisible to compare API   |
| 3    | **Tags/releases on fork**       | High        | Independent maintenance intent                       |
| 4    | **Commit email domains**        | High        | Institutional contributors (`@company.com`)          |
| 5    | **Timestamp clustering**        | Medium      | Eliminates 85%+ mirror noise                         |
| 6    | **Cross-fork convergence**      | Medium      | Reveals unmet upstream demand                        |
| 7    | **Stars**                       | Lowest      | Often anti-correlated with actual value              |

---

## Pipeline — 7 Steps

### Step 1: Upstream Baseline

```bash
UPSTREAM="OWNER/REPO"
gh api "repos/$UPSTREAM" --jq '{forks_count, pushed_at, default_branch, stargazers_count}'
```

### Step 2: List All Forks + Timestamp Clustering

```bash
# List all forks with activity signals
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count, default_branch}'
```

**Timestamp clustering**: Forks sharing exact `pushed_at` with upstream are bulk mirrors created by GitHub's fork mechanism and never touched. Group by `pushed_at` — forks with unique timestamps warrant investigation. This alone eliminates 85%+ of noise.

```bash
# Filter to unique-timestamp forks (skip bulk mirrors)
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count}' | \
  jq -s 'group_by(.pushed_at) | map(select(length == 1)) | flatten'
```

### Step 3: Default Branch Divergence

```bash
BRANCH=$(gh api "repos/$UPSTREAM" --jq '.default_branch')

# For each candidate fork
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:$BRANCH" \
  --jq '{ahead_by, behind_by, status}'
```

The `status` field meanings:

- `identical` — pure mirror, skip
- `behind` — stale mirror, skip
- `diverged` — has original commits AND is behind (interesting)
- `ahead` — has original commits, up-to-date with upstream (rare, most valuable)

**Important**: Always compare from the upstream repo's perspective (`repos/UPSTREAM/compare/...`). The reverse direction (`repos/FORK/compare/...`) returns 404 for some repositories.
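
As a minimal sketch of how the status values listed above might drive triage, the helper below maps a compare status to an action. The function name `status_action` and the labels `skip`, `inspect`, and `unknown` are illustrative assumptions, not part of the skill. A `skip` here applies only to the branch that was compared.

```bash
# Hypothetical helper: map a compare API status to a triage action
# for the branch that was compared.
status_action() {
  case "$1" in
    identical|behind) echo "skip" ;;     # mirror or stale mirror, no original commits
    diverged|ahead)   echo "inspect" ;;  # fork carries original commits
    *)                echo "unknown" ;;  # unexpected value: review manually
  esac
}

# Example (placeholders as in the commands above):
#   status_action "$(gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:$BRANCH" --jq '.status')"
```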

### Step 4: Non-Default Branch Analysis (CRITICAL)

**This is the single biggest methodology improvement.** Across all 10 repos tested, 50%+ of the most valuable fork work lived exclusively on feature branches.

Examples:

- flowsurface/aviu16: 7,000-line GPU shader heatmap only on `shader-heatmap`
- ArcticDB/DerThorsten: 147 commits across `conda_build`, `clang`, `apple_changes`
- pueue/FrancescElies: Duration display only on `cesc/duration`
- barter-rs: 6 of 12 top forks had work only on feature branches

```bash
# List branches on a fork
gh api "repos/FORK_OWNER/REPO/branches" --jq '.[].name' | head -20

# Check divergence on a specific branch
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:FEATURE_BRANCH" \
  --jq '{ahead_by, behind_by, status}'
```

**Heuristics for which forks need branch checks**:

- Any fork with `pushed_at` more recent than upstream but `ahead_by == 0` on default branch
- Any fork with more than 1 branch
- Any fork with more than 10 branches, which strongly suggests non-trivial work (ArcticDB: Rohan-flutterint had 197 branches)
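
The heuristics above could be encoded roughly as follows. This is a sketch under stated assumptions: the function name `needs_branch_check` and its argument order are hypothetical, and the timestamps are assumed to be the ISO 8601 UTC strings the GitHub API returns for `pushed_at`.

```bash
# Hypothetical helper: decide whether a fork warrants a per-branch check.
# Arguments: fork pushed_at, upstream pushed_at (ISO 8601 UTC strings),
# ahead_by on the default branch, and the fork's branch count.
needs_branch_check() {
  local fork_pushed="$1" upstream_pushed="$2" ahead_by="$3" branch_count="$4"
  # Same-format UTC timestamps sort lexicographically, so a string
  # comparison tells which push is more recent.
  if [ "$fork_pushed" \> "$upstream_pushed" ] && [ "$ahead_by" -eq 0 ]; then
    return 0
  fi
  # More than one branch: work may live off the default branch.
  if [ "$branch_count" -gt 1 ]; then
    return 0
  fi
  return 1
}

# Example:
#   needs_branch_check "$FORK_PUSHED_AT" "$UPSTREAM_PUSHED_AT" 0 3 && echo "check branches"
```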

### Step 5: Commit Content Evaluation

```bash
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:BRANCH" \
  --jq '.commits[] | {sha: .sha[:8], message: .commit.message | split("\n")[0], date: .commit.committer.date[:10], author: .commit.author.email}'
```

**What to look for**:

- Commit email domains reveal institutional contributors (`@man.com`, `@quantstack.net`)
- Subtract merge commits from ahead_by count (e.g., akeda2/pueue showed 35 ahead but 28 were upstream merges)
- Build system changes (`CMakeLists.txt`, `Cargo.toml`, `pyproject.toml`) indicate platform enablement
- Protobuf schema changes indicate architectural-level features
- Test files alongside source changes signal production-intent work
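
Two of the checks above lend themselves to small helpers. The sketch below tallies author email domains and counts non-merge commits. The names `email_domains` and `non_merge_count` are hypothetical. The second helper assumes merge commits appear in the compare response with two or more entries in `parents`, and it only sees the commits the compare API returns in one response.

```bash
# Hypothetical helper: tally commit author email domains read from stdin
# (one address per line), ignoring GitHub noreply addresses.
email_domains() {
  awk -F'@' '
    NF == 2 && $2 !~ /noreply\.github\.com$/ { count[$2]++ }
    END { for (d in count) print count[d], d }
  ' | sort -rn
}

# Hypothetical helper: count commits on a fork branch that are not merges.
# Usage: non_merge_count UPSTREAM_OWNER/REPO BASE_BRANCH FORK_OWNER:BRANCH
non_merge_count() {
  gh api "repos/$1/compare/$2...$3" \
    --jq '[.commits[] | select((.parents | length) < 2)] | length'
}

# Example (placeholders as in the command above):
#   gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:BRANCH" \
#     --jq '.commits[].commit.author.email' | email_domains
```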

### Step 6: Fork-Specific Signals

```bash
# Tags/releases (strongest independent maintenance signal)
gh api "repos/FORK_OWNER/REPO/tags" --jq '.[].name' | head -10
gh api "repos/FORK_OWNER/REPO/releases" --jq '.[] | {tag_name, name, published_at}' | head -5

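# Open issues on the fork (signals independent project maintenance)
# The issues endpoint also returns pull requests, so filter them out before counting
gh api "repos/FORK_OWNER/REPO/issues?state=open" --jq '[.[] | select(.pull_request | not)] | length'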

# Check if repo was renamed (strong divergence intent signal)
gh api "repos/FORK_OWNER/REPO" --jq '.name'
```

| Signal                    | Strength                  | Example                                 |
| ------------------------- | ------------------------- | --------------------------------------- |
| Tags/releases on fork     | Highest                   | pueue/freesrz93 had 6 releases          |
| Open PRs against upstream | High                      | Formal proposals with review context    |
| Open issues on the fork   | High                      | Independent project maintenance         |
| Repo renamed              | Medium                    | flowsurface/sinaha81 became volume_flow |
| Build config changes      | High (compiled languages) | Cargo.toml, CMakeLists.txt diff         |
| Description changed       | Weak                      | Many vanity renames with no code        |

### Step 7: Cross-Fork Convergence + Upstream PR History

```bash
# Check upstream PRs from fork owners
gh api "repos/$UPSTREAM/pulls?state=all" --paginate \
  --jq '.[] | select(.head.repo.fork) | {number, title, state, user: .user.login}'
```

**Cross-fork convergence**: When multiple forks independently solve the same problem, it signals unmet upstream demand:

- firecrawl: 3 forks adopted Patchright for anti-detection
- flowsurface: 3 forks added technical indicators independently
- kokoro: 2 independent batched inference implementations
- barter-rs: 4 forks added Bybit support

**Upstream PR cross-reference catches**:

- Rebased/force-pushed work invisible to compare API
- Work that was merged upstream (fork shows 0 ahead but was historically significant)
- Declined PRs with valuable code that the fork still maintains
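
One rough way to spot cross-fork convergence is to count how many distinct forks mention the same keyword in their commit subjects. The sketch below assumes a hypothetical input format of one `FORK<TAB>COMMIT SUBJECT` pair per line, assembled from the Step 5 output. The function name `convergence_count` is also an assumption, and keyword matching is only a crude proxy for "solved the same problem".

```bash
# Hypothetical helper: count distinct forks whose commit subjects mention a
# keyword (case-insensitive). Reads "fork<TAB>subject" lines on stdin.
convergence_count() {
  awk -F'\t' -v kw="$1" '
    index(tolower($2), tolower(kw)) { seen[$1] = 1 }
    END { n = 0; for (f in seen) n++; print n }
  '
}

# Example:
#   convergence_count bybit < fork-commit-subjects.tsv
```

A count of two or more for the same keyword is a prompt to read the matching commits by hand, not a conclusion in itself.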

---

## Tier Classification

After running the pipeline, classify forks into tiers:

| Tier                          | Criteria                                                  | Action                                  |
| ----------------------------- | --------------------------------------------------------- | --------------------------------------- |
| **Tier 1: Major Extensions**  | New features, architectural changes, >10 original commits | Deep evaluation, cherry-pick candidates |
| **Tier 2: Targeted Features** | Focused additions, bug fixes, 2-10 commits                | Cherry-pick individual commits          |
| **Tier 3: Infrastructure**    | CI/CD, packaging, deployment, docs                        | Evaluate if relevant to your setup      |
| **Tier 4: Historical**        | Merged upstream or stale but once significant             | Note for context, no action needed      |
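
The commit-count thresholds in the table can serve as a first-pass hint, sketched below. The function name `tier_hint` and its output labels are hypothetical. Tier 3 and Tier 4 depend on what the commits contain and on upstream history, so they cannot be inferred from a count and are left to manual review.

```bash
# Hypothetical helper: first-pass tier hint from the number of original
# (non-merge) commits. Only Tier 1 and Tier 2 have count-based criteria.
tier_hint() {
  local commits="$1"
  if [ "$commits" -gt 10 ]; then
    echo "tier-1 candidate"
  elif [ "$commits" -ge 2 ]; then
    echo "tier-2 candidate"
  else
    echo "manual review"
  fi
}
```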

---

## Domain-Specific Patterns

Different codebases exhibit different fork behaviors. See [domain-patterns.md](./references/domain-patterns.md) for full details.

| Domain                      | Key Pattern                                                                | Example                                               |
| --------------------------- | -------------------------------------------------------------------------- | ----------------------------------------------------- |
| **Scientific/ML**           | Researchers fork-implement-publish-vanish, zero social engagement          | pymoo: 300-file fork with 0 stars                     |
| **Trading/Finance**         | Exchange connectors dominate; best forks are private                       | barter-rs: 4 independent Bybit impls                  |
| **Infrastructure/DevTools** | Self-hosting/SaaS-removal is the dominant theme                            | firecrawl: devflowinc/firecrawl-simple (630 stars)    |
| **C++/Python Mixed**        | Feature work lives on branches; email domains reveal institutions          | ArcticDB: @man.com, @quantstack.net                   |
| **Node.js Libraries**       | Check npm publication as separate packages                                 | dukascopy-node: kyo06 published `dukascopy-node-plus` |
| **Rust CLI**                | Cargo.toml diff is reliable quick filter; "superset" forks add subcommands | pueue: freesrz93 added 7 subcommands                  |

---

## Quick-Scan Pipeline (5-minute triage)

For rapid triage of any new repo:

```bash
UPSTREAM="OWNER/REPO"
BRANCH=$(gh api "repos/$UPSTREAM" --jq '.default_branch')

# 1. Baseline
gh api "repos/$UPSTREAM" --jq '{forks_count, pushed_at, stargazers_count}'

# 2. Forks with unique timestamps (skip mirrors)
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count}' | \
  jq -s 'group_by(.pushed_at) | map(select(length == 1)) | flatten | sort_by(.pushed_at) | reverse'

# 3. Check ahead_by for each candidate
# (loop over candidates from step 2)

# 4. Check upstream PRs from fork authors
gh api "repos/$UPSTREAM/pulls?state=all" --paginate \
  --jq '.[] | select(.head.repo.fork) | {number, title, state, user: .user.login}'
```
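
The loop left as a placeholder in step 3 of the block above might look like the following sketch. The names `fork_owner` and `scan_forks` are hypothetical, and the sketch assumes each fork uses the same default branch name as the upstream.

```bash
# Hypothetical helper: extract the owner from an OWNER/REPO full name.
fork_owner() {
  printf '%s\n' "${1%%/*}"
}

# Hypothetical helper: read fork full names (OWNER/REPO, one per line) on
# stdin and print default-branch divergence against the upstream.
# Usage: scan_forks UPSTREAM_OWNER/REPO BRANCH < candidates.txt
scan_forks() {
  local upstream="$1" branch="$2" fork owner
  while IFS= read -r fork; do
    [ -n "$fork" ] || continue
    owner="$(fork_owner "$fork")"
    # Compare from the upstream side; report failures (e.g. 404) and move on.
    gh api "repos/$upstream/compare/$branch...$owner:$branch" \
      --jq "{fork: \"$fork\", ahead_by, behind_by, status}" </dev/null \
      || echo "compare failed for $fork" >&2
  done
}
```

Candidates can be fed in from step 2 by emitting `full_name` only, for example `... | jq -r '.[].full_name' | scan_forks "$UPSTREAM" "$BRANCH"`.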

---

## Known Limitations

| Limitation                                      | Impact                                 | Workaround                                                        |
| ----------------------------------------------- | -------------------------------------- | ----------------------------------------------------------------- |
| GitHub compare API 250-commit limit             | Highly divergent forks may truncate    | Run `gh api -i "repos/FORK/commits?per_page=1"` and read the last page number in the `Link` header as the total count |
| Private forks invisible                         | Trading firms keep best work private   | Accepted limitation                                               |
| Force-pushed branches break compare API         | Shows 0 ahead despite significant work | Cross-reference upstream PR history                               |
| Renamed forks may break API calls               | Old URLs may 404                       | Use `gh api repos/FORK_OWNER/REPO --jq '.name'` to detect renames |
| Rate limiting on large fork ecosystems          | >1000 forks = many API calls           | Use timestamp clustering to reduce calls by 85%+                  |
| Maintainer dev forks look like independent work | Branch names 1:1 with upstream PRs     | Cross-reference branch names against upstream PR branch names     |
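
For the commit-count workaround, the total can be read from the `Link` response header when requesting one commit per page. The sketch below parses that header. The function name `last_page_from_headers` is hypothetical, and it assumes the response headers are included in the output, as with `gh api -i`.

```bash
# Hypothetical helper: read HTTP response headers on stdin and print the page
# number marked rel="last" in the Link header. With per_page=1 this equals
# the number of commits. Prints 1 when no rel="last" link is present,
# i.e. the whole result fits on a single page.
last_page_from_headers() {
  local last
  last="$(grep -i '^link:' \
    | sed -n 's/.*[?&]page=\([0-9][0-9]*\)>; rel="last".*/\1/p')"
  printf '%s\n' "${last:-1}"
}

# Example:
#   gh api -i "repos/FORK_OWNER/REPO/commits?per_page=1" | last_page_from_headers
```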

---

## Report Template

Use this structure for the final analysis report:

```markdown
# Fork Analysis Report: OWNER/REPO

**Repository**: OWNER/REPO (N stars, M forks)
**Analysis date**: YYYY-MM-DD

## Fork Landscape Summary

| Metric                                | Value  |
| ------------------------------------- | ------ |
| Total forks                           | N      |
| Pure mirrors                          | N (X%) |
| Divergent forks (ahead on any branch) | N      |
| Substantive forks (meaningful work)   | N      |
| Stars-only miss rate                  | X%     |

## Tiered Ranking

### Tier 1: Major Extensions

(fork details with ahead_by, key features, files changed)

### Tier 2: Targeted Features

...

### Tier 3: Infrastructure/Packaging

...

## Cross-Fork Convergence Patterns

(themes that multiple forks independently implemented)

## Actionable Recommendations

- Cherry-pick candidates
- Feature inspiration
- Security fixes
```

---

## Post-Change Checklist

After modifying THIS skill:

1. [ ] YAML frontmatter valid (no colons in description)
2. [ ] Trigger keywords current in description
3. [ ] All `./references/` links resolve
4. [ ] Pipeline steps numbered consistently
5. [ ] Shell commands tested against a real repository
6. [ ] Append changes to [evolution-log.md](./references/evolution-log.md)

Overview

This skill discovers valuable, non-obvious work in GitHub fork ecosystems by looking beyond stars. It combines branch-level divergence analysis, upstream PR cross-referencing, tags/releases, timestamp clustering, and domain heuristics to surface forks worth attention. The output is a tiered, actionable report with cherry-pick candidates and maintenance signals.

How this skill works

Start by taking an upstream baseline (stars, forks, default branch, last push) and listing forks with pagination. Filter out bulk mirrors using timestamp clustering, then compare default and non-default branches to detect ahead/behind/diverged status. Enrich findings with commit content checks, tag/release detection, open-issue counts, repo rename detection, and upstream PR cross-references to recover force-pushed or rebased work. Rank forks into tiers and produce a concise recommendations report.

When to use it

- Find feature work that stars-only filters miss.
- Triage a new repository to identify maintenance or feature candidates quickly.
- Evaluate a specific fork for cherry-pick or upstream merge readiness.
- Prioritize work for backporting, packaging, or platform enablement.
- Detect cross-fork convergence indicating unmet upstream demand.

Best practices

- Always load the appropriate template (Full Analysis, Quick Scan, or Targeted Evaluation) before running analysis.
- Use timestamp clustering to remove mirror noise; it typically eliminates 85%+ of the noise.
- Check non-default branches for any fork with recent pushes or multiple branches, since many substantive changes live there.
- Cross-reference upstream PR history to recover force-pushed or rebased contributions hidden from compare APIs.
- Prioritize signals in this order: branch divergence, upstream PRs, tags/releases, commit email domains, timestamp clustering, cross-fork convergence, then stars.

Example use cases

- 5-minute triage of a new open-source repo to produce a short list of divergent forks.
- Deep analysis to find Tier 1 major extensions suitable for cherry-pick into upstream.
- Targeted evaluation of a single fork to produce a patch/merge plan and list of relevant commits.
- Identify infrastructure and packaging work (CI/CD, Cargo.toml, CMakeLists) useful for releases and distribution.
- Detect multiple forks solving the same problem to inform product decisions or feature prioritization.

FAQ

How do you avoid bulk mirrors and forks with no real work?

Group forks by pushed_at timestamps and skip clusters where many forks share identical timestamps; unique timestamps flag forks worth inspecting.

What if a fork force-pushed and shows 0 ahead_by?

Cross-reference upstream PR history and author activity; force-pushed work often appears in PR records even when compare APIs show no ahead commits.