
sec-edgar-pipeline skill

/universal/data/sec-edgar-pipeline

This skill guides SEC EDGAR extraction workflows from setup to report generation, enabling rapid project creation, code generation, and data export.

npx playbooks add skill bobmatnyc/claude-mpm-skills --skill sec-edgar-pipeline

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: sec-edgar-pipeline
description: "SEC EDGAR extraction pipeline: setup, filing discovery by CIK, recipe-driven extraction, and report generation."
version: 1.0.0
category: universal
author: Claude MPM Team
license: MIT
progressive_disclosure:
  entry_point:
    summary: "EDGAR pipeline: configure keys + user-agent, find filings by CIK, extract data via recipes/scripts, and export CSV/JSON reports."
    when_to_use: "Building SEC EDGAR extraction flows, running edgar-analyzer CLI, or generating compensation/filing reports from DEF 14A and related filings."
    quick_start: "1. edgar-analyzer setup 2. edgar-analyzer project create <name> 3. edgar-analyzer analyze-project projects/<name> 4. edgar-analyzer generate-code 5. edgar-analyzer run-extraction"
tags:
  - sec
  - edgar
  - filings
  - cik
  - def14a
  - extraction
---

# SEC EDGAR Pipeline

## Overview

This pipeline is centered on `edgar-analyzer` and the EDGAR data sources. The core loop is: configure credentials, create a project with examples, analyze patterns, generate code, run extraction, and export reports.

## Setup (Keys + User Agent)

Use the setup wizard to configure required keys:

```bash
python -m edgar_analyzer setup
# or
edgar-analyzer setup
```

Required entries:

- `OPENROUTER_API_KEY`
- `EDGAR` user agent string (e.g., "Jane Doe jane@example.com")

Optional:

- `JINA_API_KEY` (used by vector search components)
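
A dotenv-style sketch of the result is below. The file name and the `EDGAR_USER_AGENT` variable name are assumptions; only the key names above are confirmed by this skill:

```bash
# .env — illustrative; your setup wizard may store these elsewhere
OPENROUTER_API_KEY=sk-or-...                      # required
JINA_API_KEY=jina_...                             # optional
EDGAR_USER_AGENT="Jane Doe jane@example.com"      # required: real name + contact email
```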

## End-to-End CLI Workflow

```bash
# 1. Create project
edgar-analyzer project create my_project --template minimal

# 2. Add examples + project.yaml
# projects/my_project/examples/*.json

# 3. Analyze examples
edgar-analyzer analyze-project projects/my_project

# 4. Generate extraction code
edgar-analyzer generate-code projects/my_project

# 5. Run extraction
edgar-analyzer run-extraction projects/my_project --output-format csv
```

Outputs land in `projects/<name>/output/`.
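
Step 2 above is the manual part: you supply example files that the analyzer learns patterns from. The shape below is purely illustrative; every field name is an assumption about what `projects/my_project/examples/*.json` might contain:

```json
{
  "cik": "0000320193",
  "filing_type": "DEF 14A",
  "expected": {
    "executive": "Jane Executive",
    "fiscal_year": 2023,
    "salary": 1000000
  }
}
```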

## EDGAR-Specific Conventions

- **CIK** values are 10 digits, zero-padded (e.g., `0000320193`).
- **Rate limit**: the SEC allows at most 10 requests/sec; scripts use ~0.11 s delays to stay under it.
- **User agent** is mandatory; include a real name and contact email (see the sketch below).
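
A minimal Python sketch of these conventions, using EDGAR's public submissions API (the pacing and header handling mirror the rules above; the function name is ours):

```python
import time
import requests

USER_AGENT = "Jane Doe jane@example.com"  # mandatory: real name + contact email
DELAY_S = 0.11  # stays under the SEC's 10 requests/sec ceiling

def fetch_submissions(cik):
    """Fetch a company's filing index from EDGAR's public submissions API."""
    padded = str(cik).zfill(10)  # CIKs are 10-digit, zero-padded
    url = f"https://data.sec.gov/submissions/CIK{padded}.json"
    resp = requests.get(url, headers={"User-Agent": USER_AGENT})
    resp.raise_for_status()
    time.sleep(DELAY_S)  # pace successive calls
    return resp.json()
```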

## Scripted Example (Apple DEF 14A)

`edgar/scripts/fetch_apple_def14a.py` shows the direct flow (a condensed sketch follows the list):

1. Fetch latest DEF 14A metadata
2. Download HTML
3. Parse Summary Compensation Table (SCT)
4. Save raw HTML + extracted JSON + ground truth
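
A condensed version of that flow, assuming only the public EDGAR endpoints and standard parsing libraries. The table-locating step is schematic; the real script's selectors and output paths will differ:

```python
import json
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

HEADERS = {"User-Agent": "Jane Doe jane@example.com"}
CIK = "0000320193"  # Apple, zero-padded

# 1. Fetch the latest DEF 14A metadata from the submissions index
subs = requests.get(
    f"https://data.sec.gov/submissions/CIK{CIK}.json", headers=HEADERS
).json()
recent = subs["filings"]["recent"]
i = recent["form"].index("DEF 14A")  # entries are newest-first
accession = recent["accessionNumber"][i].replace("-", "")
doc = recent["primaryDocument"][i]

# 2. Download the filing HTML
url = f"https://www.sec.gov/Archives/edgar/data/{int(CIK)}/{accession}/{doc}"
html = requests.get(url, headers=HEADERS).text

# 3. Locate candidate tables (the real script matches SCT headings)
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")

# 4. Save raw HTML plus a JSON stub to seed ground truth
with open("def14a_raw.html", "w") as f:
    f.write(html)
with open("sct_extracted.json", "w") as f:
    json.dump({"tables_found": len(tables)}, f, indent=2)
```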

## Recipe-Driven Extraction

`edgar/recipes/sct_extraction/config.yaml` defines a multi-step pipeline (an illustrative layout follows the list):

- Fetch DEF 14A filings by company list
- Extract SCT tables with `SCTAdapter`
- Validate with `sct_validator`
- Write results to `output/sct`
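
The real schema is owned by edgar-analyzer; the YAML below only sketches how the four stages above might be laid out, and every key name is an assumption:

```yaml
# Illustrative shape only — the real schema is defined by edgar-analyzer
companies:
  - cik: "0000320193"   # Apple
  - cik: "0000789019"   # Microsoft
filing_type: "DEF 14A"
steps:
  - fetch_filings
  - extract:
      adapter: SCTAdapter        # pulls the Summary Compensation Table
  - validate:
      validator: sct_validator
output_dir: output/sct
```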

## Report Generation

`edgar/scripts/create_csv_reports.py` converts JSON results into the following timestamped reports (a minimal conversion sketch follows the list):

- `executive_compensation_<timestamp>.csv`
- `top_25_executives_<timestamp>.csv`
- `company_summary_<timestamp>.csv`
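
A minimal sketch of that conversion using only the standard library. The input path and column names are assumptions; the real script defines the actual schema:

```python
import csv
import json
from datetime import datetime

# Load extraction results (path and record shape are illustrative)
with open("projects/my_project/output/results.json") as f:
    rows = json.load(f)  # assumed: a list of per-executive dicts

stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
fields = ["company", "executive", "year", "salary", "total"]  # assumed columns
with open(f"executive_compensation_{stamp}.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    for row in rows:
        writer.writerow({k: row.get(k, "") for k in fields})
```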

## Troubleshooting

- **No filings found**: confirm the CIK formatting and the filing type (DEF 14A vs. the amended DEF 14A/A).
- **API errors**: slow down requests and confirm the user-agent is set.
- **Extraction errors**: regenerate the code or fall back to manual ground truth in the POC scripts.

## Related Skills

- `universal/data/reporting-pipelines`
- `toolchains/python/testing/pytest`

Overview

This skill provides an end-to-end SEC EDGAR extraction pipeline for discovering filings by CIK, driving recipe-based extraction, and generating CSV reports. It guides setup of API keys and a user agent, project creation with examples, automated analysis and code generation, and repeatable extraction runs. The focus is reliable bulk extraction of tables such as the Summary Compensation Table and production of standardized outputs for analysis.

How this skill works

You configure required credentials and a user-agent, create a project with example filings, and run an analysis step that infers extraction patterns. The tool then generates extraction code according to recipes, executes the extraction against EDGAR (respecting rate limits), validates results, and exports JSON and CSV reports to a project output directory. Built-in adapters and validators (for example, an SCTAdapter and sct_validator) support targeted table extraction and quality checks.

When to use it

  • Bulk extract compensation and governance tables (e.g., Summary Compensation Table) across many companies.
  • Automate periodic scraping and reporting for a list of CIKs or filing types.
  • Create ground truth and iterate on extraction patterns during a proof-of-concept.
  • Generate standardized CSV reports for downstream analysis or ML pipelines.
  • Validate and re-run extractions after site changes or parser updates.

Best practices

  • Always set a descriptive user-agent with name and contact email to comply with SEC requirements.
  • Format CIKs as 10-digit, zero-padded strings (e.g., 0000320193) to avoid missed matches.
  • Respect EDGAR rate limits (at most 10 requests/sec); use the built-in delays to prevent throttling.
  • Provide representative examples in the project examples folder to improve automated pattern inference.
  • Use recipe-driven configs to separate fetching, extraction, validation, and output stages for reproducibility.

Example use cases

  • Collect and compare executive compensation across the S&P 500 using the SCT recipe and export top executives CSV.
  • Monitor a curated list of CIKs for new DEF 14A filings and automatically append parsed results to a data lake.
  • Build a POC extractor for a new table: add a few annotated examples, run analyze and generate-code, then validate outputs.
  • Produce company-level summary reports and a consolidated executive compensation snapshot for investor presentations.

FAQ

What credentials are required?

Set a valid OPENROUTER_API_KEY and an EDGAR user-agent string (name and contact email); both are required. A JINA_API_KEY is optional and only needed when using vector search components.

How do I avoid rate limit errors?

Keep requests paced; the pipeline inserts small delays (~0.11 s between requests) by default. If you see API errors, increase the delay or reduce concurrency.