
This skill helps you acquire and structure web data for analysis by selecting appropriate scraping tools and exporting JSON or CSV outputs.

npx playbooks add skill jinfanzheng/kode-sdk-csharp --skill data-base

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: data-base
description: Data acquisition for web scraping and data collection. Use when the user asks to "爬取数据/抓取网页/scrape data" (crawl data / scrape web pages). Outputs structured JSON/CSV for analysis.
---

## Mental Model

Data acquisition is **converting unstructured web content into structured data**. Choose a tool based on page complexity: JS-heavy → chrome-devtools MCP; static → Python requests.

## Tool Selection

| Page Type | Tool | When to Use |
|-----------|------|-------------|
| Dynamic (JS-rendered, SPAs) | chrome-devtools MCP | React/Vue apps, infinite scroll, login gates |
| Static HTML | Python requests | Blogs, news sites, simple pages |
| Complex/reusable logic | Python script | Multi-step scraping, rate limiting, proxies |
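For the static-HTML row above, a minimal sketch of what a requests + BeautifulSoup extraction could look like; the URL, CSS selectors, and field names are hypothetical placeholders, not part of the skill.

```python
# Static-page scrape sketch: fetch, parse, collect records.
# URL and selectors below are placeholders for illustration only.
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "data-base-bot/1.0 (contact: you@example.com)"}
resp = requests.get("https://example.com/news", headers=headers, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
records = []
for item in soup.select("article"):            # selector depends on the target site
    title = item.select_one("h2")
    link = item.select_one("a")
    records.append({
        "title": title.get_text(strip=True) if title else None,
        "url": link["href"] if link else None,
    })
print(records[:3])
```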

## Anti-Patterns (NEVER)

- Don't scrape without checking robots.txt
- Don't overload servers (default: 1 req/sec)
- Don't scrape personal data without consent
- Don't use Chinese characters in output filenames (ASCII only)
- Don't forget to identify the bot with a clear User-Agent (see the sketch below)
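
A minimal sketch of how these guardrails could be wired together, assuming the standard-library `urllib.robotparser` and a fixed 1 req/sec delay; the URLs and bot identifier are placeholders.

```python
# Guardrails sketch: check robots.txt, identify the bot, throttle to ~1 req/sec.
import time
import urllib.robotparser

import requests

BOT_UA = "data-base-bot/1.0 (contact: you@example.com)"  # placeholder identifier

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    if not robots.can_fetch(BOT_UA, url):
        print(f"skipping {url}: disallowed by robots.txt")
        continue
    resp = requests.get(url, headers={"User-Agent": BOT_UA}, timeout=10)
    print(url, resp.status_code)
    time.sleep(1)  # default rate limit: 1 request per second
```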

## Output Format

- **JSON**: Nested/hierarchical data
- **CSV**: Tabular data
- Filename: `{source}_{timestamp}.{ext}` (ASCII only, e.g., `news_20250115.csv`)
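A small sketch of how outputs following this naming convention could be written with pandas and the standard `json` module; the record fields and `source` label are placeholders.

```python
# Output sketch: {source}_{timestamp}.{ext} with ASCII-only names.
import json
from datetime import datetime

import pandas as pd

records = [{"title": "Example", "url": "https://example.com/a"}]  # placeholder data
stamp = datetime.now().strftime("%Y%m%d")
source = "news"  # ASCII-only source label

# Tabular data -> CSV
pd.DataFrame(records).to_csv(f"{source}_{stamp}.csv", index=False)

# Nested/hierarchical data -> JSON
with open(f"{source}_{stamp}.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```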

## Workflow

1. **Ask**: What data? Which sites? How much?
2. **Select tool** based on page type
3. **Extract** and save structured data
4. **Deliver** file path to user or pass to data-analysis

## Python Environment

**Auto-initialize virtual environment if needed, then execute:**

```bash
cd skills/data-base

if [ ! -f ".venv/bin/python" ]; then
    echo "Creating Python environment..."
    ./setup.sh
fi

.venv/bin/python your_script.py
```

The setup script auto-installs requests, beautifulsoup4, pandas, and related web-scraping dependencies.

## References (load on demand)

For detailed APIs and templates, load: `references/REFERENCE.md`, `references/templates.md`

Overview

This skill performs data acquisition for web scraping and structured data collection. It converts unstructured web pages into JSON or CSV outputs ready for analysis. Use it when you need reliable scraping workflows for both static and JavaScript-driven sites.

How this skill works

I ask what data and sites you need, then choose the appropriate tool based on page complexity: static pages use HTTP requests and HTML parsing, dynamic pages use a Chrome DevTools-backed browser controller, and complex pipelines run as Python scripts. The extractor respects robots.txt and rate limits, identifies itself with a configurable User-Agent, and outputs {source}_{timestamp}.json or .csv files with ASCII-only names.

When to use it

  • You need tabular exports (CSV) or nested JSON from web pages
  • Target pages are JavaScript-heavy or single-page applications
  • You want reproducible, scriptable scraping with rate limits and proxies
  • You need quick one-off scrapes of static sites like blogs or news
  • You plan to feed scraped output into downstream analysis or ETL

Best practices

  • Always check and respect robots.txt before scraping a site
  • Default to conservative rate limiting (example: 1 request/second)
  • Identify the scraper with a clear User-Agent string
  • Avoid collecting personal data without consent and follow privacy rules
  • Use ASCII-only filenames and timestamped outputs (source_YYYYMMDDHHMM.ext)

Example use cases

  • Scrape article metadata (title, author, date, tags) from a news site into CSV
  • Extract product listings and prices from a React e-commerce SPA using Chrome DevTools control
  • Run a repeatable Python pipeline to gather paginated search results with retries and proxies (see the sketch after this list)
  • Produce nested JSON records from forum threads for topic modeling
  • Generate a timestamped dataset to feed into a data analysis notebook
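
For the paginated-results case above, a sketch using requests.Session with urllib3 retries; the endpoint, query parameters, page range, and proxy address are placeholders, and the code assumes the target returns a JSON API response.

```python
# Repeatable pipeline sketch: session with retries, optional proxies, paginated fetch.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retry))
session.headers["User-Agent"] = "data-base-bot/1.0"
# session.proxies = {"https": "http://proxy.example.com:8080"}  # optional, placeholder

results = []
for page in range(1, 4):  # placeholder page range
    resp = session.get(
        "https://example.com/search",           # placeholder endpoint
        params={"q": "widgets", "page": page},  # placeholder query
        timeout=10,
    )
    resp.raise_for_status()
    results.extend(resp.json().get("items", []))  # assumes a JSON response body
print(len(results), "records collected")
```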

FAQ

What output formats are supported?

JSON for hierarchical data and CSV for tabular exports; filenames use source_timestamp.ext.

How do you handle JavaScript-rendered pages?

Use a Chrome DevTools-backed controller to render and interact with pages before extraction.

Are there built-in safety limits?

Yes. The default behavior respects robots.txt, uses a conservative request rate, and sets a clear User-Agent.