This skill helps you acquire and structure web data for analysis by selecting appropriate scraping tools and exporting JSON or CSV outputs.
To add this skill to your agents, run `npx playbooks add skill jinfanzheng/kode-sdk-csharp --skill data-base`.
---
name: data-base
description: Data acquisition for web scraping and data collection. Use when user needs "爬取数据/抓取网页/scrape data". Outputs structured JSON/CSV for analysis.
---
## Mental Model
Data acquisition is **converting unstructured web content into structured data**. Choose a tool based on page complexity: JS-heavy pages → chrome-devtools MCP; static pages → Python requests.
## Tool Selection
| Page Type | Tool | When to Use |
|-----------|------|-------------|
| Dynamic (JS-rendered, SPAs) | chrome-devtools MCP | React/Vue apps, infinite scroll, login gates |
| Static HTML | Python requests | Blogs, news sites, simple pages |
| Complex/reusable logic | Python script | Multi-step scraping, rate limiting, proxies |
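For the static-HTML path, a minimal sketch with `requests` and BeautifulSoup (the User-Agent string and CSS selector are illustrative placeholders, not part of this skill):

```python
import requests
from bs4 import BeautifulSoup

def fetch_static(url: str) -> list[dict]:
    """Fetch a static HTML page and extract link titles/URLs."""
    headers = {"User-Agent": "data-base-bot/1.0 (+contact: you@example.com)"}  # assumed identifier
    resp = requests.get(url, headers=headers, timeout=15)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    # The selector is a placeholder; adjust it for the target site's markup.
    return [
        {"title": a.get_text(strip=True), "url": a["href"]}
        for a in soup.select("article h2 a[href]")
    ]
```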
## Anti-Patterns (NEVER)
- Don't scrape without checking robots.txt
- Don't overload servers (default: 1 req/sec)
- Don't scrape personal data without consent
- Don't use Chinese characters in output filenames (ASCII only)
- Don't forget to identify your bot with a clear User-Agent string (see the sketch after this list)
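A minimal sketch of these guardrails in Python, assuming a 1 req/sec default and an illustrative bot identifier (neither value is mandated beyond the defaults above):

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

USER_AGENT = "data-base-bot/1.0 (+contact: you@example.com)"  # assumed identifier

def polite_get(url: str, delay: float = 1.0) -> requests.Response | None:
    """GET a URL only if robots.txt allows it, throttled to ~1 request/sec."""
    parsed = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()  # in real use, cache the parsed robots.txt per host instead of re-reading
    if not rp.can_fetch(USER_AGENT, url):
        return None  # robots.txt disallows this path

    time.sleep(delay)  # conservative default rate: 1 req/sec
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=15)
```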
## Output Format
- **JSON**: Nested/hierarchical data
- **CSV**: Tabular data
- Filename: `{source}_{timestamp}.{ext}` (ASCII only, e.g., `news_20250115.csv`)
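A sketch of how the naming convention could be enforced when writing results (the `save_records` helper is hypothetical):

```python
import csv
import json
import re
from datetime import datetime
from pathlib import Path

def save_records(records: list[dict], source: str, fmt: str = "csv") -> Path:
    """Write records as {source}_{timestamp}.{ext} with an ASCII-only filename."""
    if not records:
        raise ValueError("no records to save")
    safe_source = re.sub(r"[^A-Za-z0-9_-]", "_", source)  # keep the filename ASCII-only
    stamp = datetime.now().strftime("%Y%m%d")              # e.g. 20250115
    path = Path(f"{safe_source}_{stamp}.{fmt}")

    if fmt == "json":
        path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    else:
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
            writer.writeheader()
            writer.writerows(records)
    return path
```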
## Workflow
1. **Ask**: What data? Which sites? How much?
2. **Select tool** based on page type
3. **Extract** and save structured data
4. **Deliver** file path to user or pass to data-analysis
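As a usage sketch, steps 2-4 for a static page can be composed from the hypothetical helpers above (`fetch_static`, `save_records`); the URL and source label are placeholders:

```python
def run(url: str, source: str) -> None:
    """Workflow steps 2-4 for a static page: select tool, extract, deliver."""
    records = fetch_static(url)                # extract with requests + BeautifulSoup
    out_path = save_records(records, source)   # save structured CSV output
    print(f"Saved {len(records)} records to {out_path}")  # deliver the file path

run("https://example.com/news", "news")
```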
## Python Environment
**Auto-initialize virtual environment if needed, then execute:**
```bash
cd skills/data-base
if [ ! -f ".venv/bin/python" ]; then
  echo "Creating Python environment..."
  ./setup.sh
fi
.venv/bin/python your_script.py
```
The setup script auto-installs requests, beautifulsoup4, pandas, and related web-scraping tools.
## References (load on demand)
For detailed APIs and templates, load: `references/REFERENCE.md`, `references/templates.md`
## How It Works

This skill performs data acquisition for web scraping and structured data collection, converting unstructured web pages into JSON or CSV outputs ready for analysis. Use it when you need reliable scraping workflows for both static and JavaScript-driven sites.

The skill asks what data and which sites you need, then chooses a tool based on page complexity: static pages use HTTP requests with HTML parsing, dynamic pages use a Chrome DevTools-backed browser controller, and complex pipelines run as Python scripts. The extractor respects robots.txt, rate limits, and a configurable User-Agent, then writes `{source}_{timestamp}.json` or `.csv` with ASCII-only filenames.

## FAQ

**What output formats are supported?**
JSON for hierarchical data and CSV for tabular exports; filenames follow `{source}_{timestamp}.{ext}`.

**How are JavaScript-rendered pages handled?**
A Chrome DevTools-backed controller renders and interacts with the page before extraction.

**Are there built-in safety limits?**
Yes. The default behavior respects robots.txt, uses a conservative request rate, and sets a clear User-Agent.