---
name: scraping
description: Fetch web pages, parse HTML with CSS selectors, call REST APIs, and scrape dynamic content. Use when extracting data from websites, querying JSON APIs, or automating browser interactions.
---
# Scraping
Web scraping techniques using nu-shell (nu) and browser tools. Use this skill for quick data extraction from websites, HTML parsing, API interactions, and dynamic content scraping.
## Prerequisites
- nu-shell installed (`nu`)
- `query web` plugin registered (for HTML scraping): the plugin ships as the `nu_plugin_query` binary, so register that binary with `nu -c "plugin add /path/to/nu_plugin_query"` (see the quick check after this list)
- Browser extension enabled (for dynamic content): Enable the `browser` extension in your agent configuration
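A quick way to confirm the setup works end to end (a minimal sketch; the plugin path is an example, adjust it to wherever `nu_plugin_query` is installed):
```bash
# Register the plugin once (point the path at your nu_plugin_query binary)
nu -c 'plugin add ~/.cargo/bin/nu_plugin_query'
# Confirm the plugin is registered
nu -c 'plugin list | where name == "query"'
# Smoke test: pull the page title from a static page
nu -c 'http get https://example.com | query web -q "title" | flatten'
```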
## Common Tasks
### Fetching Web Pages
Use `http get` to retrieve HTML content:
```bash
# Simple GET request
nu -c 'http get https://example.com'
# With headers
nu -c 'http get -H [User-Agent "My Scraper"] https://example.com'
```
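When you need the status code or response headers rather than just the body, `--full` returns the whole response as a record and `--allow-errors` keeps 4xx/5xx responses from aborting the pipeline (a small sketch; the URLs are placeholders):
```bash
# Return the full response (status, headers, body) instead of only the body
nu -c 'http get --full https://example.com | get status'
# Inspect a failing endpoint without aborting
nu -c 'http get --full --allow-errors https://example.com/missing | get status'
```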
### HTML Parsing and Data Extraction
Use the `query web` plugin to parse HTML and extract data using CSS selectors:
```bash
# Extract text from elements
nu -c 'http get https://example.com | query web -q "h1, h2" | flatten | str trim'
# Extract attributes
nu -c 'http get https://example.com | query web -q "a" -a href'
# Parse tables as structured data
nu -c 'http get https://example.com/table-page | query web --as-table ["Column1" "Column2"]'
```
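Because the plugin returns ordinary Nushell values, scraped data can be reshaped and exported like any other structured data (a sketch; the URL and column names are placeholders):
```bash
# Export a scraped table to CSV
nu -c 'http get https://example.com/table-page | query web --as-table ["Column1" "Column2"] | to csv | save -f results.csv'
# Count how many elements matched a selector
nu -c 'http get https://example.com | query web -q "a" | length'
```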
### Browser-Based Scraping for Dynamic Content
For websites requiring JavaScript execution or complex DOM interactions, use browser automation tools.
```bash
# Start browser
start-browser
# Navigate to page
navigate-browser --url https://example.com
# Extract data with JavaScript evaluation
evaluate-javascript --code "Array.from(document.querySelectorAll('selector')).map(e => e.textContent)"
# Screenshot for visual inspection
take-screenshot
# Query HTML fragments
query-html-elements --selector ".content"
```
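A typical end-to-end flow for a page that only renders after client-side JavaScript might look like this sketch; the tool names come from the Related Tools list below, but the flag names on `wait-for-element`, `type-text`, `click-element`, and `extract-text` are assumptions and may differ in your agent configuration.
```bash
# Hypothetical flow; flag names beyond --url/--code/--selector are assumptions
start-browser
navigate-browser --url https://app.example.com
wait-for-element --selector "#search"
type-text --selector "#search" --text "nushell"
click-element --selector "button[type=submit]"
wait-for-element --selector ".results"
extract-text --selector ".results .item"
```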
### API Interactions
For JSON APIs, use `http get`. Responses served with a JSON content type are parsed into structured data automatically; reach for `from json` only when the API returns JSON as plain text:
```bash
# GET a JSON API (responses with a JSON content type are parsed automatically)
nu -c 'http get https://api.example.com/data'
# Parse manually only if the server returns JSON as plain text
nu -c 'http get https://api.example.com/data | from json'
# POST requests
nu -c 'http post https://api.example.com/submit -t application/json {key: value}'
```
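Many APIs paginate their results; a range plus `each` walks the pages and `flatten` merges them (a sketch; the endpoint and `page` query parameter are placeholders):
```bash
# Fetch pages 1-5 and merge the results into one list
nu -c '1..5 | each { |page| http get $"https://api.example.com/items?page=($page)" } | flatten'
```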
### Handling Authentication
```bash
# Basic auth
nu -c 'http get -u username -p password https://api.example.com'
# Bearer token
nu -c 'http get -H [Authorization "Bearer YOUR_TOKEN"] https://api.example.com'
# Custom headers
nu -c 'http get -H [X-API-Key "YOUR_KEY" User-Agent "Scraper"] https://api.example.com'
```
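Rather than hard-coding secrets on the command line, credentials can be read from the environment (a sketch; `API_TOKEN` is an assumed variable name):
```bash
# Pull the bearer token from an environment variable
nu -c 'http get -H [Authorization $"Bearer ($env.API_TOKEN)"] https://api.example.com'
```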
### Rate Limiting and Delays
```bash
# Pause before each fetch so requests are spaced out and the results are still collected
nu -c '$urls | each { |url| sleep 1sec; http get $url }'
```
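For flaky endpoints, `try`/`catch` plus an increasing delay gives a simple retry with backoff (a sketch; the helper name `fetch-retry` and the URL are illustrative):
```bash
# Retry a request up to 3 times, waiting longer after each failure
nu -c 'def fetch-retry [url: string] {
  for attempt in 1..3 {
    try { return (http get $url) } catch { sleep ($attempt * 1sec) }
  }
}; fetch-retry "https://example.com"'
```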
### Parallel Processing
```bash
# Scrape multiple pages in parallel
nu -c '$urls | par-each { |url| http get $url | query web -q ".data" }'
```
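By default `par-each` picks its own level of parallelism; the `--threads` flag caps it so the target site is not hammered (a sketch; as above, `$urls` stands for a previously defined list of URLs):
```bash
# Limit scraping to 4 concurrent requests
nu -c '$urls | par-each --threads 4 { |url| http get $url | query web -q ".data" }'
```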
## One-liner Examples
### Basic HTML Scraping
```bash
# Extract all h1 titles
nu -c 'http get https://example.com | query web -q "h1"'
# Get all links
nu -c 'http get https://example.com | query web -q "a" -a href'
# Scrape product prices
nu -c 'http get https://store.example.com | query web -q ".price"'
```
### HTML Scraping Example: Hacker News
```bash
# Scrape HN front page titles and URLs
nu -c 'let page = (http get https://news.ycombinator.com/); ($page | query web -q ".titleline > a" | flatten) | zip ($page | query web -q ".titleline > a" -a href) | each { |pair| $"($pair.0) - ($pair.1)" }'
```
For static sites like HN, use `http get` directly. Reserve browser tools for dynamic content requiring JavaScript execution.
### GitHub Stars Scraper
```bash
# Get star count for a repo
nu -c 'http get https://api.github.com/repos/nushell/nushell | get stargazers_count'
```
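The same call scales to a small comparison table across repositories (a sketch; the repo list is illustrative, and unauthenticated GitHub API requests are rate-limited):
```bash
# Star counts for several repos as a table
nu -c '["nushell/nushell" "nushell/nu_scripts"] | each { |repo| { repo: $repo, stars: (http get $"https://api.github.com/repos/($repo)" | get stargazers_count) } }'
```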
### API Data Extraction
```bash
# Fetch a JSON API and extract a field (the response is parsed automatically)
nu -c 'http get https://api.example.com/users | get 0.name'
```
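To pull several fields at once and hand them off elsewhere, `select` and `to csv` work directly on the parsed response (a sketch; the endpoint and field names are placeholders):
```bash
# Keep only two fields and print them as CSV
nu -c 'http get https://api.example.com/users | select name email | to csv'
```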
### API Authentication
```bash
# Bearer token
nu -c 'http get -H [Authorization "Bearer YOUR_TOKEN"] https://api.example.com/data'
# API key
nu -c 'http get -H [X-API-Key "YOUR_API_KEY"] https://api.example.com/data'
# Basic auth
nu -c 'http get -u username -p password https://api.example.com/protected'
```
## Related Skills
- **nu-shell**: Core nu-shell scripting patterns and commands.
## Related Tools
- **start-browser**: Start Cromite browser via Puppeteer.
- **navigate-browser**: Navigate to a URL in the browser.
- **evaluate-javascript**: Evaluate JavaScript code in the active browser tab.
- **take-screenshot**: Take a screenshot of the active browser tab.
- **query-html-elements**: Extract HTML elements by CSS selector.
- **list-browser-tabs**: List all open browser tabs with their titles and URLs.
- **close-tab**: Close a browser tab by index or title.
- **switch-tab**: Switch to a specific tab by index.
- **refresh-tab**: Refresh the current tab.
- **current-url**: Get the URL of the current active tab.
- **page-title**: Get the title of the current active tab.
- **wait-for-element**: Wait for a CSS selector to appear on the page.
- **click-element**: Click on an element by CSS selector.
- **type-text**: Type text into an input field.
- **extract-text**: Extract text content from elements by CSS selector.
- **search-web**: Perform web searches and extract information from search results.