
setup-scheduled-scraper skill

This skill scaffolds a local scheduled-scraper project built on Playwright and Next.js, with an optional read-only results viewer and macOS launchd automation.

npx playbooks add skill sawyerh/agents --skill setup-scheduled-scraper

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: setup-scheduled-scraper
description: Set up a local Playwright + Next.js scraper project with scheduled runs and an optional web viewer. Use when asked to build a project that scrapes specific URLs on a schedule, stores results locally (JSON), and optionally provides a Next.js App Router + Tailwind + Shadcn data viewer. Assumes macOS launchd automation by default.
---

# Setup Scheduled Scraper

## Overview

Build a local, scheduled scraper that runs via Playwright and writes JSON results, with an optional Next.js viewer for tables/charts. Default stack: TypeScript, Playwright test runner, Next.js App Router, Tailwind v4, Shadcn UI, and launchd scheduling.

## Workflow

1. Intake the request (read `references/intake.md`).
2. Scaffold the project (Next.js app + Playwright + TypeScript).
3. Implement the scraper pipeline (URLs -> parsed data -> JSON).
4. Add the optional viewer (read-only).
5. Add scheduling + logging with launchd.
6. Verify manual run, schedule, and viewer.
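
The npm scripts referenced in the Verification section (`npm run scrape`, `npm run scrape:ui`, `npm run dev`) might be wired up roughly like this in `package.json`; the exact Playwright flags and spec path are assumptions, not part of the skill:

```json
{
  "scripts": {
    "scrape": "playwright test src/scrape.spec.ts --reporter=line",
    "scrape:ui": "playwright test src/scrape.spec.ts --ui",
    "dev": "next dev"
  }
}
```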

## Example Project Structure

```
project/
├── src/
│   ├── app/
│   │   ├── layout.tsx            # Next.js root layout
│   │   └── page.tsx              # Viewer entry page
│   ├── launchd/
│   │   ├── com.example.scraper.plist       # LaunchAgent schedule
│   │   └── com.example.scraper-wake.plist  # LaunchDaemon wake helper
│   ├── lib/                      # Viewer helpers
│   ├── scraper.ts                # Playwright entry (called by test spec)
│   └── scrape.spec.ts            # Playwright spec that invokes scraper
├── scripts/
│   ├── clear_logs.sh             # Clears scheduler logs
│   ├── run_playwright_daily.sh   # Scheduled wrapper (logs + npm run scrape)
│   ├── update-schedule.sh        # Updates launchd schedule times
│   └── schedule-wakes.sh         # Optional pmset wake scheduling
├── results.json                  # Scheduled output (read-only)
├── results-local.json            # Manual run output
├── scraper-metadata.json         # Run metadata
├── package.json
├── tsconfig.json
└── README.md
```
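
A minimal sketch of the record-building step inside `scraper.ts`. The field names match the example JSON further down; the `toRecord` helper itself is hypothetical:

```typescript
// Shape of one scraped record, matching the fields in results.json.
interface ScrapeRecord {
  url: string;
  title: string;
  game_start_time: string; // ISO 8601 with offset
  scraped_at: string;      // ISO 8601
}

// Hypothetical helper: build a record from values the Playwright spec
// extracts off the page. Only the field names come from the example
// output; everything else here is illustrative.
function toRecord(
  url: string,
  title: string,
  startTime: string,
  now: Date
): ScrapeRecord {
  return {
    url,
    title: title.trim(),
    game_start_time: startTime,
    scraped_at: now.toISOString(),
  };
}
```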

### Example system architecture

```mermaid
flowchart TB
  LaunchDaemon["LaunchDaemon (wake scheduler)"] --> pmset["pmset wake events"]
  pmset --> LaunchAgent
  LaunchAgent["LaunchAgent (twice daily)"] --> Runner["scripts/run_playwright_daily.sh"]
  Runner --> Scrape["npm run scrape (Playwright)"]
  Scrape --> Results["results.json"]
  Scrape --> Metadata["scraper-metadata.json"]
  Results --> Viewer["Next.js web app"]
  Metadata --> Viewer
  Viewer --> Vercel
  Vercel --> User
```

## Data conventions

- Use `results.json` for scheduled runs; use `results-local.json` for manual runs.
- Support overriding the output path via `SCRAPE_RESULTS_PATH`.
- Store run metadata in `scraper-metadata.json` (timestamp, counts, errors).
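
These conventions can be centralized in one small resolver. `SCRAPE_RESULTS_PATH` comes from the skill; the `SCHEDULED` flag name is an assumption for how the wrapper script might mark scheduled runs:

```typescript
// Pick the output file for this run. SCRAPE_RESULTS_PATH wins when set;
// otherwise scheduled runs write results.json and manual runs write
// results-local.json. The SCHEDULED flag is illustrative -- the wrapper
// script would export it before invoking npm run scrape.
function resolveResultsPath(env: Record<string, string | undefined>): string {
  if (env.SCRAPE_RESULTS_PATH) return env.SCRAPE_RESULTS_PATH;
  return env.SCHEDULED === "1" ? "results.json" : "results-local.json";
}
```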

## Example JSON

results.json (array of records):

```json
[
  {
    "url": "https://example.com/scoreboard/some-unique-id",
    "title": "Knicks at Lakers",
    "game_start_time": "2026-02-01T19:00:00-08:00",
    "scraped_at": "2026-02-01T07:00:12-08:00"
  },
  {
    "url": "https://example.com/scoreboard/some-unique-id-2",
    "title": "Bucks at Warriors",
    "game_start_time": "2026-02-01T21:30:00-08:00",
    "scraped_at": "2026-02-01T07:00:12-08:00"
  }
]
```

scraper-metadata.json:

```json
{ "last_scraped_at": "2026-02-01T07:00:12-08:00" }
```
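
The data conventions mention counts and errors in the metadata, while the example above shows only the timestamp. A hedged sketch of a fuller metadata builder; every field beyond `last_scraped_at` is illustrative:

```typescript
interface RunMetadata {
  last_scraped_at: string;
  record_count: number; // illustrative; not in the example output
  errors: string[];     // illustrative; e.g. messages from pages that failed
}

// Summarize a run before writing scraper-metadata.json.
function buildMetadata(
  records: unknown[],
  errors: string[],
  now: Date
): RunMetadata {
  return {
    last_scraped_at: now.toISOString(),
    record_count: records.length,
    errors,
  };
}
```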

## Scheduling (macOS launchd)

- Start from the shell scripts in `scripts/` and customize them for the project (`PROJECT_SLUG`, paths, labels).
- Use a LaunchAgent to run the wrapper script at scheduled times.
- Keep the LaunchAgent plist in the repo and **symlink** it into `~/Library/LaunchAgents`.
- If the user wants wake-from-sleep, add a LaunchDaemon + `pmset schedule wakeorpoweron` helper.
- For wake scheduling, copy the LaunchDaemon plist into `/Library/LaunchDaemons` (not a symlink) and set ownership to `root:wheel`.
- Provide an `update-schedule.sh` helper to edit `StartCalendarInterval` with two daily times. If more than two times are needed, ask before expanding the schedule logic.
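
A sketch of the LaunchAgent plist with two daily `StartCalendarInterval` entries; the label, script path, and log paths are placeholders to customize per project:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.example.scraper</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/path/to/project/scripts/run_playwright_daily.sh</string>
  </array>
  <!-- Two daily runs, 07:00 and 19:00; update-schedule.sh rewrites these. -->
  <key>StartCalendarInterval</key>
  <array>
    <dict>
      <key>Hour</key><integer>7</integer>
      <key>Minute</key><integer>0</integer>
    </dict>
    <dict>
      <key>Hour</key><integer>19</integer>
      <key>Minute</key><integer>0</integer>
    </dict>
  </array>
  <key>StandardOutPath</key>
  <string>/Users/you/Library/Logs/project.out.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/you/Library/Logs/project.err.log</string>
</dict>
</plist>
```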

## Multi-project notes

- Ensure each project has a unique LaunchAgent label and plist filename.
- Use distinct log file paths per project.
- If using a wake LaunchDaemon, give it a unique label and owner tag.

## Viewer guidelines

- Use Next.js App Router and keep the UI read-only.
- Prefer Shadcn components and Tailwind defaults; avoid extra overrides.
- Derive filtered subsets once, then compute metrics/views from those subsets.
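
The "derive filtered subsets once" guideline might look like this in a viewer helper: filter and sort once, then compute every metric and view from the result. The record fields mirror the example JSON; the function name is illustrative:

```typescript
interface GameRecord {
  title: string;
  game_start_time: string; // ISO 8601 with offset
}

// Derive the filtered, sorted subset once; downstream components
// compute their metrics/views from this array instead of re-filtering.
function upcomingGames(records: GameRecord[], now: Date): GameRecord[] {
  return records
    .filter((r) => new Date(r.game_start_time).getTime() > now.getTime())
    .sort((a, b) => a.game_start_time.localeCompare(b.game_start_time));
}
```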

## Verification

- Manual run: `npm run scrape` (and `npm run scrape:ui` for Playwright UI).
- Viewer: `npm run dev`.
- Schedule checks: `launchctl list` and `pmset -g sched`.
- Logs: `tail -n 200 ~/Library/Logs/<project>.out.log ~/Library/Logs/<project>.err.log`.

## References

- `references/intake.md`
- `references/checklists.md`

Overview

This skill sets up a local Playwright + Next.js scraper project that runs on a schedule and stores results as JSON. It scaffolds a TypeScript scraper, launchd scheduling for macOS, logging scripts, and an optional Next.js App Router + Tailwind + Shadcn web viewer. The output is ready to run, with sensible defaults and helpers for customizing schedule times and output locations.

How this skill works

The skill scaffolds a project with a Playwright entry (scraper.ts) and a test spec that invokes it, writing structured records to results.json for scheduled runs and results-local.json for manual runs. It includes shell scripts to wrap Playwright runs, plist files for LaunchAgent/LaunchDaemon scheduling, optional pmset wake helpers, and a read-only Next.js viewer wired to the JSON output. Environment variables (e.g., SCRAPE_RESULTS_PATH) and metadata files (scraper-metadata.json) let you control output locations and track run details.

When to use it

  • You need a reproducible local scraper that runs on a schedule (macOS).
  • You want Playwright-based scraping with TypeScript and test-driven invocation.
  • You want scheduled JSON output and simple run metadata for downstream processing.
  • You want an optional read-only Next.js viewer for browsing results locally or on Vercel.
  • You need launchd automation, optionally with wake-from-sleep support.

Best practices

  • Keep project-specific labels, filenames, and log paths unique across machines to avoid plist collisions.
  • Use results.json for scheduled runs and results-local.json for manual troubleshooting; support SCRAPE_RESULTS_PATH to override paths.
  • Test scraping locally with npm run scrape before enabling the LaunchAgent; verify logs with tail and launchctl list.
  • Limit the LaunchAgent to two default daily times; ask before expanding to more intervals and use update-schedule.sh to edit StartCalendarInterval.
  • Keep the Next.js viewer read-only and compute derived views from filtered subsets to avoid heavy runtime processing.

Example use cases

  • Twice-daily scrape of public scoreboard pages; results.json drives a local dashboard and notifications.
  • Periodic extraction of product prices for comparison; scheduled runs append fresh snapshots and scraper-metadata.json stores timestamps and counts.
  • Nightly capture of event listings, with optional pmset wake scheduling to ensure the machine wakes before the scrape.
  • Local development setup where manual runs write results-local.json for debugging while scheduled runs write results.json for the viewer.

FAQ

Can I change the schedule times later?

Yes. Use the provided update-schedule.sh to modify the StartCalendarInterval values, then reload the LaunchAgent. For wake scheduling, update the LaunchDaemon and the pmset helper as needed.

How do I test the scraper without the schedule?

Run npm run scrape, or npm run scrape:ui for an interactive Playwright session; the output goes to results-local.json for debugging.