
dataforseo-onpage-api skill


This skill crawls and audits websites with the DataForSEO OnPage API, surfacing technical SEO issues, site health metrics, and Lighthouse performance insights.

npx playbooks add skill leonardo-picciani/dataforseo-agent-skills --skill dataforseo-onpage-api

Review the files below or copy the command above to add this skill to your agents.

Files (2): SKILL.md (4.8 KB)
---
name: dataforseo-onpage-api
description: Crawl and audit websites using DataForSEO OnPage (tasks + Lighthouse) for "technical SEO audit", "site crawl", and "page health".
license: MIT
metadata:
  author: Leonardo Picciani
  author_url: https://github.com/leonardo-picciani
  project: DataForSEO Agent Skills (Experimental)
  generated_with: OpenCode (agent runtime); OpenAI GPT-5.2
  version: 0.1.0
  experimental: 'true'
  docs: https://docs.dataforseo.com/v3/on_page/overview/
compatibility: Language-agnostic HTTP integration skill. Requires outbound network access to api.dataforseo.com and docs.dataforseo.com; uses HTTP Basic Auth.
---
# DataForSEO OnPage API

## Provenance

This is an experimental project to test how OpenCode, plugged into frontier LLMs (OpenAI GPT-5.2), can help generate high-fidelity agent skill files for API integrations.

## When to Apply

- "crawl a site", "technical SEO audit", "find broken links"
- "duplicate content", "duplicate tags", "non-indexable pages"
- "redirect chains", "waterfall performance", "resource issues"
- "run Lighthouse at scale", "performance audit"

## Integration Contract (Language-Agnostic)

See `references/REFERENCE.md` for the shared DataForSEO integration contract (auth, status handling, task lifecycle, sandbox, and .ai responses).


### Task-based Crawl Lifecycle

- Create a crawl via `task_post`.
- Wait for completion by polling `tasks_ready` (or use webhooks if supported in the endpoint docs).
- Fetch results via `summary`, `pages`, `resources`, and other specialized result endpoints.
- To stop an in-progress crawl, call `force_stop`.
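
A minimal bash sketch of this lifecycle, assuming `jq` is available and using the `target` and `max_crawl_pages` parameters from the task_post docs; treat the exact body shape, polling interval, and field paths as illustrative rather than authoritative:

```bash
# 1) Create a crawl task and capture its id from the tasks[] array.
TASK_ID=$(curl -s -u "${DATAFORSEO_LOGIN}:${DATAFORSEO_PASSWORD}" \
  -H "Content-Type: application/json" \
  -X POST "https://api.dataforseo.com/v3/on_page/task_post" \
  -d '[{"target": "example.com", "max_crawl_pages": 100}]' | jq -r '.tasks[0].id')

# 2) Poll tasks_ready until this task id shows up in the ready list.
until curl -s -u "${DATAFORSEO_LOGIN}:${DATAFORSEO_PASSWORD}" \
        "https://api.dataforseo.com/v3/on_page/tasks_ready" |
      jq -e --arg id "$TASK_ID" '.tasks[].result[]? | select(.id == $id)' >/dev/null; do
  sleep 30
done

# 3) Fetch the crawl summary for the finished task.
curl -s -u "${DATAFORSEO_LOGIN}:${DATAFORSEO_PASSWORD}" \
  "https://api.dataforseo.com/v3/on_page/summary/${TASK_ID}" | jq '.'
```
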
## Steps

1) Identify the exact endpoint(s) in the official docs for this use case.
2) Choose execution mode:
   - Live (single request) for interactive queries
   - Task-based (post + poll/webhook) for scheduled or high-volume jobs
3) Build the HTTP request:
   - Base URL: `https://api.dataforseo.com/`
   - Auth: HTTP Basic (`Authorization: Basic base64(login:password)`) from https://docs.dataforseo.com/v3/auth/
   - JSON body exactly as specified in the endpoint docs
4) Execute and validate the response:
   - Check top-level `status_code` and each `tasks[]` item status
   - Treat any `status_code != 20000` as a failure; surface `status_message` (see the validation sketch after this list)
5) For task-based endpoints:
   - Store `tasks[].id`
   - Poll `tasks_ready` then fetch results with `task_get` (or use `postback_url`/`pingback_url` if supported)
6) Return results:
   - Provide a normalized summary for the user
   - Include the raw response payload for debugging
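
A small validation sketch for step 4, assuming the response has been saved to `response.json` and `jq` is available; `status_code` and `status_message` are the documented top-level and per-task fields:

```bash
# Fail fast if the whole call was rejected (anything other than 20000).
TOP=$(jq -r '.status_code' response.json)
if [ "$TOP" != "20000" ]; then
  echo "API error ${TOP}: $(jq -r '.status_message' response.json)" >&2
  exit 1
fi

# Surface every tasks[] item that did not complete successfully.
jq -r '.tasks[] | select(.status_code != 20000)
  | "task \(.id) failed \(.status_code): \(.status_message)"' response.json
```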

## Inputs Checklist

- Credentials: DataForSEO API login + password (HTTP Basic Auth)
- Target: keyword(s) / domain(s) / URL(s) / query string (depends on endpoint)
- Targeting (if applicable): location + language, device, depth/limit
- Time window (if applicable): date range, trend period, historical flags
- Output preference: regular vs advanced vs html (if the endpoint supports it)
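
The cURL example below reads credentials from environment variables; a minimal setup (the variable names are the ones used in this skill's examples, not required by the API):

```bash
# Set once per shell session; the examples in this skill read these two variables.
export DATAFORSEO_LOGIN="your-dataforseo-login"
export DATAFORSEO_PASSWORD="your-dataforseo-password"
```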

## Example (cURL)

```bash
curl -u "${DATAFORSEO_LOGIN}:${DATAFORSEO_PASSWORD}"   -H "Content-Type: application/json"   -X POST "https://api.dataforseo.com/v3/<group>/<path>/live"   -d '[
    {
      "<param>": "<value>"
    }
  ]'
```

Notes:
- Replace `<group>/<path>` with the exact endpoint path from the official docs.
- For task-based flows, use the corresponding `task_post`, `tasks_ready`, and `task_get` endpoints.


## Docs Map (Official)

- Overview: https://docs.dataforseo.com/v3/on_page/overview/
- Task POST: https://docs.dataforseo.com/v3/on_page/task_post/
- Tasks Ready: https://docs.dataforseo.com/v3/on_page/tasks_ready/
- Summary: https://docs.dataforseo.com/v3/on_page/summary/
- Pages: https://docs.dataforseo.com/v3/on_page/pages/
- Resources: https://docs.dataforseo.com/v3/on_page/resources/
- Force Stop: https://docs.dataforseo.com/v3/on_page/force_stop/

Lighthouse:

- Overview: https://docs.dataforseo.com/v3/on_page/lighthouse/overview/

## Business & Product Use Cases

- Build a technical SEO crawler product (issues, prioritization, exports).
- Monitor site health after deployments (regressions: redirects, noindex, broken links).
- Create an automated QA gate for SEO before release.
- Performance improvements: integrate Lighthouse outputs into engineering backlogs.
- Agency deliverables: repeatable site audits with remediation tracking.
- Template validation: verify titles/meta/canonicals across large sites.

## Examples (User Prompts)

- "If you don't have the skill installed, install `dataforseo-onpage-api` and then continue."
- "Install the OnPage skill and crawl our site to find broken links, redirect chains, and non-indexable pages."
- "Run a technical SEO audit and produce a prioritized fix list."
- "Check for duplicate titles/meta across the site and summarize affected URLs."
- "Generate a report of slow pages with waterfall diagnostics and top heavy resources."
- "Run Lighthouse checks for these 50 URLs and highlight regressions."

Overview

This skill connects to the DataForSEO OnPage API to crawl and audit sites for technical SEO, site health, and Lighthouse performance metrics. It runs task-based crawls or live checks, normalizes results, and returns both a prioritized summary and raw payloads for debugging. Use it to detect broken links, redirect chains, non-indexable pages, duplicate tags, and performance regressions.

How this skill works

I create crawls via the OnPage task_post endpoint or run single live requests when appropriate. I poll tasks_ready (or accept webhooks) until tasks complete, then fetch summary, pages, resources, and Lighthouse results. Responses are validated (status_code checks) and normalized into a concise issue list with severity and URL context, and the raw API payload is kept for troubleshooting. For in-progress work, I can stop crawls on demand via force_stop.

When to use it

  • Run a full site crawl after a deployment to catch regressions (redirects, noindex, broken links).
  • Audit a site for duplicate titles, meta tags, or canonical issues across many pages.
  • Generate Lighthouse performance reports at scale for a set of URLs and detect regressions.
  • Create an automated QA gate for SEO before release or scheduled site-health monitoring.
  • Prioritize remediation work by combining technical issues with Lighthouse performance metrics.

Best practices

  • Use task-based mode for large sites and scheduled monitoring; use live mode for single-URL checks.
  • Provide HTTP Basic credentials (login:password) and verify access to the target domain before running large crawls.
  • Limit crawl depth and page limits during initial runs to avoid unnecessary load.
  • Store tasks[].id and poll tasks_ready or configure postback_url to avoid long synchronous waits.
  • Always surface the raw API payload with the normalized summary for developer troubleshooting.

Example use cases

  • Crawl an entire domain to list broken links, 4xx/5xx responses, and redirect chains.
  • Run Lighthouse on 50 high-priority pages and export the slowest resources and opportunity suggestions.
  • Detect non-indexable pages (noindex, disallowed by robots) after a migration and produce a prioritized fix list.
  • Scan for duplicate titles/meta across product pages and output CSV-ready exports for content teams.
  • Integrate Lighthouse metrics into an engineering backlog to track performance improvements over time.

FAQ

What credentials are required?

HTTP Basic Auth (login:password) as specified by DataForSEO; include them in the Authorization header.
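
With curl, the `-u` flag builds the header automatically; constructing it by hand is standard Basic auth, nothing DataForSEO-specific:

```bash
# Equivalent to curl -u "login:password"; tr removes line wrapping some base64 builds add.
AUTH="Basic $(printf '%s' "${DATAFORSEO_LOGIN}:${DATAFORSEO_PASSWORD}" | base64 | tr -d '\n')"
curl -H "Authorization: ${AUTH}" "https://api.dataforseo.com/v3/on_page/tasks_ready"
```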

How do I stop a running crawl?

Call the OnPage force_stop endpoint with the task id to terminate an in-progress crawl.
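
A minimal sketch, assuming force_stop accepts the task id in a JSON array body like the other v3 task endpoints (verify the exact body shape against the force_stop docs):

```bash
# Stop an in-progress crawl; TASK_ID is the id returned by task_post.
curl -u "${DATAFORSEO_LOGIN}:${DATAFORSEO_PASSWORD}" \
  -H "Content-Type: application/json" \
  -X POST "https://api.dataforseo.com/v3/on_page/force_stop" \
  -d "[{\"id\": \"${TASK_ID}\"}]"
```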