This skill reads long documents in segments, summarizes in real time, and builds a traceable knowledge base for scalable understanding.
`npx playbooks add skill inclusionai/aworld --skill read_large_webpage`
---
name: read large webpage or knowledge
description: This skill supports segmented reading and organization when working with large knowledge bases or web pages. It captures the original content segment by segment, summarizes key points in real time, and continuously writes them into the knowledge base, keeping ingestion orderly, structure clear, and every point traceable.
tool_list: {"ms-playwright": []}
active: True
type: agent
---
### 🧠 Knowledge Base
- **Target Scenarios**: Reading long technical documents, research reports, policy documents, web encyclopedias, etc.
- **Core Capabilities**: Segment-based retrieval of original text, real-time summarization, and knowledge network construction.
- **Supporting Tools**: `get_knowledge_by_lines` (segment-by-segment reading), `add_knowledge` (incremental summary writing).
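The tool signatures are not specified in this document; below is a minimal sketch of how the two calls might look, with every parameter name and return shape assumed for illustration:

```python
# Hypothetical signatures for the two supporting tools. Parameter names
# and return types are assumptions, not the tools' actual API.

def get_knowledge_by_lines(resource_id: str, start_line: int, num_lines: int) -> str:
    """Return the raw text of `num_lines` lines starting at `start_line` (1-indexed)."""
    ...

def add_knowledge(summary: str, metadata: dict) -> str:
    """Store a refined summary with provenance metadata and return its knowledge ID."""
    ...
```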
### 📥 Input Specification
Before starting to read, clarify the following:
1. The identifier of the knowledge resource to be read (e.g., URL, document ID, file path).
2. The number of lines or paragraph size to pull each time.
3. The current question or topic of focus, to maintain focus during summarization.
4. Output format requirements (paragraph summaries, bullet points, continuous records, etc.).
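As a sketch, these four inputs can be captured in one session record before reading begins (the class and field names are illustrative, not part of the skill):

```python
from dataclasses import dataclass

@dataclass
class ReadingSession:
    resource_id: str                       # URL, document ID, or file path
    segment_size: int = 100                # lines to pull per call
    focus_topic: str = ""                  # question or theme guiding summarization
    output_format: str = "bullet_points"   # e.g. "paragraph", "bullet_points", "log"
```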
### 🛠️ Processing Pipeline
1. **Locate Range**: Determine the starting line number and reading length based on user input, and record offsets when necessary for continuation.
2. **Segment-by-Segment Reading**: Call `get_knowledge_by_lines` to pull the original content of the specified range. If the content is too long, schedule it in multiple batches and record the remaining unread ranges.
3. **Real-Time Analysis**: Extract the key points from each pulled segment, annotating keywords, critical information, potential issues, and notable data.
4. **Knowledge Deposition**: Write the refined key points into the knowledge base via `add_knowledge`, together with source line numbers, timestamps, or context descriptions, so the store stays structured.
5. **Iterative Progress**: Repeat steps 2-4 until the entire text is read or the user-defined target depth is reached, while maintaining progress indices for recovery.
6. **Global Review**: At periodic checkpoints, merge the stored summaries, generate an overall context map or synthesis, and identify missing information.
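A minimal sketch of steps 2-5 as a single loop, reusing the hypothetical signatures and `ReadingSession` sketched above; the summarizer is passed in as a placeholder for the agent's real-time analysis:

```python
def read_in_segments(session: ReadingSession, total_lines: int, summarize) -> list[str]:
    """Read a resource segment by segment, depositing one summary per segment.

    `summarize` is a callable (raw_text, focus_topic) -> str standing in for
    the real-time analysis of step 3.
    """
    knowledge_ids = []
    offset = 1  # 1-indexed progress index; persist it to allow recovery (step 5)
    while offset <= total_lines:
        end = min(offset + session.segment_size - 1, total_lines)
        raw = get_knowledge_by_lines(session.resource_id, offset, session.segment_size)
        summary = summarize(raw, session.focus_topic)            # step 3
        kid = add_knowledge(summary, metadata={                  # step 4
            "source": session.resource_id,
            "line_range": (offset, end),
        })
        knowledge_ids.append(kid)
        offset = end + 1
    return knowledge_ids
```

A periodic global review (step 6) would then merge the summaries behind the returned IDs into a higher-level context map.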
### 🔁 Iterative Tips
- If cross-segment comparison is needed, preserve the original fragment IDs for traceability (see the sketch after this list).
- For key concepts, call additional reasoning skills to verify or expand on them.
- Record unanswered questions in the summaries and prioritize them when the reading continues.
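For the first tip, one way to make fragments traceable is a deterministic ID derived from the source and line range, so the same fragment always resolves to the same key. This is a sketch; the hashing scheme is an assumption, not part of the skill:

```python
import hashlib

def fragment_id(resource_id: str, start_line: int, end_line: int) -> str:
    """Deterministic fragment ID: identical source and range give an identical ID."""
    key = f"{resource_id}:{start_line}-{end_line}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
```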
### 📤 Output Template
```
📈 Reading Progress
- Source: ...
- Range: Line ... - ...
- Remaining: ...
📌 Summary Points
- Point 1: ...
- Point 2: ...
- Point 3: ...
🧾 Stored Knowledge
- Knowledge ID: ...
- Summary: ...
- Reference: ...
⚠️ Pending Issues
- ...
```
### โ
Output Checklist
- Is the reading range and remaining progress accurately annotated?
- Does the summary cover key information and context?
- Have key points been promptly written to the knowledge base and linked to sources?
- Have unresolved issues or parts requiring in-depth exploration been recorded?

This skill performs disciplined, segment-by-segment reading of large web pages or knowledge bases and turns raw text into structured, traceable knowledge entries. It captures original fragments, produces real-time summaries, and incrementally writes refined points into a knowledge store to maintain order and recoverable progress. The workflow supports iterative reads, cross-segment traceability, and periodic global reviews.
Before reading, the skill records the resource identifier, desired segment size, focus topic, and output format. It fetches text ranges via `get_knowledge_by_lines`, extracts and annotates key points in real time, and stores summaries with context using `add_knowledge`. Progress indices, source line numbers, timestamps, and fragment IDs are kept to enable continuation, cross-segment comparison, and traceability. At checkpoints it merges partial summaries into higher-level context maps and flags unresolved questions for follow-up.
**How do I choose the right segment size?**
Balance fetch speed against summary clarity: smaller segments (50–200 lines) keep summaries focused and reduce per-segment processing time; larger segments can work for tightly coherent sections but risk losing precision.
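As a rough sketch of that trade-off (the thresholds are illustrative, not prescribed by the skill):

```python
def pick_segment_size(avg_line_length: float, tightly_coherent: bool = False) -> int:
    """Map coarse document traits to a segment size in lines."""
    if tightly_coherent:          # long, self-contained sections tolerate big pulls
        return 200
    if avg_line_length > 120:     # dense lines carry more content each
        return 50
    return 100                    # middle-ground default
```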
**What metadata should be stored with each summary?**
Store the source identifier, line range, fragment ID, timestamp, summary text, key keywords, and any flagged issues, so every entry has provenance and is easy to retrieve.
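A sketch of such an entry as a plain dictionary (field names are illustrative; the `fragment_id` value assumes the helper sketched earlier):

```python
from datetime import datetime, timezone

entry = {
    "source": "https://example.com/long-report",   # resource identifier (hypothetical URL)
    "line_range": (101, 200),
    "fragment_id": "a3f9c1d2e4b5",                 # e.g. from fragment_id() above
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "summary": "Section 3 defines the evaluation protocol ...",
    "keywords": ["evaluation", "protocol"],
    "issues": ["units in Table 2 unclear"],
}
```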