This skill enables rapid comprehension of large Python codebases by guiding goal-driven reading, architecture mapping, and safe, incremental modifications.
Add this skill to your agents with `npx playbooks add skill trotsky1997/my-claude-agent-skills --skill codebase-reading`, or review the files below.
---
name: codebase-reading
description: Systematic methodology for reading and understanding large codebases efficiently. Use when (1) Understanding a new or unfamiliar codebase quickly, (2) Preparing to modify or extend existing code safely, (3) Debugging complex issues requiring deep code understanding, (4) Onboarding new team members to a codebase, (5) Performing code audits or security reviews, (6) Refactoring legacy code with confidence, (7) Creating documentation for existing systems, (8) Tracing execution flows and data transformations
metadata:
  short-description: Efficiently read and understand large codebases
---
# Large Codebase Reading Methodology
A systematic approach to understanding large codebases efficiently: **read to modify safely, not to memorize everything**.
## Core Principle
**Goal-oriented reading**: Always start with a concrete objective (fix bug, add feature, debug issue, security audit). Without a goal, you'll get lost in details.
**Success criteria**: You can explain the execution flow and modify code confidently, not that you've read every file.
## Documentation Structure
When documenting your understanding, use this structure:
```
code-reading/
├── README.md # Index and progress tracking
├── code-reading.md # Main framework (methodology + findings + terminology)
├── architecture.md # C4 architecture map (Level 1-4)
├── api-flow.md # Execution flow tracing
└── key-modules.md # Detailed module analysis
```
**Progressive disclosure:**
- README: Overview and navigation
- Main doc: Methodology and high-level findings
- Specialized docs: Detailed analysis (loaded only when needed)
**Terminology section:**
- Include a dedicated "Terminology" or "Key Concepts and Terminology" section in `code-reading.md` (e.g., section 9)
- Build incrementally from day 1; don't wait until everything is understood
- Cross-reference terms throughout documentation (architecture, flows, modules)
- Maintain as a living document - update as understanding deepens
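**Scaffolding sketch:** The layout above can be created in one step. This is a minimal sketch, not part of the methodology itself; the function name `scaffold` and the placeholder text written into each file are assumptions made for illustration.

```python
from __future__ import annotations

from pathlib import Path

# File names and one-line purposes, copied from the layout above.
DOCS = {
    "README.md": "Index and progress tracking",
    "code-reading.md": "Main framework (methodology + findings + terminology)",
    "architecture.md": "C4 architecture map",
    "api-flow.md": "Execution flow tracing",
    "key-modules.md": "Detailed module analysis",
}


def scaffold(root: Path) -> list[Path]:
    """Create code-reading/ under root and return the files that were added.

    Existing files are left untouched so notes are never overwritten.
    """
    target = root / "code-reading"
    target.mkdir(parents=True, exist_ok=True)
    created = []
    for name, purpose in DOCS.items():
        path = target / name
        if not path.exists():
            path.write_text(f"# {name}\n\n{purpose}\n", encoding="utf-8")
            created.append(path)
    return created
```

Calling `scaffold(Path("."))` a second time returns an empty list, which makes it safe to re-run while notes are in progress.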
## Quick Start
### Step 1: Define Your Goal
Start with a concrete, one-sentence objective:
- ✅ Good: "Trace HTTP request from route handler to database and back"
- ✅ Good: "Understand how authentication tokens are validated and refreshed"
- ❌ Bad: "Understand the entire codebase"
### Step 2: Run the Project First
**First hour priority:**
1. Read README - project purpose, setup, minimal example
2. Read CONTRIBUTING/development guide - testing, structure, workflow
3. Run a minimal path - local build, one test, or demo
**Why:** Running code transforms reading from guessing to verification.
### Step 3: Map the Architecture (C4 Thinking)
Draw a rough map from far to near (no need to be perfect):
**Level 1: System Context** - Who uses it? What external dependencies?
**Level 2: Containers** - What are the deployable units? (services, databases, workers)
**Level 3: Components** - What are the key components within one container?
> You don't need beautiful diagrams - just enough to navigate: **where's the entry point, how does data flow, where are the boundaries?**
### Step 4: Trace a Real Request Path
Instead of reading modules, **read one execution path**:
- Web: route → controller → service → repository → external API
- CLI: main → command parser → execution path
- Async: consumer → handler → processing → ack/retry
**Strongly recommended:** Use a debugger, add logging, or set breakpoints to walk through the path once.
### Step 5: Treat Tests as Executable Documentation
Read existing tests to see how components are meant to be used. If the code is old or hard to test, write **characterization tests** first: record current behavior, then refactor under their protection.
## Core Workflow
### 1. Goal-Oriented Setup
**Define your reading task:**
Write a one-sentence goal that describes what you want to achieve:
- "I can explain the request flow from HTTP to database"
- "I can safely modify the authentication module"
- "I can trace the OCR pipeline from image input to result output"
**Avoid:** Reading without purpose or starting from directory trees.
### 2. Project Bootstrap
**Priority order (first hour):**
1. **README.md** - What does it do? How to start? Minimal example?
2. **CONTRIBUTING.md / docs/** - How to test? Code structure? Branch strategy?
3. **Run minimal path** - Can you build it? Run one test? Execute a demo?
**Verify:** Required assets exist (e.g., model files), tests run, examples work.
### 3. Architecture Mapping (C4 Model)
**Level 1: System Context**
- Users/applications that interact with the system
- External dependencies (databases, APIs, services)
- System boundaries
**Level 2: Containers**
- Deployable units (frontend, API server, workers, databases, queues)
- Communication between containers
**Level 3: Components**
- Key components within a container (auth service, domain service, repository, adapter)
- Component relationships and dependencies
**Output:** A rough map that answers:
- Where's the entry point?
- How does data flow?
- Where are the boundaries?
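**Dependency sketch:** For a Python codebase, a first draft of "component relationships and dependencies" can be read off the import statements. The sketch below uses the standard `ast` module; the function names and the three-module example project are hypothetical, and real projects will need package-aware path handling that is omitted here.

```python
from __future__ import annotations

import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level names imported by a Python source string.

    Relative imports (from . import x) are skipped in this sketch.
    """
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def dependency_map(sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each module to the other modules in `sources` that it imports."""
    return {
        name: {m for m in imported_modules(text) if m in sources and m != name}
        for name, text in sources.items()
    }


# Hypothetical project, inlined so the sketch runs on its own.
example = {
    "api": "import service\nimport json\n",
    "service": "from repository import load\n",
    "repository": "import sqlite3\n",
}

for module, deps in dependency_map(example).items():
    print(module, "->", sorted(deps) or "(no internal deps)")
```

Modules with no internal dependencies tend to sit at a boundary (storage, external APIs), and modules nothing else imports are candidate entry points.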
### 4. Entry Point + Request Tracing
**Find the entry point:**
- Web: `main()`, route handlers, controllers
- CLI: `main()`, command parsers
- Library: public API functions, constructors
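**Entry-point sketch:** In Python projects, scripts usually mark their entry point with an `if __name__ == "__main__":` guard. The sketch below finds those guards with the standard `ast` module, which avoids false hits in comments or strings. The function names are hypothetical, and frameworks that register routes through decorators need a different search.

```python
from __future__ import annotations

import ast
from pathlib import Path


def has_main_guard(source: str) -> bool:
    """True if the module has a top-level `if __name__ == "__main__":`."""
    for node in ast.parse(source).body:
        if not isinstance(node, ast.If):
            continue
        test = node.test
        if (
            isinstance(test, ast.Compare)
            and isinstance(test.left, ast.Name)
            and test.left.id == "__name__"
            and len(test.ops) == 1
            and isinstance(test.ops[0], ast.Eq)
            and isinstance(test.comparators[0], ast.Constant)
            and test.comparators[0].value == "__main__"
        ):
            return True
    return False


def find_entry_points(root: Path) -> list[Path]:
    """Return the .py files under root that contain a main guard."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        try:
            if has_main_guard(path.read_text(encoding="utf-8")):
                hits.append(path)
        except (SyntaxError, ValueError):
            continue  # skip files that do not decode or parse
    return hits
```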
**Trace one complete path:**
- Use debugger to step through
- Add logging at key points
- Set breakpoints to verify understanding
**Document the flow:** Write down the execution path as you trace it.
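**Logging sketch:** One lightweight way to "add logging at key points" in Python is a decorator that logs entry and exit with indentation by call depth. This is an illustrative sketch, assuming single-threaded code; the logger name `trace` and the three example functions are hypothetical stand-ins for a route, a service, and a repository.

```python
import functools
import logging

logger = logging.getLogger("trace")  # hypothetical logger name
_state = {"depth": 0}  # current nesting level; single-threaded use only


def traced(func):
    """Log entry and exit of `func`, indented by call depth."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("%s-> %s", "  " * _state["depth"], func.__name__)
        _state["depth"] += 1
        try:
            return func(*args, **kwargs)
        finally:
            _state["depth"] -= 1
            logger.info("%s<- %s", "  " * _state["depth"], func.__name__)

    return wrapper


# Hypothetical route -> service -> repository path.
@traced
def repository_get(user_id):
    return {"id": user_id}


@traced
def service_get(user_id):
    return repository_get(user_id)


@traced
def handle_request(user_id):
    return service_get(user_id)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    handle_request(42)
```

The indented log is close to what belongs in `api-flow.md`: copy it, then annotate each step with the file and the data passed along.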
### 5. Tests as Documentation
**Existing tests:**
- Read tests to understand expected behavior
- Tests show how components are used
- Tests document edge cases and error handling
**Missing tests:**
- Write characterization tests (record current behavior)
- Use tests as a safety net before modifications
- Tests become the regression suite
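**Characterization test sketch:** A characterization test asserts what the code does today, not what it should do. In the sketch below, `legacy_price` is a hypothetical stand-in for undocumented legacy code; the expected values are whatever the current implementation returns, recorded once and then frozen. The test function is plain Python, so it runs under `pytest` or by calling it directly.

```python
def legacy_price(quantity, unit_price):
    """Hypothetical legacy function whose rules are undocumented."""
    total = quantity * unit_price
    if quantity >= 10:
        total = total * 0.9
    return round(total, 2)


def test_legacy_price_characterization():
    # Recorded by running the current code, not derived from a spec.
    recorded = {
        (0, 5.0): 0.0,
        (1, 9.99): 9.99,
        (9, 2.0): 18.0,
        (10, 2.0): 18.0,  # a discount appears to start at quantity 10
    }
    for args, expected in recorded.items():
        assert legacy_price(*args) == expected, args
```

If a recorded value looks like a bug, keep it in the test anyway and note it; changing behavior is a separate, deliberate step after the safety net exists.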
### 6. Git Archaeology
**When you see "weird code":**
Don't dismiss it immediately - ask: **What historical problem is this solving?**
**Commands:**
```bash
git blame -w -- path/to/file # Who changed this and when?
git log -p -- path/to/file # Full change history
git log --grep="keyword" # Find related commits
git show <commit-hash> # View specific change
```
**What to look for:**
- Commit messages explaining why
- PR discussions showing trade-offs
- Major refactors showing architecture evolution
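**Change-frequency sketch:** To decide which files deserve a closer `git log -p` first, it can help to count how often each path has changed. This goes slightly beyond the commands above and is offered as an optional aid; the function names are hypothetical. It relies on `git log --name-only --pretty=format:`, which prints only the changed paths for each commit.

```python
import subprocess
from collections import Counter


def count_changes(name_only_log: str) -> Counter:
    """Count how often each path appears in `git log --name-only` output."""
    return Counter(
        line.strip() for line in name_only_log.splitlines() if line.strip()
    )


def most_changed(repo_dir=".", top=10):
    """Return the `top` most frequently changed paths in a Git repository."""
    out = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return count_changes(out).most_common(top)
```

Frequently changed files are often where the "weird code" lives, so their commit messages and PR discussions are a good place to start reading.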
### 7. Use Tools Efficiently
**Code search:**
```bash
rg "keyword" -n . # Ripgrep (faster than grep)
git grep "keyword" # Search only files tracked by Git
rg "main\(" . # Find entry points
rg "TODO|FIXME" . # Find todos
```
**Git tools:**
```bash
git log --graph --oneline --all # Visual history
git log --follow -- path/to/file # File rename tracking
git diff HEAD~5 HEAD # Compare versions
```
**Language-specific tools:**
- Rust: `cargo tree`, `cargo clippy`, `cargo doc`
- Python: `pytest`, `mypy`, `pylint`
- JavaScript: `npm list`, `eslint`, `tsc`
### 8. Build and Maintain Terminology Glossary
**Critical for understanding:** Build a living terminology glossary from day 1, not as an afterthought.
**Why it matters:**
- Domain-specific terms (OCR: detection, recognition, CTC, NMS)
- Acronyms that need expansion (CLS, DB, IOU)
- Project-specific abstractions (custom types, internal concepts)
- Confusion between similar concepts (Orientation vs CLS, Detection vs Recognition)
**When to build:**
- **Initial collection**: During first reading pass (Steps 1-4) - collect terms as you encounter them
- **Deep dive enrichment**: During detailed module analysis (Step 10 of the full methodology in `references/methodology.md`) - add technical details and context
- **Continuous maintenance**: Update as understanding deepens, add cross-references, refine definitions
**What to include:**
- **Definition**: Clear explanation of what the term means
- **Context**: Where and how it's used in the codebase
- **Code references**: File paths, function names, line numbers
- **Relationships**: Related terms, parent/child concepts, synonyms
- **Examples**: Usage examples, code snippets when helpful
**Organization:**
- Classify by domain (OCR terms, ML terms, project-specific)
- Group by abstraction level (concepts, algorithms, data structures, implementation)
- Cross-reference to architecture diagrams, API flows, module analysis
**Example structure:**
```markdown
## Terminology Glossary
### Domain-Specific Terms
- **Detection (检测)**: Locating text regions in images
  - **Implementation**: `src/det.rs`
  - **Related**: Recognition, NMS, DB
  - **See**: `api-flow.md` section 4.1
### Data Structures
- **`Mat`**: Image matrix abstraction
  - **Implementation**: `src/image_impl.rs`
  - **Purpose**: Unified image representation for Pure Rust and OpenCV backends
### Algorithms
- **NMS (Non-Maximum Suppression)**: Algorithm for filtering overlapping boxes
  - **Implementation**: `src/geometry.rs::nms()`
  - **Related**: IOU (Intersection over Union)
```
**Tools for term extraction:**
```bash
# Extract acronyms (all caps, 2-5 chars); -o prints only the match, -I omits file names
rg -o -I "\b[A-Z]{2,5}\b" README.md docs/ | sort -u
# Extract struct/enum/trait names
rg -o -I "^(pub )?(struct|enum|trait|type) \w+" src/ | sort -u
# Find config-related terms
rg "Config.*\{|struct \w+Config" src/
```
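**Glossary stub sketch:** The acronym search can also feed the glossary directly. The sketch below applies the same `\b[A-Z]{2,5}\b` pattern in Python and emits Markdown entries for acronyms that are not yet defined. The function name `glossary_stub`, the `known` parameter, and the sample sentence are assumptions for illustration.

```python
import re
from collections import Counter

ACRONYM = re.compile(r"\b[A-Z]{2,5}\b")  # all caps, 2-5 characters


def glossary_stub(text, known=()):
    """Return Markdown glossary entries for acronyms not listed in `known`."""
    known = set(known)
    counts = Counter(ACRONYM.findall(text))
    lines = [
        f"- **{term}**: TODO definition (seen {n}x)"
        for term, n in sorted(counts.items())
        if term not in known
    ]
    return "\n".join(lines)


# Hypothetical sample text; in practice, read README.md and docs/ instead.
sample = "NMS filters boxes by IOU. CTC decodes text; NMS runs after DB."
print(glossary_stub(sample, known={"DB"}))
```

The occurrence count is a rough priority signal: terms seen most often are usually worth defining first.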
**Maintenance checklist:**
- [ ] All acronyms have expansions
- [ ] All domain terms have clear definitions
- [ ] All key data structures are documented
- [ ] Cross-references to code are accurate
- [ ] Related terms are linked
- [ ] Examples provided for complex terms
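**Reference check sketch:** The "cross-references to code are accurate" item can be partly automated. The sketch below pulls backticked paths such as `src/det.rs` or `src/geometry.rs::nms()` out of a glossary and reports those that no longer exist. The regular expression and the function name are assumptions: only references containing a `/` are treated as paths, and only the file part before `::` is checked, not the symbol.

```python
from __future__ import annotations

import re
from pathlib import Path

# Backticked relative paths with at least one directory component,
# optionally followed by a `::symbol` suffix that is ignored here.
PATH_REF = re.compile(r"`((?:[\w.-]+/)+[\w.-]+\.\w+)(?:::[^`]*)?`")


def broken_references(glossary_md: str, repo_root: Path) -> list[str]:
    """Return referenced paths that do not exist under repo_root."""
    refs = sorted(set(PATH_REF.findall(glossary_md)))
    return [ref for ref in refs if not (repo_root / ref).exists()]
```

Running this after a refactor gives a short list of glossary entries to revisit.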
## Common Patterns
### Pattern 1: Reading Without a Goal
**Problem:** Reading files alphabetically or by directory structure.
**Solution:** Start with a concrete task. Even if it's "understand how X works," make it specific.
### Pattern 2: Getting Lost in Details
**Problem:** Deep-diving into every module before understanding the flow.
**Solution:** First trace one complete execution path. Then dive into specific modules as needed.
### Pattern 3: Ignoring Tests
**Problem:** Treating tests as optional or skipping them.
**Solution:** Tests are the most reliable documentation. Read them first. If missing, write characterization tests.
### Pattern 4: Not Using Git History
**Problem:** Seeing "weird code" and assuming it's wrong.
**Solution:** Use `git blame` and `git log` to understand context. Code exists for reasons, even if not obvious.
### Pattern 5: Ignoring Terminology (Acronyms and Domain Terms)
**Problem:** Encountering acronyms (CTC, CLS, DB, NMS) or domain terms (Detection vs Recognition) and assuming you'll remember what they mean later. Repeatedly looking up the same terms.
**Solution:** Build a terminology glossary from day 1. Collect terms as you encounter them, enrich with context as understanding deepens. Cross-reference terms throughout documentation. Treat it as a living document, not a one-time task.
## Success Indicators
You've successfully understood a codebase when:
✅ You can explain the main execution flow from entry to exit
✅ You can locate where specific functionality is implemented
✅ You can trace data transformations through the system
✅ You can identify where to make changes for your goal
✅ You can explain architectural decisions (even if you'd do it differently)
✅ You have a comprehensive terminology glossary that helps navigate the codebase
✅ You can explain any domain-specific term in one sentence
✅ You rarely need to re-lookup the same term
✅ You feel confident modifying code without breaking things
**Not required:**
- ❌ Having read every file
- ❌ Memorizing every function
- ❌ Understanding every detail
## Troubleshooting
**"I don't know where to start"**
→ Define a concrete goal. Even "understand how user authentication works" is better than "understand everything."
**"The codebase is too large"**
→ Focus on your goal. Trace one execution path. Ignore unrelated modules.
**"I can't find the entry point"**
→ Look for `main()`, route definitions, or public API functions. Use code search tools.
**"The code doesn't make sense"**
→ Use Git history to understand why it's written this way. Check tests for usage examples.
**"There are no tests"**
→ Write characterization tests. Record current behavior before modifying.
**"I keep forgetting what CTC/CLS/DB means"**
→ Build a terminology glossary. Start early, collect terms as you encounter them. Include definitions, context, code references, and relationships. Cross-reference throughout documentation.
**"The domain terms are confusing"**
→ Don't just look them up once. Document them with context, usage examples, and relationships to other terms. Classify by domain and abstraction level. Update as understanding deepens.
## References
For detailed methodology and examples, see:
- [methodology.md](references/methodology.md) - Complete 11-step methodology with detailed explanations
- [tool-checklist.md](references/tool-checklist.md) - Comprehensive tool checklist by category (search, Git, language-specific, debugging)
- [terminology-building.md](references/terminology-building.md) - **Complete guide to building and maintaining terminology glossaries** (5-step process, classification methods, best practices, common patterns)
This skill teaches a systematic, goal-driven approach to reading and understanding large codebases efficiently. It emphasizes reading to modify safely, tracing real execution paths, and building a living terminology glossary. The outcome is confident, verifiable understanding rather than memorizing every file.
Start with a single concrete objective and run a minimal path to convert guesses into facts. Map architecture at several levels (system, containers, components), then trace one real request or execution path using a debugger or logging. Capture findings in lightweight documents: overall index, main methodology, architecture map, execution flows, and key-module analyses while maintaining an evolving glossary of domain terms.
**"What if I can't find the entry point?"**
→ Search for common entry symbols (main, route definitions, CLI parsers) and use code search tools; if still unclear, run the project and observe where execution begins.
**"How do I handle undocumented domain acronyms?"**
→ Collect them immediately into the terminology glossary with definitions, code references, and examples; update the glossary as understanding deepens.