
This skill generates and hand-writes lexers using DFA-based, table-driven, and hand-written recursive approaches to speed up language tooling.

npx playbooks add skill a5c-ai/babysitter --skill lexer-generator


SKILL.md
---
name: Lexer Generator
description: Expert skill for generating and hand-writing lexers using DFA-based, table-driven, and recursive approaches
category: Compiler Frontend
allowed-tools:
  - Read
  - Write
  - Edit
  - Glob
  - Grep
  - Bash
---

# Lexer Generator Skill

## Overview

Expert skill for generating and hand-writing lexers using DFA-based, table-driven, and hand-written recursive approaches.

## Capabilities

- Generate lexer from regular expression specifications
- Implement maximal munch tokenization
- Handle Unicode character classes and normalization
- Implement efficient keyword recognition (tries, perfect hashing)
- Support incremental/resumable lexing for IDE integration
- Generate lexer tables and state machines
- Handle lexer modes and contexts (e.g., string interpolation)
- Implement error recovery with skip-to-next strategies
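
The maximal-munch capability above can be sketched in a few lines. This is a minimal illustration, not the skill's generated output; the token names, patterns, and priorities are invented for the example:

```javascript
// Sketch: maximal-munch tokenization with priority tie-breaking.
// Token names and patterns are illustrative only.
const SPEC = [
  { name: "NUMBER", pattern: /^[0-9]+/, priority: 1 },
  { name: "IDENT", pattern: /^[A-Za-z_][A-Za-z0-9_]*/, priority: 1 },
  { name: "KEYWORD_IF", pattern: /^if/, priority: 2 }, // higher priority wins ties
  { name: "WS", pattern: /^[ \t\n]+/, priority: 0 },
];

function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let best = null;
    for (const rule of SPEC) {
      const m = rule.pattern.exec(input.slice(pos));
      if (!m) continue;
      const len = m[0].length;
      // Maximal munch: prefer the longest match; break ties by priority.
      if (!best || len > best.len || (len === best.len && rule.priority > best.priority)) {
        best = { name: rule.name, text: m[0], len, priority: rule.priority };
      }
    }
    if (!best) throw new Error(`No token matches at position ${pos}`);
    if (best.name !== "WS") tokens.push({ name: best.name, text: best.text });
    pos += best.len;
  }
  return tokens;
}
```

Note how `ifx` lexes as a single identifier (longest match) while a bare `if` resolves to the keyword (priority tie-break): this is exactly the ambiguity maximal munch exists to settle.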

## Target Processes

- lexer-implementation.js
- language-grammar-design.js
- lsp-server-implementation.js
- repl-development.js

## Dependencies

- Flex-like generators
- RE2/Hyperscan libraries

## Usage Guidelines

1. **Token Definition**: Start by defining the complete set of tokens with their regex patterns
2. **Maximal Munch**: Always implement maximal munch to handle ambiguous token boundaries
3. **Unicode Support**: Consider Unicode normalization forms and character classes from the start
4. **Error Recovery**: Implement skip-to-next-valid strategies for robust error handling
5. **Performance**: Use table-driven approaches for large token sets, hand-written for simple lexers
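
As one way to realize the keyword-recognition capability, here is a minimal trie sketch. The keyword list is illustrative; a generated lexer might instead use perfect hashing:

```javascript
// Sketch: keyword recognition with a trie, so the identifier-vs-keyword
// check walks one node per character instead of hashing the whole string.
function buildTrie(keywords) {
  const root = {};
  for (const kw of keywords) {
    let node = root;
    for (const ch of kw) node = node[ch] ?? (node[ch] = {});
    node.end = true; // marks a complete keyword
  }
  return root;
}

function isKeyword(trie, word) {
  let node = trie;
  for (const ch of word) {
    node = node[ch];
    if (!node) return false;
  }
  return node.end === true;
}

const trie = buildTrie(["if", "else", "while", "return"]);
```

In a real lexer this check runs once per identifier token, after the maximal-munch scan has already determined the token's extent.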

## Output Schema

```json
{
  "type": "object",
  "properties": {
    "tokens": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "pattern": { "type": "string" },
          "priority": { "type": "integer" }
        }
      }
    },
    "lexerType": {
      "type": "string",
      "enum": ["dfa", "table-driven", "hand-written"]
    },
    "generatedFiles": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
```
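
For illustration, a hypothetical instance conforming to this schema (token names and file names invented) could look like:

```json
{
  "tokens": [
    { "name": "IDENT", "pattern": "[A-Za-z_][A-Za-z0-9_]*", "priority": 1 },
    { "name": "NUMBER", "pattern": "[0-9]+", "priority": 1 }
  ],
  "lexerType": "table-driven",
  "generatedFiles": ["lexer.js", "lexer-tables.js"]
}
```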

Overview

This skill generates and hand-writes lexers using DFA-based, table-driven, and recursive techniques. It produces deterministic, high-performance tokenizers with support for maximal munch, Unicode handling, lexer modes, and resumable lexing for IDEs. The skill is focused on practical, production-ready implementations suitable for compilers, language servers, and REPLs. Outputs include token schemas, lexer tables, and ready-to-use JavaScript source files.

How this skill works

Provide a complete token specification using regular expressions and priorities; the skill converts these into DFA state machines or table-driven tables, or emits a hand-written recursive lexer when simpler logic is preferable. It implements maximal munch by resolving ambiguities with longest-match and priority rules, and supports Unicode normalization and character classes. For performance it can generate compact transition tables, keyword tries, or perfect-hash based matchers. It also emits integration helpers for incremental/resumable lexing and error-recovery strategies like skip-to-next.
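
The table-driven form described above can be sketched for a tiny two-token language. The character classes, state numbering, and token set here are invented for the example; a generated table would be much larger and typically compressed:

```javascript
// Sketch: table-driven DFA for NUMBER = [0-9]+ and IDENT = [a-z]+.
// Characters are first mapped to small class indices, then the DFA steps
// through a transition table, remembering the last accepting position.
const CLASS = (ch) =>
  ch >= "0" && ch <= "9" ? 0 : // digit
  ch >= "a" && ch <= "z" ? 1 : // letter
  2;                           // anything else

// transition[state][class] -> next state (-1 = no transition)
// state 0 = start, 1 = in NUMBER, 2 = in IDENT
const TRANSITION = [
  [1, 2, -1],
  [1, -1, -1],
  [-1, 2, -1],
];
const ACCEPT = { 1: "NUMBER", 2: "IDENT" };

function nextToken(input, pos) {
  let state = 0, lastAccept = -1, lastPos = pos;
  for (let i = pos; i < input.length; i++) {
    const next = TRANSITION[state][CLASS(input[i])];
    if (next === -1) break;
    state = next;
    if (ACCEPT[state]) { lastAccept = state; lastPos = i + 1; }
  }
  if (lastAccept === -1) return null;
  return { name: ACCEPT[lastAccept], text: input.slice(pos, lastPos), end: lastPos };
}
```

Tracking the last accepting position (rather than the current one) is what lets the DFA back off correctly when a longer match attempt dead-ends, which is the table-driven realization of maximal munch.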

When to use it

  • Building a compiler front-end where deterministic performance and minimal memory are required
  • Implementing an LSP server or IDE plugin that must support incremental lexing and fast edits
  • Creating a domain-specific language with complex token rules, interpolation, or nested modes
  • Prototyping a new language grammar and needing both generated and hand-written lexer options
  • Optimizing tokenization for large token sets using table-driven or DFA approaches

Best practices

  • Start with a complete, ordered token list including explicit priorities to avoid ambiguity
  • Always enforce maximal munch to resolve overlapping patterns consistently
  • Design Unicode handling from the start: choose normalization form and include Unicode classes
  • Use table-driven DFAs for many tokens; prefer hand-written lexers for small, mode-heavy grammars
  • Implement skip-to-next and sync points for robust error recovery in interactive tools
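
The skip-to-next recovery practice can be sketched as follows. The rule set is illustrative, and this sketch uses first-match rather than full maximal munch to keep the recovery logic in focus:

```javascript
// Sketch: skip-to-next error recovery. When no rule matches, collect the
// bad span into a single ERROR token and resume at the next character that
// can start a valid token (the sync point).
const RULES = [
  { name: "NUMBER", pattern: /^[0-9]+/ },
  { name: "IDENT", pattern: /^[A-Za-z_]\w*/ },
  { name: "WS", pattern: /^\s+/ },
];

function canStartToken(rest) {
  return RULES.some((r) => r.pattern.test(rest));
}

function tokenizeWithRecovery(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    const rest = input.slice(pos);
    const rule = RULES.find((r) => r.pattern.test(rest));
    if (rule) {
      const text = rule.pattern.exec(rest)[0];
      if (rule.name !== "WS") tokens.push({ name: rule.name, text });
      pos += text.length;
    } else {
      // Recovery: scan forward to the next possible token start.
      let end = pos + 1;
      while (end < input.length && !canStartToken(input.slice(end))) end++;
      tokens.push({ name: "ERROR", text: input.slice(pos, end) });
      pos = end;
    }
  }
  return tokens;
}
```

Emitting an explicit ERROR token, instead of throwing, is what keeps downstream consumers such as parsers and syntax highlighters running over broken input.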

Example use cases

  • Generate a DFA-based JavaScript lexer for a compiled language with thousands of token patterns
  • Emit a table-driven lexer plus transition tables for embedding in a small, fast runtime
  • Write a recursive, hand-crafted lexer for a templating language with nested interpolation
  • Add resumable lexing to an LSP server so syntax highlighting updates quickly on edits
  • Create keyword recognition using tries or perfect hashing for fast identifier vs keyword checks

FAQ

Which lexer type should I choose for a new language?

Use a table-driven or DFA-based lexer when you have many tokens and strict performance needs; choose a hand-written recursive lexer when the grammar includes complex mode switches or when implementation simplicity matters.
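
As a sketch of the hand-written style and why it suits mode switches, here is an invented fragment that lexes a double-quoted string with `${ }` interpolation holes; the token shapes and recovery behavior are illustrative only:

```javascript
// Sketch: hand-written lexer fragment for a string-with-interpolation mode.
// Explicit control flow makes the mode switch (text vs. embedded expression)
// easy to express, which is awkward in a pure regex/table specification.
function lexString(input, pos) {
  const parts = [];
  let i = pos + 1; // skip the opening quote
  let chunk = "";
  while (i < input.length && input[i] !== '"') {
    if (input[i] === "$" && input[i + 1] === "{") {
      if (chunk) { parts.push({ type: "TEXT", text: chunk }); chunk = ""; }
      const close = input.indexOf("}", i + 2);
      if (close === -1) throw new Error("unterminated interpolation");
      // In a full lexer the hole's contents would be re-lexed in normal mode.
      parts.push({ type: "INTERP", expr: input.slice(i + 2, close) });
      i = close + 1;
    } else {
      chunk += input[i++];
    }
  }
  if (chunk) parts.push({ type: "TEXT", text: chunk });
  return { parts, end: i + 1 }; // end = position after the closing quote
}
```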

How is Unicode handled?

The skill supports Unicode character classes and normalization; pick a normalization form early and include Unicode-aware regexes or character-range handling in the token specs.
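
In JavaScript specifically, both halves of that advice map onto standard features: `String.prototype.normalize` for the normalization form, and Unicode property escapes (with the `u` regex flag) for the character classes. A minimal sketch, assuming NFC and Unicode-style identifiers:

```javascript
// Sketch: Unicode-aware identifier matching. Normalize to NFC first so
// composed and decomposed forms tokenize identically, then match using
// Unicode property escapes (requires the `u` flag).
const IDENT = /^[\p{ID_Start}_][\p{ID_Continue}]*/u;

function matchIdent(input) {
  const normalized = input.normalize("NFC");
  const m = IDENT.exec(normalized);
  return m ? m[0] : null;
}
```

With this, `e` followed by a combining acute accent (U+0301) and the precomposed `é` (U+00E9) lex to the same identifier, which is the practical payoff of picking a normalization form early.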