
This skill generates and hand-writes lexers using DFA-based, table-driven, and hand-written recursive approaches to speed up language tooling.

npx playbooks add skill a5c-ai/babysitter --skill lexer-generator


SKILL.md
---
name: Lexer Generator
description: Expert skill for generating and hand-writing lexers using DFA-based, table-driven, and recursive approaches
category: Compiler Frontend
allowed-tools:
  - Read
  - Write
  - Edit
  - Glob
  - Grep
  - Bash
---

# Lexer Generator Skill

## Overview

Expert skill for generating and hand-writing lexers using DFA-based, table-driven, and hand-written recursive approaches.

## Capabilities

- Generate lexer from regular expression specifications
- Implement maximal munch tokenization
- Handle Unicode character classes and normalization
- Implement efficient keyword recognition (tries, perfect hashing)
- Support incremental/resumable lexing for IDE integration
- Generate lexer tables and state machines
- Handle lexer modes and contexts (e.g., string interpolation)
- Implement error recovery with skip-to-next strategies
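
The maximal-munch capability above can be sketched in a few lines. This is a minimal illustration, not the skill's generated output; the token names, patterns, and priorities are invented for the example:

```javascript
// Sketch: maximal-munch tokenization with priority tie-breaking.
// Token names and patterns are illustrative only.
const SPEC = [
  { name: "NUMBER", pattern: /^[0-9]+/, priority: 1 },
  { name: "IDENT", pattern: /^[A-Za-z_][A-Za-z0-9_]*/, priority: 1 },
  { name: "KEYWORD_IF", pattern: /^if/, priority: 2 }, // higher priority wins ties
  { name: "WS", pattern: /^[ \t\n]+/, priority: 0 },
];

function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let best = null;
    for (const rule of SPEC) {
      const m = rule.pattern.exec(input.slice(pos));
      if (!m) continue;
      const len = m[0].length;
      // Maximal munch: prefer the longest match; break ties by priority.
      if (!best || len > best.len || (len === best.len && rule.priority > best.priority)) {
        best = { name: rule.name, text: m[0], len, priority: rule.priority };
      }
    }
    if (!best) throw new Error(`No token matches at position ${pos}`);
    if (best.name !== "WS") tokens.push({ name: best.name, text: best.text });
    pos += best.len;
  }
  return tokens;
}
```

Note how `ifx` lexes as a single identifier (longest match) while a bare `if` resolves to the keyword (priority tie-break): this is exactly the ambiguity maximal munch exists to settle.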

## Target Processes

- lexer-implementation.js
- language-grammar-design.js
- lsp-server-implementation.js
- repl-development.js

## Dependencies

- Flex-like generators
- RE2/Hyperscan libraries

## Usage Guidelines

1. **Token Definition**: Start by defining the complete set of tokens with their regex patterns
2. **Maximal Munch**: Always implement maximal munch to handle ambiguous token boundaries
3. **Unicode Support**: Consider Unicode normalization forms and character classes from the start
4. **Error Recovery**: Implement skip-to-next-valid strategies for robust error handling
5. **Performance**: Use table-driven approaches for large token sets, hand-written for simple lexers
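
As one way to realize the keyword-recognition capability, here is a minimal trie sketch. The keyword list is illustrative; a generated lexer might instead use perfect hashing:

```javascript
// Sketch: keyword recognition with a trie, so the identifier-vs-keyword
// check walks one node per character instead of hashing the whole string.
function buildTrie(keywords) {
  const root = {};
  for (const kw of keywords) {
    let node = root;
    for (const ch of kw) node = node[ch] ?? (node[ch] = {});
    node.end = true; // marks a complete keyword
  }
  return root;
}

function isKeyword(trie, word) {
  let node = trie;
  for (const ch of word) {
    node = node[ch];
    if (!node) return false;
  }
  return node.end === true;
}

const trie = buildTrie(["if", "else", "while", "return"]);
```

In a real lexer this check runs once per identifier token, after the maximal-munch scan has already determined the token's extent.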

## Output Schema

```json
{
  "type": "object",
  "properties": {
    "tokens": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "pattern": { "type": "string" },
          "priority": { "type": "integer" }
        }
      }
    },
    "lexerType": {
      "type": "string",
      "enum": ["dfa", "table-driven", "hand-written"]
    },
    "generatedFiles": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
```
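
For illustration, a hypothetical instance conforming to this schema (token names and file names invented) could look like:

```json
{
  "tokens": [
    { "name": "IDENT", "pattern": "[A-Za-z_][A-Za-z0-9_]*", "priority": 1 },
    { "name": "NUMBER", "pattern": "[0-9]+", "priority": 1 }
  ],
  "lexerType": "table-driven",
  "generatedFiles": ["lexer.js", "lexer-tables.js"]
}
```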

Overview

This skill generates and hand-writes lexers using DFA-based, table-driven, and recursive techniques. It produces deterministic, high-performance tokenizers with support for maximal munch, Unicode handling, lexer modes, and resumable lexing for IDEs. The skill is focused on practical, production-ready implementations suitable for compilers, language servers, and REPLs. Outputs include token schemas, lexer tables, and ready-to-use JavaScript source files.

How this skill works

Provide a complete token specification using regular expressions and priorities; the skill converts these into DFA state machines or table-driven tables, or emits a hand-written recursive lexer when simpler logic is preferable. It implements maximal munch by resolving ambiguities with longest-match and priority rules, and supports Unicode normalization and character classes. For performance it can generate compact transition tables, keyword tries, or perfect-hash based matchers. It also emits integration helpers for incremental/resumable lexing and error-recovery strategies like skip-to-next.
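
The table-driven form described above can be sketched for a tiny two-token language. The character classes, state numbering, and token set here are invented for the example; a generated table would be much larger and typically compressed:

```javascript
// Sketch: table-driven DFA for NUMBER = [0-9]+ and IDENT = [a-z]+.
// Characters are first mapped to small class indices, then the DFA steps
// through a transition table, remembering the last accepting position.
const CLASS = (ch) =>
  ch >= "0" && ch <= "9" ? 0 : // digit
  ch >= "a" && ch <= "z" ? 1 : // letter
  2;                           // anything else

// transition[state][class] -> next state (-1 = no transition)
// state 0 = start, 1 = in NUMBER, 2 = in IDENT
const TRANSITION = [
  [1, 2, -1],
  [1, -1, -1],
  [-1, 2, -1],
];
const ACCEPT = { 1: "NUMBER", 2: "IDENT" };

function nextToken(input, pos) {
  let state = 0, lastAccept = -1, lastPos = pos;
  for (let i = pos; i < input.length; i++) {
    const next = TRANSITION[state][CLASS(input[i])];
    if (next === -1) break;
    state = next;
    if (ACCEPT[state]) { lastAccept = state; lastPos = i + 1; }
  }
  if (lastAccept === -1) return null;
  return { name: ACCEPT[lastAccept], text: input.slice(pos, lastPos), end: lastPos };
}
```

Tracking the last accepting position (rather than the current one) is what lets the DFA back off correctly when a longer match attempt dead-ends, which is the table-driven realization of maximal munch.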

When to use it

  • Building a compiler front-end where deterministic performance and minimal memory are required
  • Implementing an LSP server or IDE plugin that must support incremental lexing and fast edits
  • Creating a domain-specific language with complex token rules, interpolation, or nested modes
  • Prototyping a new language grammar and needing both generated and hand-written lexer options
  • Optimizing tokenization for large token sets using table-driven or DFA approaches

Best practices

  • Start with a complete, ordered token list including explicit priorities to avoid ambiguity
  • Always enforce maximal munch to resolve overlapping patterns consistently
  • Design Unicode handling from the start: choose normalization form and include Unicode classes
  • Use table-driven DFAs for many tokens; prefer hand-written lexers for small, mode-heavy grammars
  • Implement skip-to-next and sync points for robust error recovery in interactive tools
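
The skip-to-next recovery practice can be sketched as follows. The rule set is illustrative, and this sketch uses first-match rather than full maximal munch to keep the recovery logic in focus:

```javascript
// Sketch: skip-to-next error recovery. When no rule matches, collect the
// bad span into a single ERROR token and resume at the next character that
// can start a valid token (the sync point).
const RULES = [
  { name: "NUMBER", pattern: /^[0-9]+/ },
  { name: "IDENT", pattern: /^[A-Za-z_]\w*/ },
  { name: "WS", pattern: /^\s+/ },
];

function canStartToken(rest) {
  return RULES.some((r) => r.pattern.test(rest));
}

function tokenizeWithRecovery(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    const rest = input.slice(pos);
    const rule = RULES.find((r) => r.pattern.test(rest));
    if (rule) {
      const text = rule.pattern.exec(rest)[0];
      if (rule.name !== "WS") tokens.push({ name: rule.name, text });
      pos += text.length;
    } else {
      // Recovery: scan forward to the next possible token start.
      let end = pos + 1;
      while (end < input.length && !canStartToken(input.slice(end))) end++;
      tokens.push({ name: "ERROR", text: input.slice(pos, end) });
      pos = end;
    }
  }
  return tokens;
}
```

Emitting an explicit ERROR token, instead of throwing, is what keeps downstream consumers such as parsers and syntax highlighters running over broken input.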

Example use cases

  • Generate a DFA-based JavaScript lexer for a compiled language with thousands of token patterns
  • Emit a table-driven lexer plus transition tables for embedding in a small, fast runtime
  • Write a recursive, hand-crafted lexer for a templating language with nested interpolation
  • Add resumable lexing to an LSP server so syntax highlighting updates quickly on edits
  • Create keyword recognition using tries or perfect hashing for fast identifier vs keyword checks

FAQ

Which lexer type should I choose for a new language?

Use a table-driven or DFA-based lexer when you have many tokens and strict performance needs; choose a hand-written recursive lexer when the grammar includes complex mode switches or when implementation simplicity matters.
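
As a sketch of the hand-written style and why it suits mode switches, here is an invented fragment that lexes a double-quoted string with `${ }` interpolation holes; the token shapes and recovery behavior are illustrative only:

```javascript
// Sketch: hand-written lexer fragment for a string-with-interpolation mode.
// Explicit control flow makes the mode switch (text vs. embedded expression)
// easy to express, which is awkward in a pure regex/table specification.
function lexString(input, pos) {
  const parts = [];
  let i = pos + 1; // skip the opening quote
  let chunk = "";
  while (i < input.length && input[i] !== '"') {
    if (input[i] === "$" && input[i + 1] === "{") {
      if (chunk) { parts.push({ type: "TEXT", text: chunk }); chunk = ""; }
      const close = input.indexOf("}", i + 2);
      if (close === -1) throw new Error("unterminated interpolation");
      // In a full lexer the hole's contents would be re-lexed in normal mode.
      parts.push({ type: "INTERP", expr: input.slice(i + 2, close) });
      i = close + 1;
    } else {
      chunk += input[i++];
    }
  }
  if (chunk) parts.push({ type: "TEXT", text: chunk });
  return { parts, end: i + 1 }; // end = position after the closing quote
}
```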

How is Unicode handled?

The skill supports Unicode character classes and normalization; pick a normalization form early and include Unicode-aware regexes or character-range handling in the token specs.
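
In JavaScript specifically, both halves of that advice map onto standard features: `String.prototype.normalize` for the normalization form, and Unicode property escapes (with the `u` regex flag) for the character classes. A minimal sketch, assuming NFC and Unicode-style identifiers:

```javascript
// Sketch: Unicode-aware identifier matching. Normalize to NFC first so
// composed and decomposed forms tokenize identically, then match using
// Unicode property escapes (requires the `u` flag).
const IDENT = /^[\p{ID_Start}_][\p{ID_Continue}]*/u;

function matchIdent(input) {
  const normalized = input.normalize("NFC");
  const m = IDENT.exec(normalized);
  return m ? m[0] : null;
}
```

With this, `e` followed by a combining acute accent (U+0301) and the precomposed `é` (U+00E9) lex to the same identifier, which is the practical payoff of picking a normalization form early.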