
nlp-pipeline-builder skill

/plugins/specweave-ml/skills/nlp-pipeline-builder

This skill helps you build and optimize production NLP pipelines with transformers for classification, NER, sentiment, and generation.

This is most likely a fork of the sw-nlp-pipeline-builder skill from openclaw
npx playbooks add skill anton-abyzov/specweave --skill nlp-pipeline-builder

SKILL.md
---
name: nlp-pipeline-builder
description: |
  Natural language processing ML pipelines for text classification, NER, sentiment analysis, text generation, and embeddings. Activates for "nlp", "text classification", "sentiment analysis", "named entity recognition", "BERT", "transformers", "text preprocessing", "tokenization", "word embeddings". Builds NLP pipelines with transformers, integrated with SpecWeave increments.
---

# NLP Pipeline Builder

## Overview

Specialized ML pipelines for natural language processing. Handles text preprocessing, tokenization, transformer models (BERT, RoBERTa, GPT), fine-tuning, and deployment for production NLP systems.

## NLP Tasks Supported

### 1. Text Classification

```python
from specweave import NLPPipeline

# Binary or multi-class text classification
pipeline = NLPPipeline(
    task="classification",
    classes=["positive", "negative", "neutral"],
    increment="0042"
)

# Automatically configures:
# - Text preprocessing (lowercase, clean)
# - Tokenization (BERT tokenizer)
# - Model (BERT, RoBERTa, DistilBERT)
# - Fine-tuning on your data
# - Inference pipeline

pipeline.fit(train_texts, train_labels)
```
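The names passed to `classes` have to become integer ids for training and be mapped back to names at inference. How `NLPPipeline` handles this internally is not documented here, so the following is only a minimal standalone sketch. `build_label_maps` and `decode_logits` are hypothetical helper names, not part of SpecWeave.

```python
import math


def build_label_maps(classes):
    """Map class names to integer ids and back, preserving the given order."""
    label2id = {name: idx for idx, name in enumerate(classes)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


def decode_logits(logits, id2label):
    """Turn raw model scores into (label, probability) via a stable softmax."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


label2id, id2label = build_label_maps(["positive", "negative", "neutral"])
label, prob = decode_logits([2.0, 0.5, 0.1], id2label)
print(label, round(prob, 3))
```

Keeping both maps next to the model checkpoint avoids silent label shuffles when the class list is reordered between training runs.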

### 2. Named Entity Recognition (NER)

```python
# Extract entities from text
pipeline = NLPPipeline(
    task="ner",
    entities=["PERSON", "ORG", "LOC", "DATE"],
    increment="0042"
)

# Returns: [(entity_text, entity_type, start_pos, end_pos), ...]
```
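Token-classification models usually emit one BIO tag per token, which then has to be merged into the tuple format shown above. The sketch below illustrates that merge step. It assumes character offsets with an exclusive end, and `bio_to_entities` is a hypothetical helper rather than a SpecWeave function.

```python
def bio_to_entities(text, token_spans, tags):
    """Merge per-token BIO tags into (entity_text, entity_type, start, end).

    token_spans: list of (start, end) character offsets, end exclusive.
    tags: one BIO tag per token, e.g. "B-PERSON", "I-PERSON", "O".
    """
    entities = []
    current = None  # [entity_type, start, end] of the entity being built

    for (start, end), tag in zip(token_spans, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "I" and current and current[0] == etype:
            current[2] = end  # extend the open entity
            continue
        if current:
            entities.append((text[current[1]:current[2]], *current))
        # A stray "I-" tag with no open entity is treated as a new entity.
        current = [etype, start, end] if prefix in ("B", "I") else None

    if current:
        entities.append((text[current[1]:current[2]], *current))
    return entities


text = "Ada Lovelace joined Acme in 1843"
spans = [(0, 3), (4, 12), (13, 19), (20, 24), (25, 27), (28, 32)]
tags = ["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "B-DATE"]
entities = bio_to_entities(text, spans, tags)
print(entities)
```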

### 3. Sentiment Analysis

```python
# Sentiment classification (specialized)
pipeline = NLPPipeline(
    task="sentiment",
    increment="0042"
)

# Fine-tuned for sentiment (positive/negative/neutral)
```

### 4. Text Generation

```python
# Generate text continuations
pipeline = NLPPipeline(
    task="generation",
    model="gpt2",
    increment="0042"
)

# Fine-tune on your domain-specific text
```

## Best Practices for NLP

### Text Preprocessing

```python
from specweave import TextPreprocessor

preprocessor = TextPreprocessor(increment="0042")

# Standard preprocessing
preprocessor.add_steps([
    "lowercase",
    "remove_html",
    "remove_urls",
    "remove_emails",
    "remove_special_chars",
    "remove_extra_whitespace"
])

# Advanced preprocessing (optional; these steps tend to help classical
# bag-of-words models more than transformer models)
preprocessor.add_advanced([
    "spell_correction",
    "lemmatization",
    "stopword_removal"
])
```
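To make the standard step names concrete, here is a rough standard-library sketch of what each one might do, applied in the same order as the list above. The regular expressions are illustrative assumptions, not the patterns `TextPreprocessor` actually uses, and `clean_text` is a hypothetical name.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")                   # remove_html
_URL_RE = re.compile(r"(?:https?://|www\.)\S+")    # remove_urls
_EMAIL_RE = re.compile(r"\S+@\S+\.\S+")            # remove_emails
_SPECIAL_RE = re.compile(r"[^a-z0-9\s.,!?'-]")     # remove_special_chars
_SPACE_RE = re.compile(r"\s+")                     # remove_extra_whitespace


def clean_text(text):
    """Apply the six standard cleaning steps in a fixed order."""
    text = text.lower()
    text = html.unescape(_TAG_RE.sub(" ", text))  # drop tags, decode entities
    text = _URL_RE.sub(" ", text)
    text = _EMAIL_RE.sub(" ", text)
    text = _SPECIAL_RE.sub(" ", text)  # keeps letters, digits, basic punctuation
    return _SPACE_RE.sub(" ", text).strip()


raw = "<p>Great   product!</p> Visit https://example.com or mail me@example.com &amp; enjoy ©"
print(clean_text(raw))
```

Order matters here: URLs and emails are removed before special characters, because stripping `:`, `/`, and `@` first would leave fragments that the later patterns no longer match.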

### Model Selection

**Text Classification**:
- Small datasets (<10K): DistilBERT (about 60% faster and 40% smaller than BERT-base)
- Medium datasets (10K-100K): BERT-base
- Large datasets (>100K): RoBERTa-large

**NER**:
- General: BERT + CRF layer
- Domain-specific: Fine-tune BERT on domain corpus

**Sentiment**:
- Product reviews: DistilBERT fine-tuned on Amazon reviews
- Social media: RoBERTa fine-tuned on Twitter
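
The dataset-size rule of thumb for text classification can be written as a small lookup. This is only a sketch of the guideline above: `suggest_classifier_checkpoint` is a hypothetical helper, and the returned Hugging Face Hub checkpoint names are assumptions about which variants are meant.

```python
def suggest_classifier_checkpoint(num_examples):
    """Pick a starting checkpoint from the number of labeled examples."""
    if num_examples < 0:
        raise ValueError("num_examples must be non-negative")
    if num_examples < 10_000:
        return "distilbert-base-uncased"  # small: fast, fewer parameters to fit
    if num_examples <= 100_000:
        return "bert-base-uncased"        # medium
    return "roberta-large"                # large: enough data for a bigger model


for n in (2_500, 40_000, 750_000):
    print(n, suggest_classifier_checkpoint(n))
```

Treat the result as a baseline to compare against rather than a final choice. Latency and memory budgets often matter as much as dataset size.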

### Transfer Learning

```python
# Start from pre-trained language models
pipeline = NLPPipeline(task="classification")

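# Option 1: Use a pre-trained checkpoint as-is (no fine-tuning).
# A base checkpoint has no trained task head, so prefer one that was
# already fine-tuned for a task similar to yours.
pipeline.use_pretrained("distilbert-base-uncased")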

# Option 2: Fine-tune on your data
pipeline.use_pretrained_and_finetune(
    model="bert-base-uncased",
    epochs=3,
    learning_rate=2e-5
)
```

### Handling Long Text

```python
# For text longer than 512 tokens
pipeline = NLPPipeline(
    task="classification",
    max_length=512,
    truncation_strategy="head_and_tail"  # Keep start + end
)

# Or use Longformer for long documents
pipeline.use_model("longformer")  # Handles up to 4096 tokens
```
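
The `head_and_tail` strategy keeps the beginning and the end of an over-long sequence and drops the middle. The sketch below shows the idea on a plain list of token ids. The 25% head share is an assumed default, `head_and_tail` is a hypothetical function name, and `max_length` here counts content tokens only, so a real tokenizer budget would also need room for special tokens.

```python
def head_and_tail(token_ids, max_length=512, head_fraction=0.25):
    """Keep the first and last tokens of a sequence longer than max_length."""
    if not 0.0 <= head_fraction <= 1.0:
        raise ValueError("head_fraction must be between 0 and 1")
    if len(token_ids) <= max_length:
        return list(token_ids)  # short enough, nothing to drop
    head_len = int(max_length * head_fraction)
    tail_len = max_length - head_len
    head = list(token_ids[:head_len])
    # Slice from an explicit index so tail_len == 0 yields an empty tail.
    tail = list(token_ids[len(token_ids) - tail_len:])
    return head + tail


ids = list(range(1000))
truncated = head_and_tail(ids, max_length=512)
print(len(truncated), truncated[:3], truncated[-3:])
```

This suits documents whose key information sits in the introduction and conclusion. When evidence can appear anywhere, chunking the document is usually the safer choice.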

## Integration with SpecWeave

```text
# NLP increment structure
.specweave/increments/0042-sentiment-classifier/
├── spec.md
├── data/
│   ├── train.csv
│   ├── val.csv
│   └── test.csv
├── models/
│   ├── tokenizer/
│   ├── model-epoch-1/
│   ├── model-epoch-2/
│   └── model-epoch-3/
├── experiments/
│   ├── distilbert-baseline/
│   ├── bert-base-finetuned/
│   └── roberta-large/
└── deployment/
    ├── model.onnx
    └── inference.py
```
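
If you want to lay out this skeleton by hand, a few lines of `pathlib` are enough. Whether SpecWeave creates these folders itself is not stated here, so treat this as a manual sketch. `scaffold_increment` is a hypothetical helper, and the example writes into a temporary directory rather than a real project.

```python
import tempfile
from pathlib import Path

# Sub-directories taken from the layout above; checkpoints and
# experiment folders are created later by training runs.
INCREMENT_DIRS = ["data", "models/tokenizer", "experiments", "deployment"]


def scaffold_increment(root, increment_id, name):
    """Create the empty folder skeleton for one NLP increment."""
    base = Path(root) / ".specweave" / "increments" / f"{increment_id}-{name}"
    for rel in INCREMENT_DIRS:
        (base / rel).mkdir(parents=True, exist_ok=True)
    spec = base / "spec.md"
    if not spec.exists():  # never overwrite an existing spec
        spec.write_text(f"# {increment_id}-{name}\n", encoding="utf-8")
    return base


with tempfile.TemporaryDirectory() as tmp:
    base = scaffold_increment(tmp, "0042", "sentiment-classifier")
    print(sorted(p.relative_to(base).as_posix() for p in base.rglob("*")))
```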

## Commands

```bash
/ml:nlp-pipeline --task classification --model bert-base
/ml:nlp-evaluate 0042  # Evaluate on test set
/ml:nlp-deploy 0042    # Export for production
```

Quick setup for NLP projects with state-of-the-art transformer models.

Overview

This skill builds end-to-end NLP pipelines for production text tasks using transformers and SpecWeave increments. It supports text classification, named entity recognition (NER), sentiment analysis, text generation, and embeddings, with automated preprocessing, tokenization, fine-tuning, and deployment. The implementation is TypeScript-friendly and designed to integrate with SpecWeave increment workflows for traceable experiments and deployment artifacts.

How this skill works

The skill configures a pipeline object for your chosen task, auto-applying preprocessing steps (cleaning, lowercasing, optional lemmatization), selecting tokenizers, and wiring transformer models (BERT, RoBERTa, DistilBERT, Longformer, GPT variants). It supports both using pretrained weights and fine-tuning on your dataset, produces model checkpoints and inference artifacts, and exports deployment-ready formats (e.g., ONNX + inference script) organized per SpecWeave increment. Evaluation and deployment commands automate test runs and packaging.

When to use it

  • Building production-grade text classifiers for binary or multi-class tasks
  • Extracting structured entities from documents with NER models
  • Deploying sentiment analysis tuned to product reviews or social media
  • Generating domain-specific text or continuations via fine-tuned generation models
  • Creating embeddings for search, clustering, or downstream retrieval tasks

Best practices

  • Start with a strong pretrained model and fine-tune on domain data when possible to improve accuracy
  • Apply consistent preprocessing steps (remove HTML/URLs, lowercase, normalize whitespace) and document them in the increment
  • Choose model size to match dataset scale: DistilBERT for small, BERT-base for medium, RoBERTa-large for very large datasets
  • Use Longformer or chunking/truncation strategies for documents longer than 512 tokens
  • Version experiments and artifacts inside a SpecWeave increment to keep training, data, and deployment reproducible

Example use cases

  • Customer support triage: multi-class classification to route tickets to the right team
  • Contract analysis: NER to extract parties, dates, and obligations from legal texts
  • Brand monitoring: sentiment analysis on social media streams fine-tuned for slang and emojis
  • Domain-specific content generation: fine-tune GPT variants on company knowledge for synopsis generation
  • Semantic search: build embeddings to power similarity search and recommendation systems
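
For the semantic search case, retrieval over embeddings often comes down to ranking documents by cosine similarity to a query vector. The sketch below uses tiny hand-written vectors in place of real model embeddings. `cosine` and `top_k` are hypothetical helper names, and a production system would normally use a vector index instead of a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=3):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = sorted(
        ((cosine(query, vec), doc_id) for doc_id, vec in corpus.items()),
        reverse=True,
    )
    return [(doc_id, score) for score, doc_id in scored[:k]]


corpus = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = top_k([1.0, 0.1], corpus, k=2)
print(results)
```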

FAQ

Can I use this with very long documents?

Yes. Either use a model built for long contexts (Longformer, up to 4,096 tokens), apply a truncation strategy such as head_and_tail, or split the document into chunks and aggregate the per-chunk predictions.
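
Chunking with aggregation can be sketched as overlapping windows over the token sequence, with per-chunk class probabilities averaged into one document-level prediction. The window and stride values are assumptions, mean pooling is only one possible aggregation, and `chunk_tokens` and `aggregate_mean` are hypothetical helper names.

```python
def chunk_tokens(token_ids, window=512, stride=256):
    """Split a token sequence into overlapping windows of at most `window` tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # this window already reaches the end
        start += stride
    return chunks


def aggregate_mean(chunk_probs):
    """Average per-class probabilities across all chunks of one document."""
    count = len(chunk_probs)
    return [sum(column) / count for column in zip(*chunk_probs)]


chunks = chunk_tokens(list(range(1000)), window=512, stride=256)
print([len(c) for c in chunks])

# Stand-in probabilities; in practice each row comes from the model.
doc_probs = aggregate_mean([[0.8, 0.2], [0.4, 0.6]])
print(doc_probs)
```

Overlapping windows reduce the chance that a decisive phrase is cut in half at a chunk boundary, at the cost of extra inference passes.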

Does it support production exports?

Yes. Pipelines produce deployment artifacts (model.onnx, inference.py) and organize them inside a SpecWeave increment for reproducible deployment.