
nlp-processing skill

/skills/nlp-processing

This skill enables advanced natural language processing for analytics, including text preprocessing, sentiment analysis, named entity recognition (NER), and language model integration.

npx playbooks add skill pluginagentmarketplace/custom-plugin-ai-data-scientist --skill nlp-processing

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
---
name: nlp-processing
description: Text processing, sentiment analysis, LLMs, and NLP frameworks. Use for text classification, named entity recognition, or language models.
sasmp_version: "1.3.0"
bonded_agent: 04-machine-learning-ai
bond_type: SECONDARY_BOND
---

# Natural Language Processing

Process, analyze, and understand text data with modern NLP techniques.

## Quick Start

### Text Preprocessing
```python
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def preprocess_text(text):
    # Lowercase
    text = text.lower()

    # Remove special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]

    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in tokens]

    return ' '.join(tokens)
```
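A quick check on a sample sentence (the output shown assumes the standard English NLTK resources; stopwords are dropped and plural nouns lemmatized):

```python
print(preprocess_text("The cats are running faster than the dogs!"))
# 'cat running faster dog'
```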

### Sentiment Analysis
```python
from transformers import pipeline

# Pre-trained model
sentiment_analyzer = pipeline("sentiment-analysis")
result = sentiment_analyzer("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Custom TF-IDF + logistic regression model
# (documents: list of training texts, labels: their sentiment classes)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(documents)

model = LogisticRegression()
model.fit(X, labels)
```

## TF-IDF Vectorization

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),  # Unigrams and bigrams
    min_df=2,            # Minimum document frequency
    max_df=0.8           # Maximum document frequency
)

X = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()
```

## Named Entity Recognition

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in California.")

for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Apple Inc.: ORG
# Steve Jobs: PERSON
# California: GPE
```

## BERT for Text Classification

```python
from transformers import (
    BertTokenizer, BertForSequenceClassification,
    Trainer, TrainingArguments
)

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

# Tokenize
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )

# dataset: a datasets.DatasetDict with 'train'/'test' splits and a 'text' column
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Train
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy='epoch'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

trainer.train()
```

## Text Generation with GPT

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

input_text = "The future of AI is"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

output = model.generate(
    input_ids,
    max_length=50,
    num_return_sequences=1,
    do_sample=True,   # sampling must be enabled for temperature/top_k/top_p to take effect
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token; avoids a warning
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```

## Topic Modeling with LDA

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000, max_df=0.8, min_df=2)
X = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(X)

# Display topics
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[-10:]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
```

## Word Embeddings

```python
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Train Word2Vec on tokenized documents
sentences = [word_tokenize(doc) for doc in documents]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Get vector
vector = model.wv['king']

# Find similar words
similar = model.wv.most_similar('king', topn=5)
```

## Common Tasks

**Text Classification:**
- Sentiment analysis
- Spam detection
- Intent classification
- Topic categorization

**Sequence Labeling:**
- Named Entity Recognition (NER)
- Part-of-Speech (POS) tagging
- Keyword extraction

**Generation:**
- Text summarization (see the sketch after this list)
- Machine translation
- Chatbots
- Code generation
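
As a concrete example of the generation tasks above, here is a minimal summarization sketch using the Hugging Face pipeline API (the article text is illustrative, and the pipeline downloads a default pre-trained summarization model on first use):

```python
from transformers import pipeline

# Summarization pipeline with the library's default pre-trained model
summarizer = pipeline("summarization")

article = (
    "Natural language processing combines linguistics and machine learning "
    "so that computers can analyze, understand, and generate human language. "
    "Modern systems rely heavily on pre-trained transformer models that are "
    "fine-tuned on task-specific data."
)

summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]['summary_text'])
```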

## Best Practices

1. Clean text (remove noise, normalize)
2. Handle class imbalance (see the sketch below)
3. Use pre-trained models when possible
4. Fine-tune on domain-specific data
5. Validate with diverse test data
6. Monitor for bias and fairness
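
For practice 2, one simple option is a class-weighted loss; a minimal sketch using scikit-learn's class_weight option (documents and labels are assumed to be your texts and their, possibly imbalanced, class labels):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# documents: list of texts, labels: imbalanced class labels (assumed available)
X = TfidfVectorizer(max_features=5000).fit_transform(documents)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)

# 'balanced' reweights the loss inversely to each class's frequency
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```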

Overview

This skill provides a practical toolkit for text processing, sentiment analysis, and working with modern NLP frameworks and language models. It combines preprocessing utilities, vectorization, sequence labeling, embedding methods, and end-to-end examples for classification and generation. Use it to accelerate prototyping and production workflows for NLP tasks.

How this skill works

The skill exposes common pipelines: text cleaning and tokenization, TF-IDF and count-based vectorization, topic modeling with LDA, and embedding training with Word2Vec. It also shows how to run or fine-tune transformer models for sentiment analysis, BERT classification, and GPT-style text generation. Sequence-labeling examples demonstrate named entity recognition with spaCy, and the Hugging Face Trainer provides an evaluation-ready workflow for supervised fine-tuning.
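
For instance, the cleaning, vectorization, and classification steps can be chained end to end; a minimal sketch with scikit-learn's Pipeline, reusing the preprocess_text function from the Quick Start (documents and labels are placeholders for your own data):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# preprocess_text (from the Quick Start) is plugged in as the vectorizer's preprocessor
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(preprocessor=preprocess_text, ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000)),
])

text_clf.fit(documents, labels)
predictions = text_clf.predict(["This product exceeded my expectations"])
```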

When to use it

  • When you need robust preprocessing for noisy text before modeling.
  • To build or evaluate text classifiers such as sentiment, intent, or spam detection.
  • To extract entities or POS tags with production-ready NER pipelines.
  • For experimenting with topic modeling or unsupervised content analysis.
  • When prototyping language-model generation or fine-tuning transformers on domain data.

Best practices

  • Clean and normalize text early: lowercase, remove noise, tokenize, remove stopwords, and lemmatize.
  • Prefer pre-trained transformer models and fine-tune on domain-specific labeled data where possible.
  • Use TF-IDF or embeddings depending on task: TF-IDF for lightweight classification, embeddings for semantic tasks.
  • Address class imbalance via resampling or class-weighted loss and validate on diverse test sets.
  • Monitor models for bias and fairness and include human review for sensitive applications.

Example use cases

  • Fine-tune BERT for binary sentiment classification on product reviews.
  • Build a spaCy NER pipeline to extract company names, people, and locations from news articles.
  • Generate draft copy or creative continuations using GPT variants with controlled sampling parameters.
  • Run LDA topic modeling to discover main themes in a large corpus of customer feedback.
  • Train Word2Vec embeddings from domain text to improve semantic search and downstream clustering.

FAQ

Which vectorization approach should I pick for classification?

Use TF-IDF for fast, interpretable baselines and small datasets; use embeddings or transformer encodings when semantic understanding or transfer learning is required.
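
When semantic understanding matters, one common approach is to mean-pool transformer hidden states into document vectors and compare them with cosine similarity; a minimal sketch (the model choice and mean pooling are illustrative assumptions, not the only option):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def embed(texts):
    # Mean-pool the last hidden state over non-padding tokens
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    mask = inputs['attention_mask'].unsqueeze(-1)
    return (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

vecs = embed(["great battery life", "the battery lasts a long time"])
print(torch.nn.functional.cosine_similarity(vecs[0:1], vecs[1:2]).item())
```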

How do I handle long documents with transformers?

Truncate or chunk long texts with overlap, use hierarchical models, or leverage transformer variants designed for long context (e.g., Longformer).
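
For example, a classification task can be handled by scoring overlapping chunks and aggregating; a minimal sketch (the window and stride sizes, the default sentiment model, and averaging as the aggregation rule are all illustrative assumptions):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

def classify_long_text(text, window=400, stride=200):
    # Split the document into overlapping word windows, score each, then aggregate
    words = text.split()
    starts = range(0, max(len(words) - window, 0) + 1, stride)
    chunks = [' '.join(words[i:i + window]) for i in starts]
    results = classifier(chunks, truncation=True)
    # Convert each label/score pair into a positive-class probability and average
    scores = [r['score'] if r['label'] == 'POSITIVE' else 1 - r['score'] for r in results]
    return sum(scores) / len(scores)
```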