This skill enables advanced natural language processing tasks such as text preprocessing, sentiment analysis, NER, and language model integration for analytics.
`npx playbooks add skill pluginagentmarketplace/custom-plugin-ai-data-scientist --skill nlp-processing`

Review the files below or copy the command above to add this skill to your agents.
---
name: nlp-processing
description: Text processing, sentiment analysis, LLMs, and NLP frameworks. Use for text classification, named entity recognition, or language models.
sasmp_version: "1.3.0"
bonded_agent: 04-machine-learning-ai
bond_type: SECONDARY_BOND
---
# Natural Language Processing
Process, analyze, and understand text data with modern NLP techniques.
## Quick Start
### Text Preprocessing
```python
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# One-time downloads of the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return ' '.join(tokens)
```
### Sentiment Analysis
```python
from transformers import pipeline
# Pre-trained model
sentiment_analyzer = pipeline("sentiment-analysis")
result = sentiment_analyzer("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Custom model (documents: list of raw texts, labels: parallel list of class labels)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(documents)
model = LogisticRegression()
model.fit(X, labels)
```
## TF-IDF Vectorization
```python
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),  # Unigrams and bigrams
    min_df=2,            # Ignore terms appearing in fewer than 2 documents
    max_df=0.8           # Ignore terms appearing in more than 80% of documents
)
X = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()
```
## Named Entity Recognition
```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in California.")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Apple Inc.: ORG
# Steve Jobs: PERSON
# California: GPE
```
## BERT for Text Classification
```python
from transformers import (
BertTokenizer, BertForSequenceClassification,
Trainer, TrainingArguments
)
# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)
# Tokenize (dataset: a Hugging Face DatasetDict with 'text' and 'label' columns)
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Train
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy='epoch'
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)
trainer.train()
```
## Text Generation with GPT
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
input_text = "The future of AI is"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
output = model.generate(
    input_ids,
    max_length=50,
    num_return_sequences=1,
    do_sample=True,   # sampling is required for temperature/top_k/top_p to take effect
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
## Topic Modeling with LDA
```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=1000, max_df=0.8, min_df=2)
X = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(X)
# Display topics
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    # 10 highest-weight words per topic, most important first
    top_words = [feature_names[i] for i in topic.argsort()[-10:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
```
## Word Embeddings
```python
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
# Train Word2Vec on tokenized documents
sentences = [word_tokenize(doc) for doc in documents]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
# Get vector
vector = model.wv['king']
# Find similar words
similar = model.wv.most_similar('king', topn=5)
```
## Common Tasks
**Text Classification:**
- Sentiment analysis
- Spam detection
- Intent classification
- Topic categorization
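
Intent and topic classification can be prototyped without any labeled data using a zero-shot pipeline. A minimal sketch, assuming the Hugging Face transformers zero-shot pipeline and its default checkpoint (downloaded on first use); the example text and candidate labels are illustrative:

```python
from transformers import pipeline

# Zero-shot classification scores a text against arbitrary candidate labels
classifier = pipeline("zero-shot-classification")

result = classifier(
    "Please cancel my subscription and refund last month's charge.",
    candidate_labels=["billing", "technical support", "sales", "feedback"]
)
# Labels are returned sorted by score, highest first
print(result["labels"][0], result["scores"][0])
```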
**Sequence Labeling:**
- Named Entity Recognition (NER)
- Part-of-Speech (POS) tagging
- Keyword extraction
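
Part-of-speech tagging reuses the same spaCy pipeline shown for NER above; a minimal sketch, assuming `en_core_web_sm` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in California.")

# Each token carries a coarse POS tag (pos_) and a fine-grained tag (tag_)
for token in doc:
    print(f"{token.text}: {token.pos_} ({token.tag_})")
```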
**Generation:**
- Text summarization (see the sketch after this list)
- Machine translation
- Chatbots
- Code generation
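
The generation tasks above can also be prototyped with pre-trained pipelines before any fine-tuning. A minimal summarization sketch, assuming the transformers summarization pipeline and its default checkpoint; the input text is illustrative:

```python
from transformers import pipeline

# Abstractive summarization with a pre-trained model
summarizer = pipeline("summarization")

article = (
    "Natural language processing combines linguistics and machine learning "
    "to let computers read, interpret, and generate human language. Modern "
    "systems rely heavily on pre-trained transformer models that are "
    "fine-tuned on task-specific data."
)
summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```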
## Best Practices
1. Clean text (remove noise, normalize)
2. Handle class imbalance (see the sketch after this list)
3. Use pre-trained models when possible
4. Fine-tune on domain-specific data
5. Validate with diverse test data
6. Monitor for bias and fairness
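
For point 2, a minimal sketch of handling class imbalance with scikit-learn's built-in class weighting; it reuses the `X` and `labels` arrays from the TF-IDF classification example above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression

# Inspect how skewed the label distribution is
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
print(dict(zip(classes, weights)))

# class_weight='balanced' reweights the loss so minority classes
# contribute as much as the majority class during training
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, labels)
```

For severe imbalance, resampling the minority class or tuning the decision threshold on predicted probabilities are common complements to class weighting.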
This skill provides a practical toolkit for text processing, sentiment analysis, and working with modern NLP frameworks and language models. It combines preprocessing utilities, vectorization, sequence labeling, embedding methods, and end-to-end examples for classification and generation. Use it to accelerate prototyping and production workflows for NLP tasks.
The skill exposes common pipelines: text cleaning and tokenization, TF-IDF and count-based vectorization, topic modeling with LDA, and embedding training with Word2Vec. It also shows how to run or fine-tune transformer models for sentiment analysis, BERT classification, and GPT-style text generation. Sequence-labeling examples demonstrate named entity recognition with spaCy and evaluation-ready Trainer workflows for supervised tasks.
**Which vectorization approach should I pick for classification?**
Use TF-IDF for fast, interpretable baselines and small datasets; use embeddings or transformer encodings when semantic understanding or transfer learning is required.
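
As a rough sketch of both options side by side; the dense-embedding path assumes the optional sentence-transformers package and the all-MiniLM-L6-v2 checkpoint, which are not part of the core examples above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Option 1: TF-IDF baseline -- fast, interpretable, strong on small datasets
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(documents)
LogisticRegression(max_iter=1000).fit(X_tfidf, labels)

# Option 2: dense sentence embeddings -- better semantic generalization
# (assumes: pip install sentence-transformers)
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X_emb = encoder.encode(documents)  # shape: (n_docs, 384) for this checkpoint
LogisticRegression(max_iter=1000).fit(X_emb, labels)
```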
**How do I handle long documents with transformers?**
Truncate or chunk long texts with overlap, use hierarchical models, or leverage transformer variants designed for long context (e.g., Longformer).
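
A minimal chunking sketch using plain token-id slicing with overlap; the tokenizer matches the BERT example above, and `long_document` stands for any text longer than the model's 512-token input limit:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def chunk_text(text, max_tokens=510, overlap=50):
    """Split a long document into overlapping chunks that fit the model."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    start = 0
    while start < len(token_ids):
        window = token_ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(token_ids):
            break
        start += max_tokens - overlap  # step forward, keeping some overlap
    return chunks

# Classify each chunk separately, then aggregate the scores
# (e.g., by averaging) into a document-level prediction.
chunks = chunk_text(long_document)
```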