This skill enables advanced natural language processing tasks such as text preprocessing, sentiment analysis, NER, and language model integration for analytics.
`npx playbooks add skill pluginagentmarketplace/custom-plugin-ai-data-scientist --skill nlp-processing`

Review the files below or copy the command above to add this skill to your agents.
---
name: nlp-processing
description: Text processing, sentiment analysis, LLMs, and NLP frameworks. Use for text classification, named entity recognition, or language models.
sasmp_version: "1.3.0"
bonded_agent: 04-machine-learning-ai
bond_type: SECONDARY_BOND
---
# Natural Language Processing
Process, analyze, and understand text data with modern NLP techniques.
## Quick Start
### Text Preprocessing
```python
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# One-time downloads of the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return ' '.join(tokens)
```
### Sentiment Analysis
```python
from transformers import pipeline
# Pre-trained model
sentiment_analyzer = pipeline("sentiment-analysis")
result = sentiment_analyzer("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Custom model (documents: list of raw texts, labels: parallel list of class labels)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(documents)
model = LogisticRegression()
model.fit(X, labels)
```
## TF-IDF Vectorization
```python
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),  # Unigrams and bigrams
    min_df=2,            # Ignore terms appearing in fewer than 2 documents
    max_df=0.8           # Ignore terms appearing in more than 80% of documents
)
X = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()
```
## Named Entity Recognition
```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in California.")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Apple Inc.: ORG
# Steve Jobs: PERSON
# California: GPE
```
## BERT for Text Classification
```python
from transformers import (
BertTokenizer, BertForSequenceClassification,
Trainer, TrainingArguments
)
# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)
# Tokenize (dataset: a Hugging Face DatasetDict with 'text' and 'label' columns)
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Train
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy='epoch'
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)
trainer.train()
```
## Text Generation with GPT
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
input_text = "The future of AI is"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
output = model.generate(
    input_ids,
    max_length=50,
    num_return_sequences=1,
    do_sample=True,   # sampling is required for temperature/top_k/top_p to take effect
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
## Topic Modeling with LDA
```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=1000, max_df=0.8, min_df=2)
X = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(X)
# Display topics
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    # 10 highest-weight words per topic, most important first
    top_words = [feature_names[i] for i in topic.argsort()[-10:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
```
## Word Embeddings
```python
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
# Train Word2Vec on tokenized documents
sentences = [word_tokenize(doc) for doc in documents]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
# Get vector
vector = model.wv['king']
# Find similar words
similar = model.wv.most_similar('king', topn=5)
```
## Common Tasks
**Text Classification:**
- Sentiment analysis
- Spam detection
- Intent classification
- Topic categorization
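
Intent and topic classification can be prototyped without any labeled data using a zero-shot pipeline. A minimal sketch, assuming the Hugging Face transformers zero-shot pipeline and its default checkpoint (downloaded on first use); the example text and candidate labels are illustrative:

```python
from transformers import pipeline

# Zero-shot classification scores a text against arbitrary candidate labels
classifier = pipeline("zero-shot-classification")

result = classifier(
    "Please cancel my subscription and refund last month's charge.",
    candidate_labels=["billing", "technical support", "sales", "feedback"]
)
# Labels are returned sorted by score, highest first
print(result["labels"][0], result["scores"][0])
```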
**Sequence Labeling:**
- Named Entity Recognition (NER)
- Part-of-Speech (POS) tagging
- Keyword extraction
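
Part-of-speech tagging reuses the same spaCy pipeline shown for NER above; a minimal sketch, assuming `en_core_web_sm` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in California.")

# Each token carries a coarse POS tag (pos_) and a fine-grained tag (tag_)
for token in doc:
    print(f"{token.text}: {token.pos_} ({token.tag_})")
```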
**Generation:**
- Text summarization (see the sketch after this list)
- Machine translation
- Chatbots
- Code generation
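
The generation tasks above can also be prototyped with pre-trained pipelines before any fine-tuning. A minimal summarization sketch, assuming the transformers summarization pipeline and its default checkpoint; the input text is illustrative:

```python
from transformers import pipeline

# Abstractive summarization with a pre-trained model
summarizer = pipeline("summarization")

article = (
    "Natural language processing combines linguistics and machine learning "
    "to let computers read, interpret, and generate human language. Modern "
    "systems rely heavily on pre-trained transformer models that are "
    "fine-tuned on task-specific data."
)
summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```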
## Best Practices
1. Clean text (remove noise, normalize)
2. Handle class imbalance (see the sketch after this list)
3. Use pre-trained models when possible
4. Fine-tune on domain-specific data
5. Validate with diverse test data
6. Monitor for bias and fairness
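
For point 2, a minimal sketch of handling class imbalance with scikit-learn's built-in class weighting; it reuses the `X` and `labels` arrays from the TF-IDF classification example above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression

# Inspect how skewed the label distribution is
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
print(dict(zip(classes, weights)))

# class_weight='balanced' reweights the loss so minority classes
# contribute as much as the majority class during training
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, labels)
```

For severe imbalance, resampling the minority class or tuning the decision threshold on predicted probabilities are common complements to class weighting.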
This skill provides a practical toolkit for text processing, sentiment analysis, and working with modern NLP frameworks and language models. It combines preprocessing utilities, vectorization, sequence labeling, embedding methods, and end-to-end examples for classification and generation. Use it to accelerate prototyping and production workflows for NLP tasks.
The skill exposes common pipelines: text cleaning and tokenization, TF-IDF and count-based vectorization, topic modeling with LDA, and embedding training with Word2Vec. It also shows how to run or fine-tune transformer models for sentiment analysis, BERT classification, and GPT-style text generation. Sequence-labeling examples demonstrate named entity recognition with spaCy and evaluation-ready Trainer workflows for supervised tasks.
**Which vectorization approach should I pick for classification?**
Use TF-IDF for fast, interpretable baselines and small datasets; use embeddings or transformer encodings when semantic understanding or transfer learning is required.
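
As a rough sketch of both options side by side; the dense-embedding path assumes the optional sentence-transformers package and the all-MiniLM-L6-v2 checkpoint, which are not part of the core examples above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Option 1: TF-IDF baseline -- fast, interpretable, strong on small datasets
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(documents)
LogisticRegression(max_iter=1000).fit(X_tfidf, labels)

# Option 2: dense sentence embeddings -- better semantic generalization
# (assumes: pip install sentence-transformers)
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X_emb = encoder.encode(documents)  # shape: (n_docs, 384) for this checkpoint
LogisticRegression(max_iter=1000).fit(X_emb, labels)
```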
**How do I handle long documents with transformers?**
Truncate or chunk long texts with overlap, use hierarchical models, or leverage transformer variants designed for long context (e.g., Longformer).
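
A minimal chunking sketch using plain token-id slicing with overlap; the tokenizer matches the BERT example above, and `long_document` stands for any text longer than the model's 512-token input limit:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def chunk_text(text, max_tokens=510, overlap=50):
    """Split a long document into overlapping chunks that fit the model."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    start = 0
    while start < len(token_ids):
        window = token_ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(token_ids):
            break
        start += max_tokens - overlap  # step forward, keeping some overlap
    return chunks

# Classify each chunk separately, then aggregate the scores
# (e.g., by averaging) into a document-level prediction.
chunks = chunk_text(long_document)
```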