To add this skill to your agents:

```bash
npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill keyword-extractor
```
---
name: keyword-extractor
description: Extract keywords and key phrases from text using TF-IDF, RAKE, and frequency analysis. Generate word clouds and export to various formats.
---
# Keyword Extractor
Extract important keywords and key phrases from text documents using multiple algorithms. Supports TF-IDF, RAKE, and simple frequency analysis with word cloud visualization.
## Quick Start
```python
from scripts.keyword_extractor import KeywordExtractor

# Extract keywords
extractor = KeywordExtractor()
keywords = extractor.extract("Your long text document here...")
print(keywords[:10])  # Top 10 keywords

# From file
keywords = extractor.extract_from_file("document.txt")
extractor.to_wordcloud("keywords.png")
```
## Features
- **Multiple Algorithms**: TF-IDF, RAKE, frequency-based
- **Key Phrases**: Extract multi-word phrases, not just single words
- **Scoring**: Relevance scores for ranking
- **Stopword Filtering**: Built-in + custom stopwords
- **N-gram Support**: Unigrams, bigrams, trigrams (see the sketch after this list)
- **Word Cloud**: Visualize keyword importance
- **Batch Processing**: Process multiple documents
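To make the n-gram support concrete, here is a minimal sketch of how a `(1, 3)` range expands tokens into candidate unigrams, bigrams, and trigrams (`ngrams_from_tokens` is an illustrative helper, not part of this skill's API):

```python
def ngrams_from_tokens(tokens, ngram_range=(1, 3)):
    """Yield space-joined n-grams for every n in the inclusive range."""
    lo, hi = ngram_range
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

tokens = "machine learning models need data".split()
print(list(ngrams_from_tokens(tokens, (1, 2))))
# ['machine', 'learning', 'models', 'need', 'data',
#  'machine learning', 'learning models', 'models need', 'need data']
```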
## API Reference
### Initialization
```python
extractor = KeywordExtractor(
method="tfidf", # tfidf, rake, frequency
max_keywords=20, # Maximum keywords to return
min_word_length=3, # Minimum word length
ngram_range=(1, 3) # Unigrams to trigrams
)
```
### Extraction Methods
```python
# TF-IDF (best for comparing documents)
keywords = extractor.extract(text, method="tfidf")

# RAKE (best for key phrases)
keywords = extractor.extract(text, method="rake")

# Frequency (simple word counts)
keywords = extractor.extract(text, method="frequency")
```
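For intuition, the frequency method boils down to counting non-stopword tokens. A stand-alone sketch using `collections.Counter` and NLTK's English stopword list (`frequency_keywords` is illustrative, not this skill's internals):

```python
import re
from collections import Counter

from nltk.corpus import stopwords  # requires nltk.download("stopwords")

def frequency_keywords(text, top_n=10, min_word_length=3):
    """Count non-stopword tokens and return (word, score) pairs."""
    stop = set(stopwords.words("english"))
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if len(t) >= min_word_length and t not in stop]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    # Normalize to 0-1 scores so the output shape matches extract()
    return [(word, count / total) for word, count in counts.most_common(top_n)]
```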
### Results Format
```python
keywords = extractor.extract(text)
# Returns list of tuples: [(keyword, score), ...]
# [('machine learning', 0.85), ('data science', 0.72), ...]

# Get just keywords
keyword_list = extractor.get_keywords(text)
# ['machine learning', 'data science', ...]
```
### Customization
```python
# Add custom stopwords
extractor.add_stopwords(['company', 'product', 'service'])

# Set minimum frequency
extractor.min_frequency = 2

# Filter by part of speech (nouns only)
extractor.pos_filter = ['NN', 'NNS', 'NNP']
```
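The `pos_filter` values are Penn Treebank tags (`NN` singular noun, `NNS` plural noun, `NNP` proper noun). A sketch of how such a filter could be applied with NLTK's tagger (`keep_nouns` is an illustrative helper, not part of this skill's API):

```python
import nltk  # requires nltk.download("averaged_perceptron_tagger")

def keep_nouns(tokens, pos_filter=("NN", "NNS", "NNP")):
    """Drop tokens whose part-of-speech tag is not in pos_filter."""
    return [word for word, tag in nltk.pos_tag(tokens) if tag in pos_filter]

print(keep_nouns(["neural", "networks", "advance", "quickly"]))
# e.g. ['networks'] -- exact tags depend on the tagger model
```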
### Visualization
```python
# Generate word cloud
extractor.to_wordcloud("wordcloud.png", colormap="viridis")

# Bar chart of top keywords
extractor.plot_keywords("keywords.png", top_n=15)
```
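If you want to reproduce the word cloud step yourself, the `wordcloud` package accepts score dictionaries directly. A minimal sketch, assuming `keywords` is the `(keyword, score)` list returned by `extract()`:

```python
from wordcloud import WordCloud

keywords = [("machine learning", 0.85), ("data science", 0.72), ("neural networks", 0.61)]

# Map keyword -> score so word size reflects relevance, not raw frequency
cloud = WordCloud(width=800, height=400, colormap="viridis",
                  background_color="white").generate_from_frequencies(dict(keywords))
cloud.to_file("wordcloud.png")
```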
### Export
```python
# To JSON
extractor.to_json("keywords.json")

# To CSV
extractor.to_csv("keywords.csv")

# To plain text
extractor.to_text("keywords.txt")
```
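The exporters presumably serialize the same `(keyword, score)` pairs; a stdlib-only sketch of equivalent JSON and CSV output (an illustration, not the skill's internals):

```python
import csv
import json

keywords = [("machine learning", 0.85), ("data science", 0.72)]

# JSON: a list of {"keyword": ..., "score": ...} objects
with open("keywords.json", "w") as f:
    json.dump([{"keyword": k, "score": s} for k, s in keywords], f, indent=2)

# CSV: a header row plus one row per keyword
with open("keywords.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["keyword", "score"])
    writer.writerows(keywords)
```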
## CLI Usage
```bash
# Extract from text
python keyword_extractor.py --text "Your text here" --top 10

# Extract from file
python keyword_extractor.py --input document.txt --method tfidf --output keywords.json

# Generate word cloud
python keyword_extractor.py --input document.txt --wordcloud cloud.png

# Batch process a directory
python keyword_extractor.py --input-dir ./docs --output keywords_all.csv
```
### CLI Arguments
| Argument | Description | Default |
|----------|-------------|---------|
| `--text` | Text to analyze | - |
| `--input` | Input file path | - |
| `--input-dir` | Directory of files | - |
| `--output` | Output file | - |
| `--method` | Algorithm (tfidf, rake, frequency) | `tfidf` |
| `--top` | Number of keywords | 20 |
| `--ngrams` | N-gram range (e.g., "1,2") | `1,3` |
| `--wordcloud` | Word cloud output path | - |
| `--stopwords` | Custom stopwords file | - |
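For orientation, the flag table maps naturally onto `argparse`. A hedged sketch of how the parsing side might be wired (not the actual `keyword_extractor.py`):

```python
import argparse

parser = argparse.ArgumentParser(description="Extract keywords from text")
parser.add_argument("--text", help="Text to analyze")
parser.add_argument("--input", help="Input file path")
parser.add_argument("--method", default="tfidf",
                    choices=["tfidf", "rake", "frequency"])
parser.add_argument("--top", type=int, default=20, help="Number of keywords")
parser.add_argument("--ngrams", default="1,3", help='N-gram range, e.g. "1,2"')
args = parser.parse_args()

# "1,3" -> (1, 3), the shape KeywordExtractor's ngram_range expects
ngram_range = tuple(int(n) for n in args.ngrams.split(","))
```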
## Examples
### Article Keyword Extraction
```python
extractor = KeywordExtractor(method="tfidf")
article = """
Machine learning is transforming data science. Deep learning models
are achieving state-of-the-art results in natural language processing
and computer vision. Neural networks continue to advance...
"""
keywords = extractor.extract(article, top_n=10)
for keyword, score in keywords:
print(f"{score:.3f}: {keyword}")
```
### Compare Multiple Documents
```python
from pathlib import Path

extractor = KeywordExtractor(method="tfidf")

docs = [
    Path("doc1.txt").read_text(),
    Path("doc2.txt").read_text(),
    Path("doc3.txt").read_text(),
]

# Extract keywords from each
for i, doc in enumerate(docs):
keywords = extractor.extract(doc, top_n=5)
print(f"\nDocument {i+1}:")
for kw, score in keywords:
print(f" {kw}: {score:.3f}")
```
### SEO Keyword Research
```python
from pathlib import Path

extractor = KeywordExtractor(
    method="rake",
    ngram_range=(2, 4),  # Focus on phrases
    max_keywords=30,
)

# Raw HTML includes tag names and attributes; strip markup first for cleaner keywords
webpage_content = Path("page.html").read_text()
keywords = extractor.extract(webpage_content)
# Filter by score threshold
high_value = [(kw, s) for kw, s in keywords if s > 0.5]
print("High-value keywords for SEO:")
for kw, _ in high_value:
    print(f"  {kw}")
```
## Algorithm Comparison
| Algorithm | Best For | Strengths |
|-----------|----------|-----------|
| **TF-IDF** | Document comparison | Finds unique terms, good for search |
| **RAKE** | Key phrases | Extracts multi-word concepts |
| **Frequency** | Quick overview | Simple, fast, interpretable |
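To make the TF-IDF row concrete: a term scores highly when it is frequent in one document but rare across the corpus. A hand-rolled sketch of the textbook formula `tf(t, d) * log(N / df(t))`; production implementations such as scikit-learn's add smoothing:

```python
import math
from collections import Counter

docs = [
    "machine learning models learn from data".split(),
    "databases store data in tables".split(),
]

def tfidf(term, doc, docs):
    tf = Counter(doc)[term] / len(doc)    # how often the term appears here
    df = sum(term in d for d in docs)     # how many documents contain it
    return tf * math.log(len(docs) / df)  # frequent here, rare elsewhere

print(tfidf("learning", docs[0], docs))  # > 0: unique to the first document
print(tfidf("data", docs[0], docs))      # 0.0: appears in every document
```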
## Dependencies
```
scikit-learn>=1.2.0
nltk>=3.8.0
pandas>=2.0.0
matplotlib>=3.7.0
wordcloud>=1.9.0
```
## Limitations
- Optimized for English; other languages need language-specific stopword lists (see the sketch after this list)
- Very short texts may not have enough data for TF-IDF
- Domain-specific jargon may need custom stopword handling
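For non-English text, one workable approach (an assumption that NLTK's stopword corpora cover your language, not a built-in feature of this skill) is to load a language-specific stopword list and pass it to `add_stopwords`:

```python
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

from scripts.keyword_extractor import KeywordExtractor

extractor = KeywordExtractor(method="frequency")

# NLTK ships stopword lists for a few dozen languages, e.g. "german", "spanish"
extractor.add_stopwords(stopwords.words("german"))

german_text = "..."  # your non-English document
keywords = extractor.extract(german_text)
```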
## FAQ

**Which algorithm should I pick for SEO phrase discovery?**
Use RAKE with an n-gram range that includes bigrams and trigrams; RAKE is designed to surface the multi-word phrases that matter most for SEO.

**How do I handle domain-specific jargon?**
Add custom stopwords and raise `min_frequency` or `min_word_length` so irrelevant jargon is filtered out; you can also restrict results to noun POS tags with `pos_filter`.