
named-entity-extractor skill

/named-entity-extractor

This skill extracts named entities from text, enabling document analysis and enrichment through structured outputs.

npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill named-entity-extractor

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
6.4 KB
---
name: named-entity-extractor
description: Extract named entities (people, organizations, locations, dates) from text using NLP. Use for document analysis, information extraction, or data enrichment.
---

# Named Entity Extractor

Extract named entities from text including people, organizations, locations, dates, and more.

## Features

- **Entity Types**: People, organizations, locations, dates, money, percentages
- **Multiple Modes**: spaCy for accuracy, regex for speed
- **Batch Processing**: Process multiple documents
- **Entity Linking**: Group repeated mentions of the same entity across the text
- **Export**: JSON, CSV output formats
- **Visualization**: Entity highlighting

## Quick Start

```python
from entity_extractor import EntityExtractor

extractor = EntityExtractor()

text = "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."

entities = extractor.extract(text)
for entity in entities:
    print(f"{entity['text']}: {entity['type']}")

# Output:
# Apple Inc.: ORG
# Steve Jobs: PERSON
# Cupertino: GPE
# California: GPE
# 1976: DATE
```

## CLI Usage

```bash
# Extract from text
python entity_extractor.py --text "Steve Jobs founded Apple in California."

# Extract from file
python entity_extractor.py --input document.txt

# Batch process folder
python entity_extractor.py --input ./documents/ --output entities.csv

# Filter by entity type
python entity_extractor.py --input document.txt --types PERSON,ORG

# Use regex mode (faster, less accurate)
python entity_extractor.py --input document.txt --mode regex

# JSON output
python entity_extractor.py --input document.txt --json
```

## API Reference

### EntityExtractor Class

```python
class EntityExtractor:
    def __init__(self, mode: str = "spacy", model: str = "en_core_web_sm")

    # Extraction
    def extract(self, text: str) -> list
    def extract_file(self, filepath: str) -> list
    def extract_batch(self, folder: str) -> dict

    # Filtering
    def filter_entities(self, entities: list, types: list) -> list
    def get_unique_entities(self, entities: list) -> list
    def group_by_type(self, entities: list) -> dict

    # Analysis
    def entity_frequency(self, text: str) -> dict
    def find_relationships(self, text: str) -> list

    # Export
    def to_csv(self, entities: list, output: str) -> str
    def to_json(self, entities: list, output: str) -> str
    def highlight_text(self, text: str) -> str
```
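
As a rough, illustrative sketch of how these methods compose (method names come from the reference above, but the filenames are hypothetical and exact return shapes may differ):

```python
from entity_extractor import EntityExtractor

# Illustrative composition of the documented methods; filenames are
# placeholders and return shapes may differ from this sketch.
extractor = EntityExtractor(mode="spacy", model="en_core_web_sm")

entities = extractor.extract_file("report.txt")

# Keep only people and organizations, deduplicate, then export.
people_and_orgs = extractor.filter_entities(entities, ["PERSON", "ORG"])
unique = extractor.get_unique_entities(people_and_orgs)
extractor.to_json(unique, "report_entities.json")
```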

## Entity Types

### Standard Entity Types (spaCy)

| Type | Description | Example |
|------|-------------|---------|
| PERSON | People, including fictional | "Steve Jobs" |
| ORG | Companies, agencies, institutions | "Apple Inc." |
| GPE | Countries, cities, states | "California" |
| LOC | Non-GPE locations, mountains, water | "Pacific Ocean" |
| DATE | Dates, periods | "January 2024" |
| TIME | Times | "3:30 PM" |
| MONEY | Monetary values | "$1.5 million" |
| PERCENT | Percentages | "20%" |
| PRODUCT | Products | "iPhone" |
| EVENT | Events | "World Cup" |
| WORK_OF_ART | Books, songs, etc. | "The Great Gatsby" |
| LAW | Laws, regulations | "GDPR" |
| LANGUAGE | Languages | "English" |
| NORP | Nationalities, groups | "American" |

### Regex Mode Entities

Faster extraction with regex patterns:

| Type | Description |
|------|-------------|
| EMAIL | Email addresses |
| PHONE | Phone numbers |
| URL | Web URLs |
| DATE | Common date formats |
| MONEY | Currency amounts |
| PERCENTAGE | Percentages |
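
The skill's actual patterns are not shown here. As a minimal standalone sketch of what regex-mode extraction involves (the patterns below are simplified assumptions, not the skill's own):

```python
import re

# Simplified, illustrative patterns -- not the skill's actual regexes.
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "PHONE": r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}",
    "URL": r"https?://\S+",
    "DATE": r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",
    "MONEY": r"\$\d[\d,]*(?:\.\d+)?(?:\s?(?:million|billion))?",
    "PERCENTAGE": r"\d+(?:\.\d+)?%",
}

def regex_extract(text: str) -> list:
    """Return entity dicts in the same shape as the spaCy mode."""
    entities = []
    for etype, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            entities.append({
                "text": match.group(),
                "type": etype,
                "start": match.start(),
                "end": match.end(),
            })
    return sorted(entities, key=lambda e: e["start"])
```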

## Output Format

### Entity Result

```python
{
    "text": "Steve Jobs",
    "type": "PERSON",
    "start": 10,
    "end": 20,
    "confidence": 0.95
}
```

### Full Extraction Result

```python
{
    "text": "Original text...",
    "entities": [
        {"text": "Steve Jobs", "type": "PERSON", "start": 10, "end": 20},
        {"text": "Apple Inc.", "type": "ORG", "start": 30, "end": 40}
    ],
    "summary": {
        "total_entities": 5,
        "unique_entities": 4,
        "by_type": {
            "PERSON": 2,
            "ORG": 1,
            "GPE": 2
        }
    }
}
```

## Filtering and Grouping

### Filter by Type

```python
entities = extractor.extract(text)

# Get only people and organizations
filtered = extractor.filter_entities(entities, ["PERSON", "ORG"])
```

### Get Unique Entities

```python
# Remove duplicates, keep first occurrence
unique = extractor.get_unique_entities(entities)
```

### Group by Type

```python
grouped = extractor.group_by_type(entities)

# Returns:
{
    "PERSON": ["Steve Jobs", "Tim Cook"],
    "ORG": ["Apple Inc."],
    "GPE": ["California", "Cupertino"]
}
```

## Entity Frequency

```python
frequency = extractor.entity_frequency(text)

# Returns:
{
    "Steve Jobs": {"count": 5, "type": "PERSON"},
    "Apple": {"count": 8, "type": "ORG"},
    "California": {"count": 2, "type": "GPE"}
}
```
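
If you already have an entity list from `extract()`, a comparable frequency map can be built by hand. This is a sketch of the idea, not the skill's internal implementation:

```python
from collections import Counter

def count_entities(entities: list) -> dict:
    """Count mentions per entity text, keeping the first-seen type."""
    counts = Counter(e["text"] for e in entities)
    types = {}
    for e in entities:
        types.setdefault(e["text"], e["type"])
    return {
        text: {"count": count, "type": types[text]}
        for text, count in counts.items()
    }
```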

## Batch Processing

### Process Folder

```python
results = extractor.extract_batch("./documents/")

# Returns:
{
    "doc1.txt": {
        "entities": [...],
        "summary": {...}
    },
    "doc2.txt": {
        "entities": [...],
        "summary": {...}
    }
}
```

### Export to CSV

```python
extractor.to_csv(results, "entities.csv")

# Creates CSV with columns:
# filename, entity_text, entity_type, start, end
```
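
A rough sketch of how batch results could be flattened into that CSV schema with pandas, assuming the batch result shape shown above (the skill's own `to_csv` may differ in details):

```python
import pandas as pd

def batch_to_csv(results: dict, output: str) -> str:
    """Flatten {filename: {"entities": [...]}} into one row per entity."""
    rows = []
    for filename, data in results.items():
        for e in data["entities"]:
            rows.append({
                "filename": filename,
                "entity_text": e["text"],
                "entity_type": e["type"],
                "start": e["start"],
                "end": e["end"],
            })
    pd.DataFrame(rows).to_csv(output, index=False)
    return output
```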

## Text Highlighting

Generate HTML with highlighted entities:

```python
html = extractor.highlight_text(text)

# Returns HTML with colored spans for each entity type
```
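
The skill's own highlighter is not shown. A minimal sketch of the idea, using the character offsets from each entity dict and assuming non-overlapping spans:

```python
import html

def highlight(text: str, entities: list) -> str:
    """Wrap each entity span in a <span> tagged with its type."""
    out, cursor = [], 0
    for e in sorted(entities, key=lambda e: e["start"]):
        out.append(html.escape(text[cursor:e["start"]]))
        span = html.escape(text[e["start"]:e["end"]])
        out.append(f'<span class="entity {e["type"]}">{span}</span>')
        cursor = e["end"]
    out.append(html.escape(text[cursor:]))
    return "".join(out)
```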

## Example Workflows

### Document Analysis

```python
extractor = EntityExtractor()

# Analyze a document
with open("article.txt") as f:
    text = f.read()
result = extractor.extract(text)

# Get key people mentioned
people = extractor.filter_entities(result, ["PERSON"])
print(f"People mentioned: {len(people)}")

# Get frequency
freq = extractor.entity_frequency(text)
top_entities = sorted(freq.items(), key=lambda x: x[1]["count"], reverse=True)[:10]
```

### Contact Information Extraction

```python
extractor = EntityExtractor(mode="regex")

text = """
Contact John Smith at john.smith@example.com
or call (555) 123-4567.
"""

entities = extractor.extract(text)
# Finds: EMAIL, PHONE entities
```

### Content Tagging

```python
extractor = EntityExtractor()

articles = ["article1.txt", "article2.txt", "article3.txt"]
tags = {}

for article in articles:
    entities = extractor.extract_file(article)
    tags[article] = extractor.get_unique_entities(entities)
```

## Dependencies

- spacy>=3.7.0
- pandas>=2.0.0
- en_core_web_sm (spaCy model)

Note: Run `python -m spacy download en_core_web_sm` to install the model.

Overview

This skill extracts named entities (people, organizations, locations, dates, money, percentages, and more) from plain text using a mix of NLP and regex. It supports spaCy-backed high-accuracy extraction and a faster regex mode, plus batch processing, entity linking, exports, and HTML highlighting.

How this skill works

The extractor runs spaCy models to detect standard entity types, or, in regex mode, uses regex patterns for fast extraction of emails, phone numbers, URLs, and common date or money formats. It returns structured entity objects with text, type, character offsets, and confidence. Additional utilities group, filter, deduplicate, and summarize entities across single documents or batches, and export results to JSON or CSV.
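
For reference, the spaCy side of this is standard. The sketch below is not the skill's code; it shows how the same structured records can be produced directly with spaCy. Stock spaCy NER does not expose a per-entity confidence score, so that field is omitted here.

```python
import spacy

# Plain spaCy usage, shown for reference -- not the skill's own code.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976.")

entities = [
    {
        "text": ent.text,
        "type": ent.label_,
        "start": ent.start_char,
        "end": ent.end_char,
    }
    for ent in doc.ents
]
```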

When to use it

  • Document analysis to identify key people, organizations, places, and events
  • Information extraction for data enrichment or building knowledge graphs
  • Content tagging and metadata generation for search and CMS workflows
  • Batch processing of folders of documents for compliance or research
  • Quick contact extraction (emails, phones, URLs) using regex mode

Best practices

  • Use spaCy mode (default) for production accuracy; use regex mode for high-throughput or when spaCy isn’t available
  • Pre-clean text (remove boilerplate or OCR noise) to improve entity precision and reduce false positives
  • Filter by types after extraction to focus downstream processing on relevant entities
  • Run get_unique_entities before exporting to reduce duplicates and produce concise lists
  • Switch to a larger spaCy model or add custom entity patterns for domain-specific terms (one approach is sketched after this list)
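
For domain-specific terms, spaCy's EntityRuler is one standard way to add custom patterns. The sketch below uses plain spaCy and is an assumption about how you might extend the pipeline, not a documented option of this skill:

```python
import spacy

# Plain spaCy: add rule-based patterns ahead of the statistical NER.
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "Vision Pro"},
    {"label": "ORG", "pattern": [{"LOWER": "acme"}, {"LOWER": "corp"}]},
])

doc = nlp("Acme Corp announced the Vision Pro launch.")
print([(ent.text, ent.label_) for ent in doc.ents])
```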

Example use cases

  • Analyze news articles to surface frequently mentioned people, companies, and locations
  • Enrich CRM records by extracting contact details and organization names from documents
  • Batch-extract entities from research papers to build an index of authors, institutions, and events
  • Highlight entities in HTML for a content review dashboard or editorial tool
  • Export entity summaries to CSV or JSON for downstream analytics and reporting

FAQ

Which entity types are supported?

Standard spaCy types (PERSON, ORG, GPE, LOC, DATE, TIME, MONEY, PERCENT, PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE, NORP) plus regex-detected EMAIL, PHONE, URL, DATE, MONEY, PERCENTAGE.

How do I choose between spaCy and regex modes?

Use spaCy for accuracy and relationship extraction. Use regex when you need speed, lightweight deployments, or to extract structured contact info like emails and phones.