
This skill analyzes filler word annotations to generate precise video edit segments for removing disfluencies and improving narration flow.

```
npx playbooks add skill benchflow-ai/skillsbench --skill filler-word-processing
```

---
name: filler-word-processing
description: "Process filler word annotations to generate video edit lists. Use when working with timestamp annotations for removing speech disfluencies (um, uh, like, you know) from audio/video content."
---

# Filler Word Processing

## Annotation Format

Typical annotation JSON structure:
```json
[
  {"word": "um", "timestamp": 12.5},
  {"word": "like", "timestamp": 25.3},
  {"word": "you know", "timestamp": 45.8}
]
```
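
Annotations come from different tools (ASR output, manual tagging), so a quick validation pass helps before generating cuts. A minimal sketch, assuming the field names shown above:

```python
import json

def load_annotations(path):
    """Load filler annotations and sanity-check the expected fields."""
    with open(path) as f:
        annotations = json.load(f)
    for i, ann in enumerate(annotations):
        # Each entry needs a word and a non-negative numeric timestamp
        if "word" not in ann or "timestamp" not in ann:
            raise ValueError(f"annotation {i} missing 'word' or 'timestamp': {ann}")
        if not isinstance(ann["timestamp"], (int, float)) or ann["timestamp"] < 0:
            raise ValueError(f"annotation {i} has an invalid timestamp: {ann}")
    return annotations
```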

## Converting Annotations to Cut Segments

Each filler word annotation marks when the word starts. To remove it, use word-specific durations since different fillers have different lengths:

```python
import json

# Word-specific durations (in seconds)
WORD_DURATIONS = {
    "uh": 0.3,
    "um": 0.4,
    "hum": 0.6,
    "hmm": 0.6,
    "mhm": 0.55,
    "like": 0.3,
    "yeah": 0.35,
    "so": 0.25,
    "well": 0.35,
    "okay": 0.4,
    "basically": 0.55,
    "you know": 0.55,
    "i mean": 0.5,
    "kind of": 0.5,
    "i guess": 0.5,
}
DEFAULT_DURATION = 0.4

def annotations_to_segments(annotations_file, buffer=0.05):
    """
    Convert filler word annotations to (start, end) cut segments.

    Args:
        annotations_file: Path to JSON annotations
        buffer: Small buffer before the word (seconds)

    Returns:
        List of (start, end) tuples representing segments to remove
    """
    with open(annotations_file) as f:
        annotations = json.load(f)

    segments = []
    for ann in annotations:
        word = ann.get('word', '').lower().strip()
        timestamp = ann['timestamp']
        # Use word-specific duration, fall back to default
        word_duration = WORD_DURATIONS.get(word, DEFAULT_DURATION)
        # Cut starts slightly before the word
        start = max(0, timestamp - buffer)
        # Cut ends after word duration
        end = timestamp + word_duration
        segments.append((start, end))

    return segments
```
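
For example, running the conversion on the sample annotations above (saved as `annotations.json`, a hypothetical path) yields one cut per filler:

```python
segments = annotations_to_segments("annotations.json")
for start, end in segments:
    print(f"cut {start:.2f}s - {end:.2f}s")
# cut 12.45s - 12.90s   ("um": 0.4s duration + 0.05s buffer)
# cut 25.25s - 25.60s   ("like": 0.3s duration)
# cut 45.75s - 46.35s   ("you know": 0.55s duration)
```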

## Merging Overlapping Segments

When filler words are close together, merge their cut segments:

```python
def merge_overlapping_segments(segments, min_gap=0.1):
    """
    Merge segments that overlap or are very close together.

    Args:
        segments: List of (start, end) tuples
        min_gap: Minimum gap to keep segments separate

    Returns:
        Merged list of segments
    """
    if not segments:
        return []

    # Sort by start time
    sorted_segs = sorted(segments)
    merged = [sorted_segs[0]]

    for start, end in sorted_segs[1:]:
        prev_start, prev_end = merged[-1]

        # If this segment overlaps or is very close to previous
        if start <= prev_end + min_gap:
            # Extend the previous segment
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))

    return merged
```
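
For instance, two fillers only 0.05s apart collapse into a single cut, while a distant one stays separate:

```python
segments = [(12.45, 12.90), (12.95, 13.30), (25.25, 25.60)]
print(merge_overlapping_segments(segments))
# [(12.45, 13.3), (25.25, 25.6)]  -- first two merged (gap 0.05s < min_gap 0.1s)
```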

## Complete Processing Pipeline

```python
def process_filler_annotations(annotations_file, buffer=0.05, min_gap=0.1):
    """Full pipeline: load annotations -> create cut segments -> merge overlaps."""

    # Create initial cut segments; per-word durations come from WORD_DURATIONS
    segments = annotations_to_segments(annotations_file, buffer)

    # Merge overlapping or near-adjacent cuts
    merged = merge_overlapping_segments(segments, min_gap)

    return merged
```
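
Editing tools often want the intervals to *keep* rather than the cuts to remove. A small sketch that inverts the merged cut list, assuming you know the media duration (e.g. from ffprobe):

```python
def cuts_to_keep_intervals(cut_segments, total_duration):
    """Invert merged, sorted cut segments into (start, end) intervals to keep."""
    keep = []
    cursor = 0.0
    for start, end in cut_segments:
        if start > cursor:
            keep.append((cursor, start))  # gap before this cut is kept
        cursor = max(cursor, end)
    if cursor < total_duration:
        keep.append((cursor, total_duration))  # tail after the last cut
    return keep
```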

## Tuning Parameters

| Parameter | Typical Value | Notes |
|-----------|---------------|-------|
| WORD_DURATIONS | varies by word | Hesitations (um, uh) ~0.3-0.4s, quick single words (like, yeah) ~0.25-0.35s, phrases (you know, i mean) ~0.5-0.55s |
| buffer | 0.05s | Small buffer captures word onset |
| min_gap | 0.1s | Prevents micro-segments between close fillers |

### Word Duration Guidelines

| Category | Words | Duration |
|----------|-------|----------|
| Quick hesitations | uh, um | 0.3-0.4s |
| Sustained hums (drawn out while thinking) | hum, hmm, mhm | 0.55-0.6s |
| Quick single words | like, yeah, so, well | 0.25-0.35s |
| Longer single words | okay, basically | 0.4-0.55s |
| Multi-word phrases | you know, i mean, kind of, i guess | 0.5-0.55s |

## Quality Considerations

- **Too aggressive**: Cuts into adjacent words, sounds choppy
- **Too conservative**: Filler words partially audible
- **Sweet spot**: Clean cuts with natural-sounding result

Test with a few samples before processing full video.
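
One way to run that spot check is to render only the first minute with the cuts applied and judge the durations by ear. A sketch using FFmpeg's select/aselect filters (it assumes `ffmpeg` is on PATH and `keep_intervals` comes from `cuts_to_keep_intervals` above):

```python
import subprocess

def render_preview(input_path, keep_intervals, output_path="preview.mp4", limit=60):
    """Render a short preview with cuts applied for listening tests."""
    expr = "+".join(f"between(t,{s:.3f},{e:.3f})" for s, e in keep_intervals)
    cmd = [
        "ffmpeg", "-y", "-i", input_path,
        # Keep only frames/samples inside the keep intervals, then re-time them
        "-vf", f"select='{expr}',setpts=N/FRAME_RATE/TB",
        "-af", f"aselect='{expr}',asetpts=N/SR/TB",
        "-t", str(limit),  # limit output length for a quick preview
        output_path,
    ]
    subprocess.run(cmd, check=True)
```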

## Overview

This skill processes timestamped filler-word annotations to generate precise edit lists for removing speech disfluencies (um, uh, like, you know) from audio or video. It converts annotations into cut segments using word-specific durations, merges nearby cuts to avoid micro-edits, and outputs merged segments ready for batch editing tools.

## How this skill works

The skill reads a JSON list of filler annotations (word + timestamp), maps each word to a duration (with a sensible default), and creates cut segments that start slightly before the detected onset and end after the estimated filler length. It then merges overlapping or very close segments using a configurable minimum gap so the resulting edit list produces natural-sounding transitions.

## When to use it

- Cleaning recorded interviews, podcasts, or lectures to remove filler words.
- Pre-processing raw video for polished social clips and highlights.
- Automating batch edits when you have timestamped filler annotations from speech recognition or manual tagging.
- Reducing manual editing time when many short disfluencies are present.

## Best practices

- Use word-specific durations rather than a single fixed length to avoid over- or under-cutting.
- Keep a small buffer (e.g., 0.05s) before the timestamp to capture the onset without chopping preceding words.
- Merge segments with a min_gap (e.g., 0.1s) to prevent rapid back-to-back cuts that sound unnatural.
- Test settings on short samples: compare aggressive vs. conservative durations and adjust per speaker and style.
- When available, refine durations for speakers with faster or slower filler pronunciations, as in the sketch below.
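
A minimal sketch of that per-speaker refinement, scaling the base table by a rate factor (the speaker names and factors here are hypothetical placeholders to calibrate by ear):

```python
# Hypothetical calibration: <1.0 = speaker's fillers run shorter, >1.0 = longer
SPEAKER_RATE = {"host": 0.85, "guest": 1.15}

def duration_for(word, speaker=None):
    """Word duration adjusted for a speaker's typical filler length."""
    base = WORD_DURATIONS.get(word, DEFAULT_DURATION)
    return base * SPEAKER_RATE.get(speaker, 1.0)
```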

## Example use cases

- Convert filler annotations from an automatic speech recognizer into (start, end) cut lists for an NLE or FFmpeg batch script.
- Post-process manual timestamps from a transcript editor to create merged cuts for podcast episodes.
- Automatically remove clustered fillers in a live-recorded talk, merging overlapping segments to preserve flow.
- Create a QA step that runs small sample edits to confirm chosen durations and min_gap before applying them to full footage.

## FAQ

### What if a filler word annotation is inaccurate by a few tenths of a second?

Use a small buffer before the timestamp and slightly longer word durations; merging nearby segments also helps cover misaligned detections.

### How do I avoid cutting adjacent words and making audio choppy?

Reduce the duration and buffer values, decrease min_gap so nearby but distinct words are not merged into one long cut, and test on samples to find the sweet spot between aggressive and conservative cuts.