home / skills / benchflow-ai / skillsbench / filler-word-processing
/tasks/video-filler-word-remover/environment/skills/filler-word-processing
This skill analyzes filler word annotations to generate precise video edit segments for removing disfluencies and improving narration flow.
npx playbooks add skill benchflow-ai/skillsbench --skill filler-word-processingReview the files below or copy the command above to add this skill to your agents.
---
name: filler-word-processing
description: "Process filler word annotations to generate video edit lists. Use when working with timestamp annotations for removing speech disfluencies (um, uh, like, you know) from audio/video content."
---
# Filler Word Processing
## Annotation Format
Typical annotation JSON structure:
```json
[
{"word": "um", "timestamp": 12.5},
{"word": "like", "timestamp": 25.3},
{"word": "you know", "timestamp": 45.8}
]
```
## Converting Annotations to Cut Segments
Each filler word annotation marks when the word starts. To remove it, use word-specific durations since different fillers have different lengths:
```python
import json
# Word-specific durations (in seconds)
WORD_DURATIONS = {
"uh": 0.3,
"um": 0.4,
"hum": 0.6,
"hmm": 0.6,
"mhm": 0.55,
"like": 0.3,
"yeah": 0.35,
"so": 0.25,
"well": 0.35,
"okay": 0.4,
"basically": 0.55,
"you know": 0.55,
"i mean": 0.5,
"kind of": 0.5,
"i guess": 0.5,
}
DEFAULT_DURATION = 0.4
def annotations_to_segments(annotations_file, buffer=0.05):
"""
Convert filler word annotations to (start, end) cut segments.
Args:
annotations_file: Path to JSON annotations
buffer: Small buffer before the word (seconds)
Returns:
List of (start, end) tuples representing segments to remove
"""
with open(annotations_file) as f:
annotations = json.load(f)
segments = []
for ann in annotations:
word = ann.get('word', '').lower().strip()
timestamp = ann['timestamp']
# Use word-specific duration, fall back to default
word_duration = WORD_DURATIONS.get(word, DEFAULT_DURATION)
# Cut starts slightly before the word
start = max(0, timestamp - buffer)
# Cut ends after word duration
end = timestamp + word_duration
segments.append((start, end))
return segments
```
## Merging Overlapping Segments
When filler words are close together, merge their cut segments:
```python
def merge_overlapping_segments(segments, min_gap=0.1):
"""
Merge segments that overlap or are very close together.
Args:
segments: List of (start, end) tuples
min_gap: Minimum gap to keep segments separate
Returns:
Merged list of segments
"""
if not segments:
return []
# Sort by start time
sorted_segs = sorted(segments)
merged = [sorted_segs[0]]
for start, end in sorted_segs[1:]:
prev_start, prev_end = merged[-1]
# If this segment overlaps or is very close to previous
if start <= prev_end + min_gap:
# Extend the previous segment
merged[-1] = (prev_start, max(prev_end, end))
else:
merged.append((start, end))
return merged
```
## Complete Processing Pipeline
```python
def process_filler_annotations(annotations_file, word_duration=0.4):
"""Full pipeline: load annotations -> create segments -> merge overlaps"""
# Load and create initial segments
segments = annotations_to_segments(annotations_file, word_duration)
# Merge overlapping cuts
merged = merge_overlapping_segments(segments)
return merged
```
## Tuning Parameters
| Parameter | Typical Value | Notes |
|-----------|---------------|-------|
| word_duration | varies | Short fillers (um, uh) ~0.25-0.3s, single words (like, yeah) ~0.3-0.4s, phrases (you know, i mean) ~0.5-0.6s |
| buffer | 0.05s | Small buffer captures word onset |
| min_gap | 0.1s | Prevents micro-segments between close fillers |
### Word Duration Guidelines
| Category | Words | Duration |
|----------|-------|----------|
| Quick hesitations | uh, um | 0.3-0.4s |
| Sustained hums (drawn out while thinking) | hum, hmm, mhm | 0.55-0.6s |
| Quick single words | like, yeah, so, well | 0.25-0.35s |
| Longer single words | okay, basically | 0.4-0.55s |
| Multi-word phrases | you know, i mean, kind of, i guess | 0.5-0.55s |
## Quality Considerations
- **Too aggressive**: Cuts into adjacent words, sounds choppy
- **Too conservative**: Filler words partially audible
- **Sweet spot**: Clean cuts with natural-sounding result
Test with a few samples before processing full video.
This skill processes timestamped filler-word annotations to generate precise edit lists for removing speech disfluencies (um, uh, like, you know) from audio or video. It converts annotations into cut segments using word-specific durations, merges nearby cuts to avoid micro-edits, and outputs merged segments ready for batch editing tools.
The skill reads a JSON list of filler annotations (word + timestamp), maps each word to a duration (with a sensible default), and creates cut segments that start slightly before the detected onset and end after the estimated filler length. It then merges overlapping or very close segments using a configurable minimum gap so the resulting edit list produces natural-sounding transitions.
What if a filler word annotation is inaccurate by a few tenths of a second?
Use a small buffer before the timestamp and slightly longer word durations; merging nearby segments also helps cover misaligned detections.
How do I avoid cutting adjacent words and making audio choppy?
Reduce the duration and buffer values, increase min_gap to avoid merging separate words, and test on samples to find the sweet spot between aggressive and conservative cuts.