---
name: audio-transcriber
description: Extracts audio from dashcam MP4 files and produces GPU-accelerated timestamped transcripts with optional speaker diarization. This skill should be used when users request audio transcription from video files, mention dashcam audio/transcribe MP4/extract speech, want to analyze conversations from video footage, need timestamped transcripts with speaker identification, or ask to process video folders with audio extraction.
---
# Audio Transcriber
**Skill Type:** Media Processing & Analysis
**Domain:** Audio Transcription, Speech Recognition, GPU Acceleration
**Version:** 2.0
**Last Updated:** 2025-10-26
---
## Description
Extracts audio from dashcam MP4 files and produces GPU-accelerated timestamped transcripts with optional speaker diarization. Uses faster-whisper with CUDA for efficient processing, organizing outputs by date with comprehensive metadata and quality metrics.
**When to Use This Skill:**
- User requests audio transcription from video files
- User mentions "dashcam audio", "transcribe MP4", or "extract speech"
- User wants to analyze conversations from video footage
- User needs timestamped transcripts with speaker identification
- User asks to process video folders with audio extraction
---
## Quick Start
### User Trigger Phrases
- "Transcribe audio from my dashcam videos"
- "Extract and transcribe speech from [folder/date]"
- "Generate transcripts for [MP4 files/date range]"
- "Process dashcam audio with speaker identification"
- "Create subtitles from video files"
### Expected Inputs
1. **Video Folder Path** (required) - Path to MP4 files or date-organized folders
2. **Date Range** (optional) - Single day, range, or "all available"
3. **Output Directory** (optional) - Default: parallel to input with `_transcripts` suffix
4. **Processing Options** (optional) - Model size, formats, diarization, GPU settings
### Expected Outputs
- Audio extracts (WAV files) organized by date
- Transcripts in multiple formats (TXT, JSON, SRT, VTT)
- Global INDEX.csv with searchable segment metadata
- Results JSON with GPU metrics and processing statistics
- Quality reports with confidence scores and coverage
---
## Core Capabilities
### 1. User Input Acquisition (Section 0 Protocol)
**CRITICAL: Always follow the Section 0 protocol before processing.**
#### Step 1: Parse User Request
```
User: "Transcribe my dashcam videos from September 3rd"
AI Detection:
✅ Date detected: 2025-09-03
❌ Folder path not specified → Try auto-discovery
```
#### Step 2: Auto-Discovery
Search these locations in order (a discovery sketch follows the list):
1. `C:\Users\[user]\Desktop\CARDV\Movie_F\YYYYMMDD\*.MP4`
2. `G:\My Drive\PROJECTS\INVESTIGATION\DASHCAM\Movie_F\YYYYMMDD\*.MP4`
3. Current working directory + `\*.mp4`
4. User-provided explicit path
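That search order can be written as a short helper. A minimal sketch, assuming `pathlib`-based discovery and a `YYYYMMDD` date string; `auto_discover_videos`, its signature, and the exact roots are illustrative, not the skill's fixed API:
```python
from pathlib import Path

# Hypothetical discovery helper mirroring the search order above.
def auto_discover_videos(date_str: str, user_path: str | None = None):
    roots = [
        Path.home() / "Desktop" / "CARDV" / "Movie_F" / date_str,
        Path(r"G:\My Drive\PROJECTS\INVESTIGATION\DASHCAM\Movie_F") / date_str,
        Path.cwd(),
    ]
    if user_path:
        roots.append(Path(user_path))
    for root in roots:
        if root.is_dir():
            mp4s = sorted(root.glob("*.[mM][pP]4"))  # match .mp4 and .MP4
            if mp4s:
                return mp4s
    return []
```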
#### Step 3: Validate Files
```python
from pathlib import Path
import shutil

def validate_video_folder(folder_path):
    folder = Path(folder_path)
    if not folder.is_dir():                 # check folder exists
        raise FileNotFoundError(f"Folder not found: {folder}")
    mp4_files = sorted(folder.glob("*.[mM][pP]4"))  # count MP4 files
    if not mp4_files:
        raise ValueError(f"No MP4 files in {folder}")
    if shutil.which("ffmpeg") is None:      # verify FFmpeg installation
        raise EnvironmentError("FFmpeg not found on PATH")
    return mp4_files  # duration and disk-space estimates follow via ffprobe
```
#### Step 4: Prompt for Missing Inputs
If files not found:
```
🎥 I need video files to transcribe. I couldn't find MP4s in the expected location.
Please provide ONE of the following:
1. **Direct folder path**: e.g., `C:\Users\yousu\Desktop\CARDV\Movie_F\20250903`
2. **Parent directory**: I'll search for date subfolders
3. **Single video file**: Process just one MP4
💡 Tip: Dashcam folders are usually organized by date (YYYYMMDD)
```
#### Step 5: Configuration Summary & Confirmation
```
📋 **Transcription Configuration Summary**
**Input:**
📁 Folder: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903
📹 Videos found: 47 MP4 files
⏱️ Estimated audio: ~7.8 hours
💾 Estimated output size: ~450 MB (transcripts only)
**Processing:**
🖥️ GPU: NVIDIA GeForce RTX 4080 (detected)
🧠 Model: faster-whisper base (FP16, CUDA)
🎯 Segmentation: fixed 30s chunks
🗣️ Diarization: disabled (opt-in)
📄 Formats: txt, json, srt
**Output:**
💾 Audio extracts: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903\audio\
📄 Transcripts: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903\transcripts\
📊 INDEX.csv: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903\INDEX.csv
Ready to proceed? (Yes/No)
```
**NEVER begin processing without user confirmation.**
---
### 2. Audio Processing Pipeline
#### A. Audio Extraction (FFmpeg with Retry Matrix)
```bash
# Primary extraction command
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
# Retry sequence on failure:
# 1. Codec fallback: pcm_s16le → flac
# 2. Add demuxer args: -fflags +genpts -rw_timeout 30000000
# 3. Extended probe: -analyzeduration 100M -probesize 100M
```
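One way to drive that retry matrix is to walk a list of command variants until one succeeds. A sketch under those assumptions; `extract_audio_with_retry` is illustrative, and the exact flag combinations live in `scripts/audio_extractor.py` and may differ:
```python
import subprocess

def extract_audio_with_retry(video: str, out_stem: str) -> str:
    """Try the primary command, then each fallback from the retry matrix."""
    common = ["-vn", "-ar", "16000", "-ac", "1"]
    attempts = [
        # 1. Primary: 16 kHz mono PCM WAV
        (["ffmpeg", "-y", "-i", video, "-acodec", "pcm_s16le", *common],
         out_stem + ".wav"),
        # 2. Codec fallback: FLAC
        (["ffmpeg", "-y", "-i", video, "-acodec", "flac", *common],
         out_stem + ".flac"),
        # 3. Regenerated timestamps plus extended probing for damaged files
        (["ffmpeg", "-y", "-fflags", "+genpts", "-analyzeduration", "100M",
          "-probesize", "100M", "-i", video, "-acodec", "pcm_s16le", *common],
         out_stem + ".wav"),
    ]
    last_err = ""
    for cmd, out_path in attempts:
        result = subprocess.run([*cmd, out_path], capture_output=True, text=True)
        if result.returncode == 0:
            return out_path
        last_err = result.stderr
    raise RuntimeError(f"All extraction attempts failed: {last_err[-500:]}")
```
Note that the FLAC fallback changes the output extension, so downstream steps must accept either.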
**Quality Checks:**
- Verify an audio stream exists (ffprobe preflight; sketched after this list)
- Check duration matches video duration
- Detect silent/corrupted audio
- Log extraction errors to `_FAILED.json`
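The ffprobe preflight from the first check can be a single JSON query. A sketch; the helper name matches the pseudocode used later in Phase 2, but this body is an assumption:
```python
import json
import subprocess

# ffprobe preflight: True if the container exposes at least one audio stream.
def has_audio_stream(video: str) -> bool:
    cmd = ["ffprobe", "-v", "quiet", "-print_format", "json",
           "-show_streams", "-select_streams", "a", video]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return False  # treat unreadable containers as "no audio"
    return bool(json.loads(result.stdout).get("streams"))
```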
#### B. Segmentation (Two Modes)
**Fixed Mode (Default):**
- Split audio into 30-second chunks
- Predictable processing time
- No external VAD required
- Best for continuous speech
**VAD Mode (Advanced):**
- Use Silero VAD to detect speech regions
- Variable-length segments (2-60s)
- Skip long silences
- Best for sparse audio (parking mode)
**Mutual Exclusion:** Only one mode is active at a time. A fixed-mode chunking sketch follows.
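Fixed-mode chunking needs no model at all. A minimal sketch, assuming the audio has already been decoded to a 16 kHz mono NumPy array; `segment_fixed` here is illustrative:
```python
import numpy as np

# Slice a decoded waveform into fixed-length chunks (default 30 s at 16 kHz).
def segment_fixed(samples: np.ndarray, sample_rate: int = 16000,
                  chunk_sec: int = 30) -> list[np.ndarray]:
    step = sample_rate * chunk_sec
    return [samples[i:i + step] for i in range(0, len(samples), step)]
```
The final chunk is simply shorter than `chunk_sec`; no padding is required before transcription.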
#### C. GPU Transcription (faster-whisper)
```python
from faster_whisper import WhisperModel

# Load the model once with GPU optimization
model = WhisperModel(
    "base",
    device="cuda",
    compute_type="float16"
)

# Transcribe with word-level timestamps
segments, info = model.transcribe(
    audio_path,
    beam_size=5,
    word_timestamps=True,
    vad_filter=True
)
```
**GPU Metrics Captured** (a sampling sketch follows this list):
- Device name, VRAM, utilization
- CUDA version, driver version
- Average GPU % during run (sampled at 1-2 Hz)
- Memory usage peaks
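The utilization figures above can be collected with the NVML Python bindings. A sampling sketch, assuming a single GPU at index 0; in the real pipeline this loop would run in a background thread for the duration of the job, and note that NVML's `memory` figure is memory-controller utilization, not VRAM fill:
```python
import time

import pynvml

def sample_gpu(duration_sec: float, hz: float = 2.0) -> dict:
    """Poll GPU utilization at 1-2 Hz and return run averages."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    gpu_samples, mem_samples = [], []
    deadline = time.time() + duration_sec
    while time.time() < deadline:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        gpu_samples.append(util.gpu)     # percent of time the GPU was busy
        mem_samples.append(util.memory)  # memory-controller utilization
        time.sleep(1.0 / hz)
    pynvml.nvmlShutdown()
    n = max(len(gpu_samples), 1)
    return {"gpu_pct": sum(gpu_samples) / n,
            "mem_pct": sum(mem_samples) / n,
            "sampling_hz": hz}
```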
#### D. Speaker Diarization (Optional)
**Backends:**
- **pyannote**: State-of-the-art (requires HF token + VRAM)
- **speechbrain**: Good performance (no auth required)
**Label Normalization:**
- Different backends → unified `spkA`, `spkB`, etc. (see the sketch below)
- Consistent across INDEX.csv and JSON outputs
**Fallback Behavior:**
- If HF token missing → skip diarization, log warning
- If OOM error → disable diarization, continue transcription
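In practice normalization reduces to a first-seen mapping from raw backend labels to the unified scheme. A minimal sketch; the `SPEAKER_00` input format is pyannote's convention, and `normalize_speakers` is an illustrative helper rather than the skill's actual API:
```python
import string

# Map raw backend labels (e.g. pyannote's "SPEAKER_00") to spkA, spkB, ...
# in order of first appearance, so labels stay stable across output formats.
def normalize_speakers(raw_labels: list[str]) -> dict[str, str]:
    mapping: dict[str, str] = {}
    for label in raw_labels:
        if label not in mapping:
            mapping[label] = "spk" + string.ascii_uppercase[len(mapping)]
    return mapping
```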
---
### 3. Output Generation
#### A. File Organization (Per-Day Structure)
```
C:\Users\yousu\Desktop\CARDV\Movie_F\
└── 20250903\
    ├── audio\
    │   ├── 20250903133516_059495B.wav
    │   ├── 20250903134120_059496B.wav
    │   └── ... (47 files)
    ├── transcripts\
    │   ├── 20250903133516_059495B.txt
    │   ├── 20250903133516_059495B.json
    │   ├── 20250903133516_059495B.srt
    │   └── ... (47 × 3 = 141 files)
    └── INDEX.csv
```
#### B. Format Details
**TXT (Plain Text):**
```
[00:00:15] Speaker A: Hey, where are we going?
[00:00:18] Speaker B: Just heading to the mall.
[00:00:22] Speaker A: Okay, sounds good.
```
**JSON (Complete Metadata):**
```json
{
  "video_file": "20250903133516_059495B.MP4",
  "audio_duration_sec": 60,
  "language": "en",
  "language_confidence": 0.95,
  "segments": [
    {
      "start": 15.2,
      "end": 17.8,
      "text": "Hey, where are we going?",
      "confidence": 0.89,
      "speaker": "spkA",
      "words": [
        {"word": "Hey", "start": 15.2, "end": 15.4, "confidence": 0.92},
        {"word": "where", "start": 15.5, "end": 15.8, "confidence": 0.88}
      ]
    }
  ]
}
```
**SRT (SubRip Subtitles):**
```
1
00:00:15,200 --> 00:00:17,800
[spkA] Hey, where are we going?

2
00:00:18,000 --> 00:00:20,500
[spkB] Just heading to the mall.
```
**VTT (WebVTT):**
```
WEBVTT

00:00:15.200 --> 00:00:17.800
<v spkA>Hey, where are we going?

00:00:18.000 --> 00:00:20.500
<v spkB>Just heading to the mall.
```
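The only difference between the two timecode dialects is the millisecond separator. A helper sketch, illustrative rather than the skill's actual writer code:
```python
# SRT uses a comma before the milliseconds field; WebVTT uses a period.
def format_timecode(seconds: float, vtt: bool = False) -> str:
    ms = round(seconds * 1000)
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    sep = "." if vtt else ","
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"
```
For example, `format_timecode(15.2)` returns `00:00:15,200`, and `format_timecode(15.2, vtt=True)` returns `00:00:15.200`.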
#### C. INDEX.csv (Global Search Index)
Composite key: `(video_rel, seg_idx)`. An append sketch follows the table.
| Column | Description |
|--------|-------------|
| `dataset` | Movie_F / Movie_R / Park_F / Park_R |
| `date` | YYYYMMDD |
| `video_rel` | Relative path from root |
| `video_stem` | Filename without extension |
| `seg_idx` | 0-based segment index |
| `ts_start_ms` | Segment start milliseconds |
| `ts_end_ms` | Segment end milliseconds |
| `text` | Transcript text (truncated to 512 chars) |
| `text_len` | Full text length |
| `lang` | ISO language code |
| `lang_conf` | Language detection confidence |
| `conf_avg` | Average token confidence |
| `speaker` | Normalized speaker label |
| `format_mask` | Files generated (txt/json/srt/vtt) |
| `transcript_file` | Basename |
| `audio_file` | Basename |
| `engine` | e.g., `faster-whisper:base:fp16` |
| `cuda_version` | CUDA version |
| `driver_version` | Driver version |
| `created_utc` | ISO 8601 timestamp |
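Appending rows is a plain `csv.DictWriter` job. A sketch, assuming `INDEX_COLUMNS` mirrors the table above; the helper name matches the Phase 4 pseudocode, but this body is an assumption:
```python
import csv
from pathlib import Path

# Column order mirrors the table above.
INDEX_COLUMNS = [
    "dataset", "date", "video_rel", "video_stem", "seg_idx",
    "ts_start_ms", "ts_end_ms", "text", "text_len", "lang", "lang_conf",
    "conf_avg", "speaker", "format_mask", "transcript_file", "audio_file",
    "engine", "cuda_version", "driver_version", "created_utc",
]

def append_to_index(row: dict, index_csv: Path) -> None:
    """Append one segment row, writing the header on first use."""
    is_new = not index_csv.exists()
    row = {**row, "text": str(row.get("text", ""))[:512]}  # 512-char rule
    with index_csv.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=INDEX_COLUMNS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```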
#### D. Results JSON (Single Source of Truth)
```json
{
  "status": "ok",
  "summary": {
    "videos_processed": 47,
    "segments": 1847,
    "hours_audio": 7.8,
    "gpu_detected": true,
    "device_count": 1,
    "devices": [
      {
        "index": 0,
        "name": "NVIDIA GeForce RTX 4080",
        "total_mem_mb": 16384,
        "free_mem_mb": 14200
      }
    ],
    "utilization": {
      "gpu_pct": 35,
      "mem_pct": 42,
      "sampling_hz": 2
    },
    "cuda_version": "12.1",
    "driver_version": "546.01",
    "torch_version": "2.2.0+cu121",
    "errors": 0,
    "failed_files": []
  },
  "artifacts": {
    "index_csv": "C:\\Users\\yousu\\Desktop\\CARDV\\Movie_F\\20250903\\INDEX.csv",
    "output_dir": "C:\\Users\\yousu\\Desktop\\CARDV\\Movie_F\\20250903\\transcripts"
  }
}
```
---
### 4. Quality & Error Handling
#### A. Resume Safety
- Skip existing transcripts unless the `--force` flag is set (see the sketch after this list)
- Idempotent: re-running is safe
- Checkpoint support for long runs
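The skip check reduces to an existence test over the requested formats. A minimal sketch; `should_skip` is an illustrative helper:
```python
from pathlib import Path

# Skip a video when every requested transcript artifact already exists,
# unless --force is set; this is what makes re-runs idempotent.
def should_skip(video_stem: str, out_dir: Path,
                formats: list[str], force: bool = False) -> bool:
    if force:
        return False
    return all((out_dir / f"{video_stem}.{ext}").exists() for ext in formats)
```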
#### B. Error Types & Recovery
**Per-Video Failures** (`{video_stem}_FAILED.json`):
```json
{
"video_path": "C:\\...\\video.mp4",
"error_type": "ffmpeg_err",
"error_message": "Failed to decode audio stream",
"ffprobe_metadata": {"duration": null, "codec": "h264"},
"timestamp": "2025-09-03T14:30:00Z"
}
```
Error types:
- `ffmpeg_err`: Audio extraction failed
- `decode_err`: Whisper decode failed
- `OOM`: Out of GPU memory
- `corrupted`: Container/stream corrupted
- `no_audio`: No audio stream detected
#### C. SRT/VTT Validation
- Strictly monotonic timestamps
- No overlapping segments
- Clamp gaps shorter than 50 ms (see the sketch after this list)
- Proper timecode formatting (comma for SRT, period for VTT)
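A compact validation pass can enforce the first three rules at once. A sketch over cues keyed by `start_ms`/`end_ms`, which is an assumed intermediate shape rather than the writer's real data model:
```python
# Enforce strictly monotonic, non-overlapping cues; gaps under 50 ms are
# clamped by pulling the next start back to the previous end.
def clamp_and_validate(cues: list[dict], min_gap_ms: int = 50) -> list[dict]:
    fixed = []
    prev_end = None
    for cue in sorted(cues, key=lambda c: c["start_ms"]):
        start, end = cue["start_ms"], cue["end_ms"]
        if end <= start:
            raise ValueError(f"Non-positive duration at {start} ms")
        if prev_end is not None:
            if start < prev_end:
                start = prev_end          # remove overlap
            elif start - prev_end < min_gap_ms:
                start = prev_end          # clamp a sub-50 ms gap
        fixed.append({**cue, "start_ms": start, "end_ms": end})
        prev_end = end
    return fixed
```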
---
## Implementation Guide
### Phase 1: Input Acquisition
```python
# 1. Parse the user request
inputs = parse_user_request(user_message)

# 2. Auto-discover video files
if not inputs['video_folder']:
    inputs['video_folder'] = auto_discover_videos()

# 3. Validate inputs
validate_video_folder(inputs['video_folder'])
check_ffmpeg_available()
check_gpu_available()

# 4. Estimate resource requirements
estimate_processing_time(inputs)
estimate_disk_space(inputs)

# 5. Present configuration summary
show_configuration_summary(inputs)

# 6. Wait for confirmation
if not user_confirms():
    return  # Do not proceed
```
### Phase 2: Audio Extraction
```python
for video_file in video_files:
    # FFprobe preflight check
    metadata = ffprobe(video_file)
    if not has_audio_stream(metadata):
        log_failed(video_file, "no_audio")
        continue

    # Extract audio with retry
    try:
        audio_path = extract_audio_ffmpeg(
            video_file,
            output_dir=audio_output_dir,
            sample_rate=16000,
            channels=1
        )
    except FFmpegError as e:
        # Retry with fallback codec
        audio_path = extract_audio_ffmpeg_retry(video_file)
```
### Phase 3: Transcription
```python
# Load the model once (reuse it for all files)
model = load_whisper_model(
    model_size="base",
    device="cuda",
    compute_type="float16"
)

for audio_file in audio_files:
    # Segment audio
    if segmentation_mode == "fixed":
        chunks = segment_fixed(audio_file, chunk_size=30)
    else:
        chunks = segment_vad(audio_file, vad_model)

    # Transcribe each chunk (faster-whisper returns a segment generator plus info)
    all_segments = []
    for chunk in chunks:
        segments, info = model.transcribe(chunk)
        all_segments.extend(segments)

    # Optional: diarization
    if diarization_enabled:
        all_segments = apply_diarization(audio_file, all_segments)
```
### Phase 4: Output Generation
```python
# Generate all formats
for video_file, segments in results.items():
    # TXT
    write_txt(segments, output_dir)
    # JSON
    write_json(segments, metadata, output_dir)
    # SRT
    srt_content = generate_srt(segments)
    validate_srt_monotonic(srt_content)
    write_srt(srt_content, output_dir)
    # VTT (optional)
    write_vtt(segments, output_dir)
    # Update INDEX.csv
    append_to_index(segments, index_csv_path)
```
### Phase 5: Completion Report
```python
# Generate the results JSON
results_json = {
    "status": "ok",
    "summary": collect_statistics(),
    "artifacts": list_output_files(),
    "gpu_metrics": get_gpu_metrics()
}

# Save to file
save_results_json(results_json, output_dir)

# Report to user
print(f"✅ Complete! Processed {video_count} videos")
print(f"   Transcripts: {output_dir}")
print(f"   INDEX: {index_csv_path}")
print(f"   GPU Util: {avg_gpu_pct}%")
```
---
## Reference Materials
### In This Skill
- **SKILL_MANIFEST.md** - Complete technical specification (v2.0)
- **references/TECHNICAL_SPECIFICATION.md** - Detailed implementation rules
- **scripts/batch_transcriber.py** - Main batch processing script
- **scripts/audio_extractor.py** - FFmpeg wrapper with retry logic
- **scripts/transcriber.py** - Whisper transcription engine
- **scripts/diarizer.py** - Speaker diarization integration
- **scripts/format_writers.py** - TXT/JSON/SRT/VTT generators
- **scripts/gpu_monitor.py** - GPU metrics collection
- **scripts/validation.py** - Input validation and checks
- **assets/config_template.json** - Default configuration
- **assets/params.json** - Tunable parameters
### External Documentation
- faster-whisper documentation
- FFmpeg audio processing guide
- pyannote.audio diarization guide
- SRT/VTT subtitle format specifications
---
## Tunable Parameters
```json
{
  "whisper": {
    "model_size": "base",
    "device": "cuda",
    "compute_type": "float16",
    "batch_size": 8,
    "beam_size": 5,
    "language": "en",
    "detect_language": false
  },
  "audio": {
    "sample_rate": 16000,
    "channels": 1,
    "format": "wav",
    "keep_intermediate": false
  },
  "segmentation": {
    "mode": "fixed",
    "chunk_length_sec": 30,
    "vad_min_len_sec": 2,
    "vad_max_len_sec": 60
  },
  "diarization": {
    "enabled": false,
    "backend": "pyannote",
    "min_speakers": 1,
    "max_speakers": 10
  },
  "output": {
    "formats": ["txt", "json", "srt"],
    "text_truncate_csv": 512
  },
  "parallel": {
    "max_workers": 3
  }
}
```
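A plausible way to apply user overrides on top of these defaults is a shallow per-section merge. A sketch, assuming `assets/params.json` carries the overrides; the merge depth and helper name are assumptions:
```python
import json
from pathlib import Path

# Shallow per-section merge: user values win, defaults fill the gaps.
def load_params(defaults: dict, params_path: Path) -> dict:
    merged = {section: dict(values) for section, values in defaults.items()}
    if params_path.exists():
        overrides = json.loads(params_path.read_text(encoding="utf-8"))
        for section, values in overrides.items():
            merged.setdefault(section, {}).update(values)
    return merged
```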
---
## Common Issues & Solutions
### Issue 1: "GPU not detected"
**Cause:** CUDA not installed or incompatible driver
**Solution:**
1. Check: `python -c "import torch; print(torch.cuda.is_available())"`
2. Install/update CUDA toolkit
3. Fall back to CPU: `--device cpu` (a device-selection sketch follows)
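The fallback can be decided up front. A sketch using `torch.cuda.is_available()`; pairing CPU with `int8` compute is a common faster-whisper setting, not a requirement:
```python
import torch

# Prefer CUDA with FP16 when available; otherwise CPU with int8 compute.
def pick_device() -> tuple[str, str]:
    if torch.cuda.is_available():
        return "cuda", "float16"
    return "cpu", "int8"
```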
### Issue 2: "FFmpeg command failed"
**Cause:** FFmpeg not in PATH or unsupported codec
**Solution:**
1. Verify: `ffmpeg -version`
2. Install from ffmpeg.org
3. Use retry matrix with codec fallback
### Issue 3: "Out of memory (OOM)"
**Cause:** GPU VRAM insufficient for model + batch size
**Solution:**
1. Use smaller model: `tiny` or `small`
2. Reduce batch size: `--batch 4`
3. Process fewer files in parallel
### Issue 4: "Diarization failed"
**Cause:** HF token missing or network error
**Solution:**
1. Set token: `export HF_TOKEN=hf_...`
2. Accept pyannote license on HuggingFace
3. Disable diarization: `--no-diarize`
### Issue 5: "SRT validation errors"
**Cause:** Overlapping timestamps or malformed timecodes
**Solution:**
1. Enable timestamp clamping: `--clamp-gaps`
2. Check for negative durations
3. Validate with subtitle validator tool
---
## Security & Privacy Notes
### Data Sensitivity
Dashcam audio may contain:
- Personal conversations
- Addresses and locations
- Phone numbers and names
- Private information
### Processing Guidelines
1. **Local Processing Only** - Never upload audio to external services
2. **Secure Storage** - Encrypt transcripts if sharing devices
3. **Redaction** - Use the `--redact` flag for PII patterns (phone, email); a regex sketch follows this list
4. **Retention** - Delete audio extracts after transcription if not needed
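The `--redact` pass could be regex-driven. A hypothetical sketch covering only US-style phone numbers and email addresses; real PII coverage would need a broader pattern set:
```python
import re

# Hypothetical patterns for the --redact pass; illustrative only.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL]"),
]

def redact(text: str) -> str:
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```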
### Investigation Use
- Designed for legitimate personal data analysis
- NOT an anti-forensics tool
- All conclusions require independent corroboration
---
## Skill Invocation
This skill is invoked when the model detects:
1. User mentions "transcribe audio", "dashcam transcription", or "extract speech"
2. User requests processing of video/MP4 files for audio content
3. User provides paths to video folders
4. User asks for subtitles or timestamped transcripts
---
## Success Criteria
A successful audio transcription must:
✅ Obtain all required inputs from the user (video folder, output preferences)
✅ Validate all inputs before processing (files exist, FFmpeg available, GPU detected)
✅ Present a configuration summary and get confirmation
✅ Extract audio successfully (or log failures)
✅ Transcribe with GPU acceleration (or CPU fallback)
✅ Generate all requested formats (TXT, JSON, SRT, VTT)
✅ Create INDEX.csv with searchable metadata
✅ Include GPU metrics in the results JSON
✅ Report output locations to the user
**Key Principle:** Never guess critical inputs. Always validate, confirm, and provide clear feedback.
---
## Version History
- **v2.0** (2025-10-26) - Production-ready skill with GPU-first architecture
- **v1.5** (2025-10-25) - Added diarization support and retry matrix
- **v1.0** (2025-10-20) - Initial release with basic transcription
---
**Last Updated:** 2025-10-26
**Status:** Production Ready
**Maintained By:** Audio Transcription Pipeline Project
---
## FAQ
**Do you start processing immediately after the user request?**
No. The skill always presents a configuration summary and requires explicit confirmation before any processing begins.
**What happens if a video has no audio or extraction fails?**
The tool logs a `{video_stem}_FAILED.json` with an `error_type` (e.g., `no_audio`, `ffmpeg_err`) and continues; failed files are listed in the final results JSON.
**Is a GPU required?**
No. A GPU is recommended for speed and is used when available; CPU-only runs work but are much slower. GPU metrics are captured in the results JSON.