
Audio-Transcriber skill

/Dashcam/Audio-Transcriber

This skill transcribes dashcam video audio with GPU acceleration, producing timestamped transcripts and optional speaker diarization for quick analysis.

npx playbooks add skill yousufjoyian/claude-skills --skill audio-transcriber

---
name: audio-transcriber
description: Extracts audio from dashcam MP4 files and produces GPU-accelerated timestamped transcripts with optional speaker diarization. This skill should be used when users request audio transcription from video files, mention dashcam audio/transcribe MP4/extract speech, want to analyze conversations from video footage, need timestamped transcripts with speaker identification, or ask to process video folders with audio extraction.
---

# Audio Transcriber

**Skill Type:** Media Processing & Analysis
**Domain:** Audio Transcription, Speech Recognition, GPU Acceleration
**Version:** 2.0
**Last Updated:** 2025-10-26

---

## Description

Extracts audio from dashcam MP4 files and produces GPU-accelerated timestamped transcripts with optional speaker diarization. Uses faster-whisper with CUDA for efficient processing, organizing outputs by date with comprehensive metadata and quality metrics.

**When to Use This Skill:**
- User requests audio transcription from video files
- User mentions "dashcam audio", "transcribe MP4", or "extract speech"
- User wants to analyze conversations from video footage
- User needs timestamped transcripts with speaker identification
- User asks to process video folders with audio extraction

---

## Quick Start

### User Trigger Phrases
- "Transcribe audio from my dashcam videos"
- "Extract and transcribe speech from [folder/date]"
- "Generate transcripts for [MP4 files/date range]"
- "Process dashcam audio with speaker identification"
- "Create subtitles from video files"

### Expected Inputs
1. **Video Folder Path** (required) - Path to MP4 files or date-organized folders
2. **Date Range** (optional) - Single day, range, or "all available"
3. **Output Directory** (optional) - Default: parallel to input with `_transcripts` suffix
4. **Processing Options** (optional) - Model size, formats, diarization, GPU settings

### Expected Outputs
- Audio extracts (WAV files) organized by date
- Transcripts in multiple formats (TXT, JSON, SRT, VTT)
- Global INDEX.csv with searchable segment metadata
- Results JSON with GPU metrics and processing statistics
- Quality reports with confidence scores and coverage

---

## Core Capabilities

### 1. User Input Acquisition (Section 0 Protocol)

**CRITICAL: Always follow the Section 0 protocol before processing.**

#### Step 1: Parse User Request
```
User: "Transcribe my dashcam videos from September 3rd"

AI Detection:
āœ… Date detected: 2025-09-03
āŒ Folder path not specified → Try auto-discovery
```

#### Step 2: Auto-Discovery
Search these locations in order (a minimal discovery sketch follows the list):
1. `C:\Users\[user]\Desktop\CARDV\Movie_F\YYYYMMDD\*.MP4`
2. `G:\My Drive\PROJECTS\INVESTIGATION\DASHCAM\Movie_F\YYYYMMDD\*.MP4`
3. Current working directory + `\*.mp4`
4. User-provided explicit path
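
Below is a minimal discovery sketch, assuming folders follow the `YYYYMMDD` layout above; the root list and the `auto_discover_videos` helper are illustrative, not part of the shipped scripts.

```python
from pathlib import Path

# Candidate roots, checked in order (illustrative paths; adjust per machine).
SEARCH_ROOTS = [
    Path.home() / "Desktop" / "CARDV" / "Movie_F",
    Path(r"G:\My Drive\PROJECTS\INVESTIGATION\DASHCAM\Movie_F"),
    Path.cwd(),
]

def auto_discover_videos(date_str: str) -> list[Path]:
    """Return MP4s for a YYYYMMDD date from the first location that has any."""
    for root in SEARCH_ROOTS:
        for folder in (root / date_str, root):
            videos = sorted(folder.glob("*.[Mm][Pp]4"))
            if videos:
                return videos
    return []
```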

#### Step 3: Validate Files
```python
import shutil
from pathlib import Path
def validate_video_folder(folder_path):
    folder = Path(folder_path)
    if not folder.is_dir():
        raise FileNotFoundError(f"Folder not found: {folder}")
    videos = sorted(folder.glob("*.[Mm][Pp]4"))          # count MP4 files
    if not videos:
        raise ValueError(f"No MP4 files found in {folder}")
    if shutil.which("ffmpeg") is None:                    # verify FFmpeg installation
        raise EnvironmentError("ffmpeg not found on PATH")
    return videos  # duration and disk-space estimates follow via ffprobe
```

#### Step 4: Prompt for Missing Inputs
If files not found:
```
šŸŽ„ I need video files to transcribe. I couldn't find MP4s in the expected location.

Please provide ONE of the following:
1. **Direct folder path**: e.g., `C:\Users\yousu\Desktop\CARDV\Movie_F\20250903`
2. **Parent directory**: I'll search for date subfolders
3. **Single video file**: Process just one MP4

šŸ’” Tip: Dashcam folders are usually organized by date (YYYYMMDD)
```

#### Step 5: Configuration Summary & Confirmation
```
šŸ“‹ **Transcription Configuration Summary**

**Input:**
šŸ“ Folder: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903
šŸ“¹ Videos found: 47 MP4 files
ā±ļø Estimated audio: ~7.8 hours
šŸ’¾ Estimated output size: ~450 MB (transcripts only)

**Processing:**
šŸ–„ļø GPU: NVIDIA GeForce RTX 4080 (detected)
🧠 Model: faster-whisper base (FP16, CUDA)
šŸŽÆ Segmentation: fixed 30s chunks
šŸ—£ļø Diarization: disabled (opt-in)
šŸ“ Formats: txt, json, srt

**Output:**
šŸ’¾ Audio extracts: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903\audio\
šŸ“„ Transcripts: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903\transcripts\
šŸ“Š INDEX.csv: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903\transcripts\INDEX.csv

Ready to proceed? (Yes/No)
```

**NEVER begin processing without user confirmation.**

---

### 2. Audio Processing Pipeline

#### A. Audio Extraction (FFmpeg with Retry Matrix)

```bash
# Primary extraction command
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

# Retry sequence on failure:
# 1. Codec fallback: pcm_s16le → flac
# 2. Add demuxer args: -fflags +genpts -rw_timeout 30000000
# 3. Extended probe: -analyzeduration 100M -probesize 100M
```
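
A hedged Python sketch of that retry matrix, assuming plain `subprocess` calls to the system `ffmpeg`; the attempt table and `extract_audio_with_retry` are illustrative, not the shipped `audio_extractor.py` API.

```python
import subprocess
from pathlib import Path

# Each attempt is (codec, extra input args); later attempts loosen demuxer settings.
ATTEMPTS = [
    ("pcm_s16le", []),
    ("flac",      []),
    ("pcm_s16le", ["-fflags", "+genpts"]),
    ("pcm_s16le", ["-analyzeduration", "100M", "-probesize", "100M"]),
]

def extract_audio_with_retry(video: Path, out_dir: Path) -> Path:
    """Try each extraction variant in order; return the first successful output."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for codec, extra in ATTEMPTS:
        ext = "wav" if codec.startswith("pcm") else "flac"
        out = out_dir / f"{video.stem}.{ext}"
        cmd = ["ffmpeg", "-y", *extra, "-i", str(video),
               "-vn", "-acodec", codec, "-ar", "16000", "-ac", "1", str(out)]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0 and out.exists():
            return out
    raise RuntimeError(f"All extraction attempts failed for {video.name}")
```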

**Quality Checks:**
- Verify audio stream exists (ffprobe preflight)
- Check duration matches video duration
- Detect silent/corrupted audio
- Log extraction errors to `_FAILED.json`

#### B. Segmentation (Two Modes)

**Fixed Mode (Default):**
- Split audio into 30-second chunks
- Predictable processing time
- No external VAD required
- Best for continuous speech

**VAD Mode (Advanced):**
- Use Silero VAD to detect speech regions
- Variable-length segments (2-60s)
- Skip long silences
- Best for sparse audio (parking mode)

**Mutual Exclusion:** Only one mode active at a time.
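
For the fixed mode, chunk boundaries can be computed directly from the clip duration; the helper below is a minimal illustration, not the shipped segmenter.

```python
def fixed_chunks(duration_sec: float, chunk_len: float = 30.0):
    """Yield (start, end) windows covering the full recording."""
    start = 0.0
    while start < duration_sec:
        yield (start, min(start + chunk_len, duration_sec))
        start += chunk_len

# Example: a 75 s clip yields (0, 30), (30, 60), (60, 75).
```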

#### C. GPU Transcription (faster-whisper)

```python
# Load model with GPU optimization
model = WhisperModel(
    "base",
    device="cuda",
    compute_type="float16"
)

# Transcribe with word-level timestamps
segments, info = model.transcribe(
    audio_path,
    beam_size=5,
    word_timestamps=True,
    vad_filter=True
)
```

**GPU Metrics Captured** (a minimal sampling sketch follows this list):
- Device name, VRAM, utilization
- CUDA version, driver version
- Average GPU % during run (sampled at 1-2 Hz)
- Memory usage peaks
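
One way to collect these numbers is NVML via the `pynvml` package; the sketch below takes a single sample, which the pipeline would call at its 1-2 Hz cadence and average. The `sample_gpu_metrics` name is illustrative, and a real monitor would initialize NVML once rather than per sample.

```python
import pynvml

def sample_gpu_metrics(index: int = 0) -> dict:
    """Return one utilization/memory sample for the given GPU."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    metrics = {
        "gpu_pct": util.gpu,
        "mem_pct": util.memory,
        "mem_used_mb": mem.used // (1024 * 1024),
        "driver_version": pynvml.nvmlSystemGetDriverVersion(),
    }
    pynvml.nvmlShutdown()
    return metrics
```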

#### D. Speaker Diarization (Optional)

**Backends:**
- **pyannote**: State-of-the-art (requires HF token + VRAM)
- **speechbrain**: Good performance (no auth required)

**Label Normalization:**
- Different backends → unified `spkA`, `spkB`, etc. (see the normalization sketch below)
- Consistent across INDEX.csv and JSON outputs
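
A minimal normalization sketch, assuming each segment is a dict with an optional `speaker` key holding the backend's raw label (e.g. pyannote's `SPEAKER_00`); the function name is illustrative.

```python
from string import ascii_uppercase

def normalize_speaker_labels(segments: list[dict]) -> list[dict]:
    """Map raw backend labels to spkA, spkB, ... in order of first appearance."""
    mapping: dict[str, str] = {}
    for seg in segments:
        raw = seg.get("speaker")
        if raw is None:
            continue
        if raw not in mapping:
            mapping[raw] = "spk" + ascii_uppercase[len(mapping)]
        seg["speaker"] = mapping[raw]
    return segments
```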

**Fallback Behavior:**
- If HF token missing → skip diarization, log warning
- If OOM error → disable diarization, continue transcription

---

### 3. Output Generation

#### A. File Organization (Per-Day Structure)

```
C:\Users\yousu\Desktop\CARDV\Movie_F\
└── 20250903\
    ā”œā”€ā”€ audio\
    │   ā”œā”€ā”€ 20250903133516_059495B.wav
    │   ā”œā”€ā”€ 20250903134120_059496B.wav
    │   └── ... (47 files)
    ā”œā”€ā”€ transcripts\
    │   ā”œā”€ā”€ 20250903133516_059495B.txt
    │   ā”œā”€ā”€ 20250903133516_059495B.json
    │   ā”œā”€ā”€ 20250903133516_059495B.srt
    │   └── ... (47 Ɨ 3 = 141 files)
    └── INDEX.csv
```

#### B. Format Details

**TXT (Plain Text):**
```
[00:00:15] Speaker A: Hey, where are we going?
[00:00:18] Speaker B: Just heading to the mall.
[00:00:22] Speaker A: Okay, sounds good.
```

**JSON (Complete Metadata):**
```json
{
  "video_file": "20250903133516_059495B.MP4",
  "audio_duration_sec": 60,
  "language": "en",
  "language_confidence": 0.95,
  "segments": [
    {
      "start": 15.2,
      "end": 17.8,
      "text": "Hey, where are we going?",
      "confidence": 0.89,
      "speaker": "spkA",
      "words": [
        {"word": "Hey", "start": 15.2, "end": 15.4, "confidence": 0.92},
        {"word": "where", "start": 15.5, "end": 15.8, "confidence": 0.88}
      ]
    }
  ]
}
```

**SRT (SubRip Subtitles):**
```
1
00:00:15,200 --> 00:00:17,800
[spkA] Hey, where are we going?

2
00:00:18,000 --> 00:00:20,500
[spkB] Just heading to the mall.
```

**VTT (WebVTT):**
```
WEBVTT

00:00:15.200 --> 00:00:17.800
<v spkA>Hey, where are we going?

00:00:18.000 --> 00:00:20.500
<v spkB>Just heading to the mall.
```
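
The only difference between the two timecode styles is the millisecond separator (comma for SRT, period for VTT), as the sketch below illustrates; `format_timecode` is a hypothetical helper, not one of the listed scripts.

```python
def format_timecode(seconds: float, sep: str) -> str:
    """Render seconds as HH:MM:SS<sep>mmm (SRT uses ',', VTT uses '.')."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{ms:03d}"

# format_timecode(15.2, ",") -> "00:00:15,200"  (SRT)
# format_timecode(15.2, ".") -> "00:00:15.200"  (VTT)
```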

#### C. INDEX.csv (Global Search Index)

Composite key: `(video_rel, seg_idx)`. A minimal append sketch follows the column table.

| Column | Description |
|--------|-------------|
| `dataset` | Movie_F / Movie_R / Park_F / Park_R |
| `date` | YYYYMMDD |
| `video_rel` | Relative path from root |
| `video_stem` | Filename without extension |
| `seg_idx` | 0-based segment index |
| `ts_start_ms` | Segment start milliseconds |
| `ts_end_ms` | Segment end milliseconds |
| `text` | Transcript text (truncated to 512 chars) |
| `text_len` | Full text length |
| `lang` | ISO language code |
| `lang_conf` | Language detection confidence |
| `conf_avg` | Average token confidence |
| `speaker` | Normalized speaker label |
| `format_mask` | Files generated (txt/json/srt/vtt) |
| `transcript_file` | Basename |
| `audio_file` | Basename |
| `engine` | e.g., `faster-whisper:base:fp16` |
| `cuda_version` | CUDA version |
| `driver_version` | Driver version |
| `created_utc` | ISO 8601 timestamp |
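
A hedged sketch of appending rows with the standard `csv` module, using an illustrative subset of the columns above; the real writer emits every listed column.

```python
import csv
from pathlib import Path

# Illustrative subset of the INDEX.csv columns documented above.
FIELDS = ["dataset", "date", "video_rel", "seg_idx",
          "ts_start_ms", "ts_end_ms", "text", "speaker"]

def append_to_index(rows: list[dict], index_csv: Path) -> None:
    """Append segment rows, writing the header only when the file is new."""
    new_file = not index_csv.exists()
    with index_csv.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        for row in rows:
            row["text"] = row["text"][:512]   # truncate per output settings
            writer.writerow(row)
```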

#### D. Results JSON (Single Source of Truth)

```json
{
  "status": "ok",
  "summary": {
    "videos_processed": 47,
    "segments": 1847,
    "hours_audio": 7.8,
    "gpu_detected": true,
    "device_count": 1,
    "devices": [
      {
        "index": 0,
        "name": "NVIDIA GeForce RTX 4080",
        "total_mem_mb": 16384,
        "free_mem_mb": 14200
      }
    ],
    "utilization": {
      "gpu_pct": 35,
      "mem_pct": 42,
      "sampling_hz": 2
    },
    "cuda_version": "12.1",
    "driver_version": "546.01",
    "torch_version": "2.2.0+cu121",
    "errors": 0,
    "failed_files": []
  },
  "artifacts": {
    "index_csv": "C:\\Users\\yousu\\Desktop\\CARDV\\Movie_F\\20250903\\INDEX.csv",
    "output_dir": "C:\\Users\\yousu\\Desktop\\CARDV\\Movie_F\\20250903\\transcripts"
  }
}
```

---

### 4. Quality & Error Handling

#### A. Resume Safety
- Skip existing transcripts unless the `--force` flag is set (see the skip check sketched below)
- Idempotent: re-running is safe
- Checkpoint support for long runs
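
A minimal sketch of the skip rule, assuming the JSON transcript is the marker of a completed video; the helper name is illustrative.

```python
from pathlib import Path

def should_skip(video_stem: str, transcript_dir: Path, force: bool = False) -> bool:
    """A video counts as done when its JSON transcript already exists."""
    return (transcript_dir / f"{video_stem}.json").exists() and not force
```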

#### B. Error Types & Recovery

**Per-Video Failures** (`{video_stem}_FAILED.json`):
```json
{
  "video_path": "C:\\...\\video.mp4",
  "error_type": "ffmpeg_err",
  "error_message": "Failed to decode audio stream",
  "ffprobe_metadata": {"duration": null, "codec": "h264"},
  "timestamp": "2025-09-03T14:30:00Z"
}
```

Error types (a minimal failure-log writer is sketched after this list):
- `ffmpeg_err`: Audio extraction failed
- `decode_err`: Whisper decode failed
- `OOM`: Out of GPU memory
- `corrupted`: Container/stream corrupted
- `no_audio`: No audio stream detected
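
A sketch of writing that failure record, following the schema shown above; `log_failed` is an illustrative name, and the optional `ffprobe_metadata` field can be merged in when available.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_failed(video_path: Path, error_type: str, message: str, out_dir: Path) -> None:
    """Write {video_stem}_FAILED.json so the batch can continue past this file."""
    record = {
        "video_path": str(video_path),
        "error_type": error_type,
        "error_message": message,
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
    out_path = out_dir / f"{video_path.stem}_FAILED.json"
    out_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
```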

#### C. SRT/VTT Validation
- Strictly monotonic timestamps
- No overlapping segments
- Clamp gaps under 50 ms between consecutive cues (one reading of this rule is sketched below)
- Proper timecode formatting (comma vs period)
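
A minimal sketch of one interpretation of these rules, assuming segments are dicts with `start`/`end` seconds: overlaps and sub-50 ms gaps are snapped to the previous cue's end, and negative durations are guarded against. The exact clamping policy in `validation.py` may differ.

```python
GAP_CLAMP_SEC = 0.05  # gaps under 50 ms are snapped closed

def clamp_segments(segments: list[dict]) -> list[dict]:
    """Enforce monotonic, non-overlapping timestamps before writing SRT/VTT."""
    fixed, prev_end = [], 0.0
    for seg in segments:
        start, end = seg["start"], seg["end"]
        if start < prev_end + GAP_CLAMP_SEC:   # overlap or tiny gap: snap to previous end
            start = prev_end
        end = max(end, start)                  # guard against negative durations
        fixed.append({**seg, "start": start, "end": end})
        prev_end = end
    return fixed
```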

---

## Implementation Guide

### Phase 1: Input Acquisition
```python
# 1. Parse user request
inputs = parse_user_request(user_message)

# 2. Auto-discover video files
if not inputs['video_folder']:
    inputs['video_folder'] = auto_discover_videos()

# 3. Validate inputs
validate_video_folder(inputs['video_folder'])
check_ffmpeg_available()
check_gpu_available()

# 4. Estimate resource requirements
estimate_processing_time(inputs)
estimate_disk_space(inputs)

# 5. Present configuration summary
show_configuration_summary(inputs)

# 6. Wait for confirmation
if not user_confirms():
    return  # Do not proceed
```

### Phase 2: Audio Extraction
```python
for video_file in video_files:
    # FFprobe preflight check
    metadata = ffprobe(video_file)
    if not has_audio_stream(metadata):
        log_failed(video_file, "no_audio")
        continue

    # Extract audio with retry
    try:
        audio_path = extract_audio_ffmpeg(
            video_file,
            output_dir=audio_output_dir,
            sample_rate=16000,
            channels=1
        )
    except FFmpegError as e:
        # Retry with fallback codec
        audio_path = extract_audio_ffmpeg_retry(video_file)
```

### Phase 3: Transcription
```python
# Load model once (reuse for all files)
model = load_whisper_model(
    model_size="base",
    device="cuda",
    compute_type="float16"
)

for audio_file in audio_files:
    # Segment audio
    if segmentation_mode == "fixed":
        chunks = segment_fixed(audio_file, chunk_size=30)
    else:
        chunks = segment_vad(audio_file, vad_model)

    # Transcribe each chunk
    all_segments = []
    for chunk in chunks:
        segments, info = model.transcribe(chunk)
        all_segments.extend(segments)

    # Optional: Diarization
    if diarization_enabled:
        all_segments = apply_diarization(audio_file, all_segments)
```

### Phase 4: Output Generation
```python
# Generate all formats
for video_file, segments in results.items():
    # TXT
    write_txt(segments, output_dir)

    # JSON
    write_json(segments, metadata, output_dir)

    # SRT
    srt_content = generate_srt(segments)
    validate_srt_monotonic(srt_content)
    write_srt(srt_content, output_dir)

    # VTT (optional)
    write_vtt(segments, output_dir)

    # Update INDEX.csv
    append_to_index(segments, index_csv_path)
```

### Phase 5: Completion Report
```python
# Generate results JSON
results_json = {
    "status": "ok",
    "summary": collect_statistics(),
    "artifacts": list_output_files(),
    "gpu_metrics": get_gpu_metrics()
}

# Save to file
save_results_json(results_json, output_dir)

# Report to user
print(f"āœ… Complete! Processed {video_count} videos")
print(f"   Transcripts: {output_dir}")
print(f"   INDEX: {index_csv_path}")
print(f"   GPU Util: {avg_gpu_pct}%")
```

---

## Reference Materials

### In This Skill

- **SKILL_MANIFEST.md** - Complete technical specification (v2.0)
- **references/TECHNICAL_SPECIFICATION.md** - Detailed implementation rules
- **scripts/batch_transcriber.py** - Main batch processing script
- **scripts/audio_extractor.py** - FFmpeg wrapper with retry logic
- **scripts/transcriber.py** - Whisper transcription engine
- **scripts/diarizer.py** - Speaker diarization integration
- **scripts/format_writers.py** - TXT/JSON/SRT/VTT generators
- **scripts/gpu_monitor.py** - GPU metrics collection
- **scripts/validation.py** - Input validation and checks
- **assets/config_template.json** - Default configuration
- **assets/params.json** - Tunable parameters

### External Documentation

- faster-whisper documentation
- FFmpeg audio processing guide
- pyannote.audio diarization guide
- SRT/VTT subtitle format specifications

---

## Tunable Parameters

```json
{
  "whisper": {
    "model_size": "base",
    "device": "cuda",
    "compute_type": "float16",
    "batch_size": 8,
    "beam_size": 5,
    "language": "en",
    "detect_language": false
  },
  "audio": {
    "sample_rate": 16000,
    "channels": 1,
    "format": "wav",
    "keep_intermediate": false
  },
  "segmentation": {
    "mode": "fixed",
    "chunk_length_sec": 30,
    "vad_min_len_sec": 2,
    "vad_max_len_sec": 60
  },
  "diarization": {
    "enabled": false,
    "backend": "pyannote",
    "min_speakers": 1,
    "max_speakers": 10
  },
  "output": {
    "formats": ["txt", "json", "srt"],
    "text_truncate_csv": 512
  },
  "parallel": {
    "max_workers": 3
  }
}
```
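
A sketch of how these defaults might be merged with per-run overrides, assuming the JSON above is stored as `assets/params.json`; the `load_params` helper and shallow-merge strategy are illustrative.

```python
import json
from pathlib import Path

def load_params(defaults_path: Path, overrides: dict | None = None) -> dict:
    """Shallow-merge per-section overrides onto the shipped defaults."""
    params = json.loads(defaults_path.read_text(encoding="utf-8"))
    for section, values in (overrides or {}).items():
        params.setdefault(section, {}).update(values)
    return params

# Example: run a small-VRAM job without editing params.json.
params = load_params(Path("assets/params.json"),
                     {"whisper": {"model_size": "tiny", "batch_size": 4}})
```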

---

## Common Issues & Solutions

### Issue 1: "GPU not detected"
**Cause:** CUDA not installed or incompatible driver
**Solution:**
1. Check: `python -c "import torch; print(torch.cuda.is_available())"`
2. Install/update CUDA toolkit
3. Fallback to CPU: `--device cpu`

### Issue 2: "FFmpeg command failed"
**Cause:** FFmpeg not in PATH or unsupported codec
**Solution:**
1. Verify: `ffmpeg -version`
2. Install from ffmpeg.org
3. Use retry matrix with codec fallback

### Issue 3: "Out of memory (OOM)"
**Cause:** GPU VRAM insufficient for model + batch size
**Solution:**
1. Use smaller model: `tiny` or `small`
2. Reduce batch size: `--batch 4`
3. Process fewer files in parallel

### Issue 4: "Diarization failed"
**Cause:** HF token missing or network error
**Solution:**
1. Set token: `export HF_TOKEN=hf_...`
2. Accept pyannote license on HuggingFace
3. Disable diarization: `--no-diarize`

### Issue 5: "SRT validation errors"
**Cause:** Overlapping timestamps or malformed timecodes
**Solution:**
1. Enable timestamp clamping: `--clamp-gaps`
2. Check for negative durations
3. Validate with subtitle validator tool

---

## Security & Privacy Notes

### Data Sensitivity
Dashcam audio may contain:
- Personal conversations
- Addresses and locations
- Phone numbers and names
- Private information

### Processing Guidelines
1. **Local Processing Only** - Never upload audio to external services
2. **Secure Storage** - Encrypt transcripts if sharing devices
3. **Redaction** - Use `--redact` flag for PII patterns (phone, email)
4. **Retention** - Delete audio extracts after transcription if not needed

### Investigation Use
- Designed for legitimate personal data analysis
- NOT an anti-forensics tool
- All conclusions require independent corroboration

---

## Skill Invocation

This skill is invoked when the model detects:
1. User mentions "transcribe audio", "dashcam transcription", or "extract speech"
2. User requests processing of video/MP4 files for audio content
3. User provides paths to video folders
4. User asks for subtitles or timestamped transcripts

---

## Success Criteria

A successful audio transcription must:

āœ… Obtain all required inputs from user (video folder, output preferences)
āœ… Validate all inputs before processing (files exist, FFmpeg available, GPU detected)
āœ… Present configuration summary and get confirmation
āœ… Extract audio successfully (or log failures)
āœ… Transcribe with GPU acceleration (or CPU fallback)
āœ… Generate all requested formats (TXT, JSON, SRT, VTT)
āœ… Create INDEX.csv with searchable metadata
āœ… Include GPU metrics in results JSON
āœ… Report output locations to user

**Key Principle:** Never guess critical inputs. Always validate, confirm, and provide clear feedback.

---

## Version History

- **v2.0** (2025-10-26) - Production-ready skill with GPU-first architecture
- **v1.5** (2025-10-25) - Added diarization support and retry matrix
- **v1.0** (2025-10-20) - Initial release with basic transcription

---

**Last Updated:** 2025-10-26
**Status:** Production Ready
**Maintained By:** Audio Transcription Pipeline Project

Overview

This skill extracts audio from dashcam MP4 files and produces GPU-accelerated, timestamped transcripts with optional speaker diarization. It organizes outputs by date, generates multi-format artifacts (TXT/JSON/SRT/VTT), and produces an INDEX.csv plus a results summary with GPU and quality metrics. Designed for reliable batch processing of date-organized video folders and single files.

How this skill works

The skill auto-discovers or accepts a user-specified folder of MP4s, validates files and environment (FFmpeg, GPU), then extracts audio via FFmpeg with a retry matrix. Audio is segmented (fixed chunks or VAD-based), transcribed on GPU using faster-whisper with word timestamps, and optionally passed to a diarization backend (pyannote or speechbrain). Outputs and metadata are written per-day with error handling, resume safety, and a global INDEX.csv and results JSON containing GPU metrics and processing statistics.

When to use it

  • You need transcripts from dashcam MP4 footage or other video files.
  • You want timestamped transcripts with optional speaker labels (diarization).
  • You need batch processing for a folder or date range of videos.
  • You want GPU-accelerated transcription for faster throughput.
  • You need multi-format outputs (TXT/JSON/SRT/VTT) and a searchable INDEX.csv.

Best practices

  • Always confirm the configuration summary before starting to avoid accidental processing.
  • Provide an explicit folder path or date when possible to speed auto-discovery.
  • Enable diarization only if a valid HF token and sufficient VRAM are available; otherwise use speechbrain or skip.
  • Use fixed 30s segmentation for continuous speech; use VAD mode for sparse audio like parking recordings.
  • Keep FFmpeg and CUDA drivers up to date and ensure adequate disk space for audio extracts and outputs.

Example use cases

  • Forensic review: transcribe a day of dashcam videos and search INDEX.csv for keywords and speaker turns.
  • Incident analysis: extract audio, generate timestamped transcripts and SRT subtitles for sharing with investigators.
  • Bulk processing: run on a date-organized folder to produce per-video JSON with word-level timestamps and GPU metrics.
  • Quality auditing: re-run with --force to regenerate transcripts after model or parameter changes while preserving FAILED logs.
  • Subtitling: create SRT/VTT with normalized speaker labels for video publishing or evidence presentation.

FAQ

Do you start processing immediately after the user request?

No. I always present a configuration summary and require explicit user confirmation before any processing begins.

What happens if a video has no audio or extraction fails?

The tool logs a {video_stem}_FAILED.json with error_type (e.g., no_audio, ffmpeg_err) and continues; failed files are listed in the final results JSON.

Is a GPU required?

A GPU is recommended for speed and is used when available; CPU-only runs are possible but much slower. GPU metrics are captured in the results JSON.