
This skill helps you manage audio systems for Leavn and Modcaster, enabling real-time TTS, caption syncing, recording, and fingerprinting across devices.

npx playbooks add skill willsigmon/sigstack --skill audio-expert

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md
---
name: Audio Expert
description: Leavn and Modcaster audio systems - TTS, streaming, recording, fingerprinting, ML processing, caption sync
allowed-tools: Read, Edit, Grep
---

# Audio Expert

Audio architecture for Leavn (Bible app) and Modcaster (podcast app).

## Leavn Audio Systems

### Core Services
- `AudioPlaybackService` - TTS + verse reading
- `GuidedAudioOrchestrator` - Guided mode coordination
- `SermonRecordingService` - Sermon capture
- `StreamingTTSClient` - Real-time TTS
- `ChatterboxKit` - Voice profiles

### Caption & Timing
- `CaptionTimecodeAligner` - Precision sync
- `CaptionSyncCoordinator` - State machine
- Verse boundary detection

### Audio Graph
- `AudioGraphManager` initialization
- Scene modes: CarPlay, background, etc.

### Common Fixes
- Audio conflicts/interruptions
- TTS voice selection
- Recording state management
- Caption synchronization

## Modcaster Audio Systems

### Audio Fingerprinting
Spectral peak extraction (Shazam-style):
1. Apply FFT using vDSP
2. Extract spectral peaks
3. Create constellation map
4. Hash into compact fingerprint

Use cases: Intro/outro detection, ad identification, cross-show matching
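
Steps 2–4 reduce to pairing each anchor peak with a few nearby target peaks and packing (anchor frequency, target frequency, time delta) into a compact hash. A minimal sketch of that pairing stage (struct name, fan-out, and bit layout are illustrative, not Modcaster's actual values):

```swift
import Foundation

// A spectral peak: FFT bin index (frequency) and frame index (time).
struct Peak {
    let bin: Int
    let frame: Int
}

// Pair each anchor peak with up to `fanOut` later peaks inside a
// time window and pack each pair into a 32-bit hash:
// 12 bits anchor bin | 12 bits target bin | 8 bits frame delta.
func fingerprint(peaks: [Peak], fanOut: Int = 5, maxDelta: Int = 64) -> [UInt32] {
    var hashes: [UInt32] = []
    for (i, anchor) in peaks.enumerated() {
        for target in peaks.dropFirst(i + 1).prefix(fanOut) {
            let delta = target.frame - anchor.frame
            guard delta > 0, delta < maxDelta else { continue }
            let h = UInt32(anchor.bin & 0xFFF) << 20
                  | UInt32(target.bin & 0xFFF) << 8
                  | UInt32(delta & 0xFF)
            hashes.append(h)
        }
    }
    return hashes
}
```

Matching then becomes a lookup of these hashes against an index keyed by (hash, episode, offset), which is what makes cross-show and ad matching cheap.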

### On-Device ML
- `AVAudioEngine` tap installation, buffer processing
- `Core ML` models for voice enhancement, noise reduction
- `Sound Analysis` for speech/music classification
- Neural Engine utilization
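
The tap installation mentioned above can be sketched as follows; the class name is illustrative and the 1024-frame buffer size is an untuned example:

```swift
import AVFoundation

// Install a tap on the engine's input node so raw PCM buffers
// reach the ML/analysis pipeline without modifying playback.
final class AnalysisTap {
    private let engine = AVAudioEngine()

    func start(onBuffer: @escaping (AVAudioPCMBuffer) -> Void) throws {
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)
        // The callback fires on an internal audio queue; keep it cheap
        // and hand buffers off to a processing queue.
        input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            onBuffer(buffer)
        }
        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}
```

Installing the tap before adding effect nodes preserves the unprocessed signal for Core ML models and Sound Analysis requests downstream.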

### Content Classification
- Episode type: full/trailer/bonus
- Ad segment detection
- Intro/outro recognition
- Speech vs music separation
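
For the speech-vs-music item, Apple's built-in sound classifier can be driven directly from tapped buffers. A sketch using the SoundAnalysis framework (the class name is illustrative; the built-in `.version1` classifier reports labels such as speech and music among many others):

```swift
import AVFoundation
import SoundAnalysis

// Stream tapped buffers into the system sound classifier and
// receive ranked labels per analysis window.
final class SpeechMusicClassifier: NSObject, SNResultsObserving {
    private var analyzer: SNAudioStreamAnalyzer?
    var onClassification: (_ label: String, _ confidence: Double) -> Void = { _, _ in }

    func start(format: AVAudioFormat) throws {
        let analyzer = SNAudioStreamAnalyzer(format: format)
        let request = try SNClassifySoundRequest(classifierIdentifier: .version1)
        try analyzer.add(request, withObserver: self)
        self.analyzer = analyzer
    }

    // Call from the audio tap with each buffer and its frame position.
    func analyze(_ buffer: AVAudioPCMBuffer, at framePosition: AVAudioFramePosition) {
        analyzer?.analyze(buffer, atAudioFramePosition: framePosition)
    }

    // SNResultsObserving: ranked classifications for the current window.
    func request(_ request: SNRequest, didProduce result: SNResult) {
        guard let result = result as? SNClassificationResult,
              let top = result.classifications.first else { return }
        onClassification(top.identifier, top.confidence)
    }
}
```

The per-window labels can then be smoothed over time to segment episodes into speech and music regions for chaptering or ad detection.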

Overview

This skill describes an audio infrastructure blueprint for Leavn (Bible app) and Modcaster (podcast app). It covers TTS, streaming, recording, fingerprinting, on-device ML, and caption synchronization to build reliable playback, capture, and content analysis pipelines. The emphasis is practical: concrete services, common fixes, and component interactions for production mobile audio systems.

How this skill works

The architecture splits responsibilities into focused services: playback and TTS, guided orchestration, sermon recording, streaming TTS, and voice profile management. Caption alignment and stateful caption sync ensure precise verse timing. For podcast workflows, the system extracts spectral peaks, hashes constellation maps into compact fingerprints, and runs on-device ML taps for enhancement and classification.

When to use it

  • Implementing TTS-driven verse reading and real-time narration in a mobile app.
  • Building reliable sermon recording and background audio capture with state management.
  • Detecting intros, outros, and ad breaks across a podcast catalog via fingerprinting.
  • On-device noise reduction and voice enhancement without continuous server roundtrips.
  • Ensuring word-accurate caption timing and synchronization for dynamic content.

Best practices

  • Isolate audio responsibilities into small services (playback, recording, orchestration) to reduce state coupling.
  • Use a state machine for caption sync to handle interruptions and resume correctly.
  • Prioritize low-latency, incremental fingerprint hashing for streaming detection and cross-show matching.
  • Install audio taps early in the AVAudioEngine pipeline to preserve raw buffers for ML and analysis.
  • Select TTS voices dynamically based on device capability and session context to avoid abrupt switches.
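
The state-machine recommendation above can be sketched as an explicit transition function; the states and events are illustrative, not Leavn's actual implementation:

```swift
// Minimal caption-sync state machine. Explicit states make
// interruption and resume behavior auditable and testable.
enum CaptionSyncState: Equatable {
    case idle
    case playing(verse: Int)
    case interrupted(resumeVerse: Int)
    case finished
}

enum CaptionSyncEvent {
    case play(verse: Int)
    case interruption
    case resume
    case complete
}

func transition(_ state: CaptionSyncState, _ event: CaptionSyncEvent) -> CaptionSyncState {
    switch (state, event) {
    case (.idle, .play(let v)), (.playing, .play(let v)):
        return .playing(verse: v)
    case (.playing(let v), .interruption):
        return .interrupted(resumeVerse: v)   // remember where we were
    case (.interrupted(let v), .resume):
        return .playing(verse: v)             // resume at the same verse
    case (.playing, .complete):
        return .finished
    default:
        return state                          // ignore invalid events
    }
}
```

Because every (state, event) pair is enumerated, an interruption during verse playback can only land in a state that records the resume point, which is exactly the property ad hoc flag-based sync code tends to lose.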

Example use cases

  • Leavn: synchronized verse-by-verse TTS with precise caption timing and guided listening modes.
  • Modcaster: identify recurring ad segments across episodes using spectral fingerprinting.
  • Background recording that survives audio interruptions and correctly resumes stateful captures.
  • On-device pipeline for speech/music separation to drive chaptering or monetization decisions.
  • Real-time streaming TTS for live narration or accessibility features with low latency.

FAQ

How do I avoid audio interruptions when switching scenes?

Centralize audio session management in an AudioGraphManager and use explicit scene modes (CarPlay, background) with prioritized resource handling to prevent conflicts.
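
One way to centralize this is a single session-configuration function keyed by scene mode; the categories and options below are plausible defaults for a TTS-plus-recording app, not Leavn's actual settings:

```swift
import AVFoundation

// One configuration point for the shared audio session, so no two
// components fight over categories or activation.
enum SceneMode {
    case foreground, background, carPlay, recording
}

func configureSession(for mode: SceneMode) throws {
    let session = AVAudioSession.sharedInstance()
    switch mode {
    case .foreground, .carPlay:
        try session.setCategory(.playback, mode: .spokenAudio)
    case .background:
        // .playback keeps TTS alive when the app is backgrounded;
        // ducking others is a choice, not a requirement.
        try session.setCategory(.playback, mode: .spokenAudio,
                                options: [.duckOthers])
    case .recording:
        try session.setCategory(.playAndRecord, mode: .default,
                                options: [.allowBluetooth])
    }
    try session.setActive(true)
}
```

Routing every scene change through this one function also gives the AudioGraphManager a natural place to tear down or rebuild nodes when the category changes.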

Where should I run ML models for noise reduction?

Prefer Core ML models executed on the Neural Engine with AVAudioEngine taps for buffer capture; this keeps processing local and minimizes network dependency.