
This skill implements on-device speech-to-text with SpeechAnalyzer, delivering live and file transcription while preserving timing and privacy.

npx playbooks add skill charleswiltgen/axiom --skill axiom-ios-ml

Review the files below or copy the command above to add this skill to your agents.

Files (5)
SKILL.md
3.6 KB
---
name: speech
description: Use when implementing speech-to-text, live transcription, or audio transcription. Covers SpeechAnalyzer (iOS 26+), SpeechTranscriber, volatile/finalized results, AssetInventory model management, audio format handling.
version: 1.0.0
---

# Speech-to-Text with SpeechAnalyzer

## Overview

SpeechAnalyzer is Apple's new speech-to-text API introduced in iOS 26. It powers Notes, Voice Memos, Journal, and Call Summarization. The on-device model is faster, more accurate, and better for long-form/distant audio than SFSpeechRecognizer.

**Key principle**: SpeechAnalyzer is modular—add transcription modules to an analysis session. Results stream asynchronously using Swift's AsyncSequence.

## Decision Tree - SpeechAnalyzer vs SFSpeechRecognizer

```
Need speech-to-text?
  ├─ iOS 26+ only?
  │   └─ Yes → SpeechAnalyzer (preferred)
  ├─ Need iOS 10-25 support?
  │   └─ Yes → SFSpeechRecognizer (or DictationTranscriber)
  ├─ Long-form audio (meetings, lectures)?
  │   └─ Yes → SpeechAnalyzer
  ├─ Distant audio (across room)?
  │   └─ Yes → SpeechAnalyzer
  └─ Short dictation commands?
      └─ Either works
```

**SpeechAnalyzer advantages**:
- Better for long-form and conversational audio
- Works well with distant speakers (meetings)
- On-device, private
- Model managed by system (no app size increase)
- Powers Notes, Voice Memos, Journal

**DictationTranscriber** (iOS 26+): Supports the same languages as SFSpeechRecognizer, but doesn't require the user to enable Siri or dictation in Settings.
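
If you need to support releases before iOS 26, branch on availability. A minimal sketch (`transcribeWithSpeechAnalyzer` and `transcribeWithSFSpeechRecognizer` are hypothetical stand-ins for your own implementations):

```swift
// Sketch: route to the new or legacy API based on OS availability
func transcribeAny(file: URL, locale: Locale) async throws -> String {
    if #available(iOS 26.0, *) {
        // New modular API: SpeechAnalyzer + SpeechTranscriber
        return try await transcribeWithSpeechAnalyzer(file: file, locale: locale)
    } else {
        // Legacy path for iOS 10-25
        return try await transcribeWithSFSpeechRecognizer(file: file, locale: locale)
    }
}
```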

## Red Flags

Use this skill when you see:
- "Live transcription"
- "Transcribe audio"
- "Speech-to-text"
- "SpeechAnalyzer" or "SpeechTranscriber"
- "Volatile results"
- Building Notes-like or Voice Memos-like features

## Pattern 1 - File Transcription (Simplest)

Transcribe an audio file to text in one function.

```swift
import Speech

func transcribe(file: URL, locale: Locale) async throws -> AttributedString {
    // Set up transcriber
    let transcriber = SpeechTranscriber(
        locale: locale,
        preset: .offlineTranscription
    )

    // Collect results asynchronously
    async let transcriptionFuture = try transcriber.results
        .reduce(AttributedString()) { str, result in
            str + result.text
        }

    // Set up analyzer with transcriber module
    let analyzer = SpeechAnalyzer(modules: [transcriber])

    // Analyze the file
    if let lastSample = try await analyzer.analyzeSequence(from: file) {
        try await analyzer.finalizeAndFinish(through: lastSample)
    } else {
        await analyzer.cancelAndFinishNow()
    }

    return try await transcriptionFuture
}
```

**Key points**:
- `analyzeSequence(from:)` reads file and feeds audio to analyzer
- `finalizeAndFinish(through:)` ensures all results are finalized
- Results are `AttributedString` with timing metadata
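
Calling it from a task is a one-liner; for example (the file URL and locale are illustrative):

```swift
// Example usage of the transcribe(file:locale:) function above
let fileURL = URL(fileURLWithPath: "/path/to/recording.m4a")  // illustrative path
let transcript = try await transcribe(file: fileURL, locale: Locale(identifier: "en-US"))
print(String(transcript.characters))  // Plain text; timing lives in the attributed runs
```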

## Pattern 2 - Live Transcription Setup

For real-time transcription from microphone.

### Step 1 - Configure SpeechTranscriber

```swift
import Speech

class TranscriptionManager: ObservableObject {
    private var transcriber: SpeechTranscriber?
    private var analyzer: SpeechAnalyzer?
    private var analyzerFormat: AVAudioFormat?
    private var inputBuilder: AsyncStream<AnalyzerInput>.Continuation?
    private var recognizerTask: Task<Void, Never>?

    @Published var finalizedTranscript = AttributedString()
    @Published var volatileTranscript = AttributedString()

    func setUp() async throws {
        // Create transcriber with options
        transcriber = SpeechTranscriber(
            locale: Locale.current,
            transcriptionOptions: [],
            reportingOptions: [.volatileResults],  // Enable real-time updates
            attributeOptions: [.audioTimeRange]     // Include timing
        )

        guard let transcriber else { throw TranscriptionError.setupFailed }

        // Create analyzer with transcriber module
        analyzer = SpeechAnalyzer(modules: [transcriber])

        // Get required audio format
        analyzerFormat = await SpeechAnalyzer.bestAvailableAudioFormat(
            compatibleWith: [transcriber]
        )

        // Ensure model is available
        try await ensureModel(for: transcriber)

        // Create input stream
        let (stream, builder) = AsyncStream<AnalyzerInput>.makeStream()
        inputBuilder = builder

        // Start analyzer
        try await analyzer?.start(inputSequence: stream)
    }
}
```
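
The snippets in this skill throw an app-defined `TranscriptionError`; it is not part of the Speech framework. A minimal sketch covering the cases used here:

```swift
// App-defined error type assumed by the snippets in this skill
enum TranscriptionError: Error {
    case setupFailed
    case localeNotSupported
    case notSetUp
    case conversionFailed
}
```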

### Step 2 - Ensure Model Availability

```swift
func ensureModel(for transcriber: SpeechTranscriber) async throws {
    let locale = Locale.current

    // Check if language is supported
    let supported = await SpeechTranscriber.supportedLocales
    guard supported.contains(where: {
        $0.identifier(.bcp47) == locale.identifier(.bcp47)
    }) else {
        throw TranscriptionError.localeNotSupported
    }

    // Check if model is installed
    let installed = await SpeechTranscriber.installedLocales
    if installed.contains(where: {
        $0.identifier(.bcp47) == locale.identifier(.bcp47)
    }) {
        return  // Already installed
    }

    // Download model
    if let downloader = try await AssetInventory.assetInstallationRequest(
        supporting: [transcriber]
    ) {
        // Optionally surface downloader.progress (a Foundation Progress) in the UI
        try await downloader.downloadAndInstall()
    }
}
```

**Note**: Models are stored in system storage, not app storage, so they don't add to your app's download size. Only a limited number of languages can be allocated at once.
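
To give users download feedback, keep a reference to `downloader.progress` (a Foundation `Progress`) and observe it from the UI. A sketch, assuming a `@Published` property added to `TranscriptionManager`:

```swift
// Add to TranscriptionManager:
//   @Published var downloadProgress: Progress?   // observed by the UI

func downloadModelIfNeeded(for transcriber: SpeechTranscriber) async throws {
    guard let downloader = try await AssetInventory.assetInstallationRequest(
        supporting: [transcriber]
    ) else { return }  // Nothing to download

    downloadProgress = downloader.progress
    try await downloader.downloadAndInstall()
    downloadProgress = nil
}
```

A view can then show `ProgressView(progress)` while `downloadProgress` is non-nil.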

### Step 3 - Handle Results

```swift
func startResultHandling() {
    recognizerTask = Task {
        guard let transcriber else { return }

        do {
            for try await result in transcriber.results {
                let text = result.text

                if result.isFinal {
                    // Finalized result - won't change
                    finalizedTranscript += text
                    volatileTranscript = AttributedString()

                    // Access timing info
                    for run in text.runs {
                        if let timeRange = run.audioTimeRange {
                            print("Time: \(timeRange)")
                        }
                    }
                } else {
                    // Volatile result - will be replaced
                    volatileTranscript = text
                }
            }
        } catch {
            print("Transcription failed: \(error)")
        }
    }
}
```
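
One common presentation appends the volatile text after the finalized text in a dimmed style so users can see it may still change. A minimal SwiftUI sketch against the manager above:

```swift
import SwiftUI

struct LiveTranscriptView: View {
    @ObservedObject var manager: TranscriptionManager

    var body: some View {
        Text(displayTranscript)
    }

    // Finalized text in the normal style, volatile text dimmed to signal
    // that it may still be revised
    private var displayTranscript: AttributedString {
        var volatile = manager.volatileTranscript
        volatile.foregroundColor = .secondary
        return manager.finalizedTranscript + volatile
    }
}
```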

## Pattern 3 - Audio Recording and Streaming

Connect AVAudioEngine to SpeechAnalyzer.

```swift
import AVFoundation

class AudioRecorder {
    private let audioEngine = AVAudioEngine()
    private var outputContinuation: AsyncStream<AVAudioPCMBuffer>.Continuation?
    private let transcriptionManager: TranscriptionManager

    init(transcriptionManager: TranscriptionManager) {
        self.transcriptionManager = transcriptionManager
    }

    func startRecording() async throws {
        // Request permission
        guard await AVAudioApplication.requestRecordPermission() else {
            throw RecordingError.permissionDenied
        }

        // Configure audio session (iOS)
        #if os(iOS)
        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.playAndRecord, mode: .spokenAudio)
        try session.setActive(true, options: .notifyOthersOnDeactivation)
        #endif

        // Set up transcriber
        try await transcriptionManager.setUp()
        transcriptionManager.startResultHandling()

        // Stream audio to transcriber
        for await buffer in try audioStream() {
            try await transcriptionManager.streamAudio(buffer)
        }
    }

    private func audioStream() throws -> AsyncStream<AVAudioPCMBuffer> {
        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)

        inputNode.installTap(
            onBus: 0,
            bufferSize: 4096,
            format: format
        ) { [weak self] buffer, time in
            self?.outputContinuation?.yield(buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()

        return AsyncStream { continuation in
            outputContinuation = continuation
        }
    }
}
```
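
Wiring the two objects together might look like this (a sketch; error handling and lifecycle management are reduced to a minimum):

```swift
// Sketch: start recording + live transcription from the app layer
let manager = TranscriptionManager()
let recorder = AudioRecorder(transcriptionManager: manager)

let recordingTask = Task {
    do {
        try await recorder.startRecording()  // Streams buffers until stopped
    } catch {
        print("Recording failed: \(error)")
    }
}
// Keep `recordingTask` around so it can be cancelled when the user stops recording.
```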

### Stream Audio with Format Conversion

```swift
// Note: `converter` must be a stored property declared on TranscriptionManager
// itself (e.g. `private var converter: AVAudioConverter?`); Swift extensions
// can't add stored properties.
extension TranscriptionManager {

    func streamAudio(_ buffer: AVAudioPCMBuffer) async throws {
        guard let inputBuilder, let analyzerFormat else {
            throw TranscriptionError.notSetUp
        }

        // Convert to analyzer's required format
        let converted = try convertBuffer(buffer, to: analyzerFormat)

        // Send to analyzer
        let input = AnalyzerInput(buffer: converted)
        inputBuilder.yield(input)
    }

    private func convertBuffer(
        _ buffer: AVAudioPCMBuffer,
        to format: AVAudioFormat
    ) throws -> AVAudioPCMBuffer {
        // Lazily create the converter for this source/destination pair
        if converter == nil {
            converter = AVAudioConverter(from: buffer.format, to: format)
        }

        guard let converter else {
            throw TranscriptionError.conversionFailed
        }

        // Size the output buffer for a possible sample-rate change
        let ratio = format.sampleRate / buffer.format.sampleRate
        let capacity = AVAudioFrameCount((Double(buffer.frameLength) * ratio).rounded(.up))
        guard let outputBuffer = AVAudioPCMBuffer(
            pcmFormat: format,
            frameCapacity: max(capacity, 1)
        ) else {
            throw TranscriptionError.conversionFailed
        }

        // Use the block-based convert so sample-rate conversion works;
        // the simple convert(to:from:) fails when the rates differ.
        var conversionError: NSError?
        var providedInput = false
        let status = converter.convert(to: outputBuffer, error: &conversionError) { _, outStatus in
            if providedInput {
                outStatus.pointee = .noDataNow
                return nil
            }
            providedInput = true
            outStatus.pointee = .haveData
            return buffer
        }

        if let conversionError {
            throw conversionError
        }
        if status == .error {
            throw TranscriptionError.conversionFailed
        }
        return outputBuffer
    }
}
```

## Pattern 4 - Stopping Transcription

Properly finalize to get remaining volatile results as finalized.

```swift
func stopRecording() async {
    // Recorder side: stop the engine and end the buffer stream
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)
    outputContinuation?.finish()

    // Transcription side: finalize so remaining volatile results become final
    try? await analyzer?.finalizeAndFinishThroughEndOfInput()

    // Cancel the result-handling task started in startResultHandling()
    recognizerTask?.cancel()
}
```

**Critical**: Always call `finalizeAndFinishThroughEndOfInput()` to ensure volatile results are finalized.
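
On the `TranscriptionManager` side, a matching teardown finishes the analyzer's input stream, finalizes, and cancels the results task. A sketch against the properties from Pattern 2 (assuming the extension lives in the same file as the class so it can see the private properties):

```swift
extension TranscriptionManager {
    func finishTranscribing() async {
        // Signal that no more AnalyzerInput will arrive
        inputBuilder?.finish()

        // Convert any remaining volatile results into finalized ones
        try? await analyzer?.finalizeAndFinishThroughEndOfInput()

        // Stop consuming transcriber.results
        recognizerTask?.cancel()
        recognizerTask = nil
    }
}
```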

## Pattern 5 - Model Asset Management

### Check Supported Languages

```swift
// Languages the API supports
let supported = await SpeechTranscriber.supportedLocales

// Languages currently installed on device
let installed = await SpeechTranscriber.installedLocales
```

### Deallocate Languages

Only a limited number of languages can be allocated at once, so deallocate the ones you no longer need.

```swift
func deallocateLanguages() async {
    let allocated = await AssetInventory.allocatedLocales
    for locale in allocated {
        await AssetInventory.deallocate(locale: locale)
    }
}
```
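
In practice you usually keep the user's active language and release the rest; for example, using only the APIs shown above:

```swift
// Keep the active locale's assets, release everything else
func deallocateUnusedLanguages(keeping active: Locale) async {
    let allocated = await AssetInventory.allocatedLocales
    for locale in allocated where locale.identifier(.bcp47) != active.identifier(.bcp47) {
        await AssetInventory.deallocate(locale: locale)
    }
}
```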

## Pattern 6 - Displaying Results with Timing

Highlight text during audio playback using timing metadata.

```swift
import SwiftUI
import CoreMedia  // CMTime / CMTimeRange

struct TranscriptView: View {
    let transcript: AttributedString
    @Binding var playbackTime: CMTime

    var body: some View {
        Text(highlightedTranscript)
    }

    var highlightedTranscript: AttributedString {
        var result = transcript

        for (range, run) in transcript.runs {
            guard let timeRange = run.audioTimeRange else { continue }

            let isActive = timeRange.containsTime(playbackTime)
            if isActive {
                result[range].backgroundColor = .yellow
            }
        }

        return result
    }
}
```

## Anti-Patterns

### Don't - Forget to finalize

```swift
// BAD - volatile results lost
func stopRecording() {
    audioEngine.stop()
    // Missing finalize!
}

// GOOD - volatile results become finalized
func stopRecording() async {
    audioEngine.stop()
    try? await analyzer?.finalizeAndFinishThroughEndOfInput()
}
```

### Don't - Ignore format conversion

```swift
// BAD - format mismatch may fail silently
inputBuilder.yield(AnalyzerInput(buffer: rawBuffer))

// GOOD - convert to analyzer's format
let format = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber])
let converted = try convertBuffer(rawBuffer, to: format)
inputBuilder.yield(AnalyzerInput(buffer: converted))
```

### Don't - Skip model availability check

```swift
// BAD - may crash if model not installed
let transcriber = SpeechTranscriber(locale: locale, ...)
// Start using immediately

// GOOD - ensure model is ready
let transcriber = SpeechTranscriber(locale: locale, ...)
try await ensureModel(for: transcriber)
// Now safe to use
```

## Presets Reference

| Preset | Use Case |
|--------|----------|
| `.offlineTranscription` | File transcription, no real-time feedback needed |
| `.progressiveLiveTranscription` | Live transcription with volatile updates |

## Options Reference

### TranscriptionOptions
- Default: None (standard transcription)

### ReportingOptions
- `.volatileResults`: Enable real-time approximate results

### AttributeOptions
- `.audioTimeRange`: Include CMTimeRange for each text segment

## Platform Availability

| Platform | SpeechTranscriber | DictationTranscriber |
|----------|-------------------|---------------------|
| iOS 26+ | Yes | Yes |
| macOS Tahoe+ | Yes | Yes |
| watchOS 26+ | No | Yes |
| tvOS 26+ | TBD | TBD |

**Hardware requirements**: Varies by device. Use `supportedLocales` to check.

## Integration with Apple Intelligence

Combine with Foundation Models for summarization:

```swift
import FoundationModels

func generateTitle(for transcript: String) async throws -> String {
    let session = LanguageModelSession()
    let prompt = "Generate a short, clever title for this story: \(transcript)"
    let response = try await session.respond(to: prompt)
    return response.content
}
```

See `axiom-ios-ai` skill for Foundation Models details.
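
Putting the two together, a post-processing step might transcribe a recording and then ask the model for a title. A sketch that reuses the `transcribe(file:locale:)` function from Pattern 1:

```swift
// Sketch: transcribe a recording, then summarize it into a title
func titleForRecording(at url: URL) async throws -> String {
    let transcript = try await transcribe(file: url, locale: Locale.current)
    return try await generateTitle(for: String(transcript.characters))
}
```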

## Checklist

Before shipping speech-to-text:

- [ ] Check locale support with `supportedLocales`
- [ ] Ensure model with `AssetInventory.assetInstallationRequest`
- [ ] Handle download progress for user feedback
- [ ] Convert audio to `bestAvailableAudioFormat`
- [ ] Enable `.volatileResults` for live transcription
- [ ] Call `finalizeAndFinishThroughEndOfInput()` on stop
- [ ] Handle timing with `.audioTimeRange` if needed
- [ ] Clear volatile results when finalized result arrives
- [ ] Request microphone permission before recording

## Resources

**WWDC**: 2025-277

**Docs**: /speech, /speech/speechanalyzer, /speech/speechtranscriber

**Skills**: coreml (on-device ML), axiom-ios-ai (Foundation Models)

Overview

This skill describes implementing speech-to-text and live transcription on Apple platforms using the SpeechAnalyzer and SpeechTranscriber APIs. It focuses on the on-device models introduced in iOS 26, handling volatile vs. finalized results, audio format conversion, and AssetInventory model management. Practical patterns include file transcription, live microphone streaming, recording integration, and safe shutdown to preserve final transcripts.

How this skill works

SpeechAnalyzer is modular: you add transcription modules (e.g., SpeechTranscriber) to an analysis session and stream audio via an AsyncSequence. Results arrive asynchronously as volatile (interim) and final segments, often as AttributedString with audio timing metadata. AssetInventory handles language model installation; use SpeechAnalyzer.bestAvailableAudioFormat to convert audio before yielding AnalyzerInput.

When to use it

  • Building live transcription or real-time captioning features
  • Transcribing audio files or long-form meetings/lectures
  • When on-device privacy and low-latency transcription are required (iOS 26+)
  • Highlighting transcript text during playback using timing metadata
  • Implementing Notes/Voice Memos style features with volatile + finalized results

Best practices

  • Always check SpeechTranscriber.supportedLocales and ensure model installation via AssetInventory before starting
  • Request microphone permission and configure AVAudioSession before recording
  • Convert AVAudioPCMBuffer to SpeechAnalyzer.bestAvailableAudioFormat to avoid silent failures
  • Enable .volatileResults for live updates and call finalizeAndFinishThroughEndOfInput() when stopping to finalize remaining volatile text
  • Deallocate unused language models with AssetInventory.deallocate to free system allocations

Example use cases

  • One-shot file transcription using SpeechTranscriber with offlineTranscription preset
  • Real-time mic transcription: stream AVAudioEngine buffers into AnalyzerInput and update volatileTranscript as interim results arrive
  • Meeting recorder: collect finalized segments, use audioTimeRange attributes to drive word-highlighting during playback
  • Language model integration: summarize or generate titles from the final transcript via a foundation model
  • Device management: download and deallocate speech models based on user-selected locales

FAQ

What if the device lacks the speech model for the locale?

Check SpeechTranscriber.installedLocales and request installation with AssetInventory.assetInstallationRequest; present download progress to users.

How do I avoid losing interim results when stopping?

Always call analyzer.finalizeAndFinishThroughEndOfInput() or finalizeAndFinish(through:) to convert volatile results to finalized before finishing.

Do I need to convert audio buffers before sending?

Yes. Use SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith:) and convert AVAudioPCMBuffer via AVAudioConverter to the required format.