
macos-local-voice skill


This skill enables fully local macOS speech-to-text and text-to-speech using native Apple tools, delivering offline privacy and quick voice interactions.

npx playbooks add skill openclaw/skills --skill macos-local-voice


SKILL.md
---
name: macos-local-voice
description: "Local STT and TTS on macOS using native Apple capabilities. Speech-to-text via yap (Apple Speech.framework), text-to-speech via say + ffmpeg. Fully offline, no API keys required. Includes voice quality detection and smart voice selection."
metadata: { "openclaw": { "emoji": "🎙️", "requires": { "bins": ["yap", "say", "osascript"], "os": ["darwin"] } } }
---

# macOS Local Voice

Fully local speech-to-text (STT) and text-to-speech (TTS) on macOS. No API keys, no network, no cloud. All processing happens on-device.

## Requirements

- macOS (Apple Silicon recommended, Intel works too)
- `yap` CLI in PATH — install via `brew install finnvoor/tools/yap`
- `ffmpeg` in PATH (optional, needed for ogg/opus output) — `brew install ffmpeg`
- `say` and `osascript` are built into macOS (no install needed)

## Speech-to-Text (STT)

Transcribe an audio file to text using Apple's on-device speech recognition.

```bash
node {baseDir}/scripts/stt.mjs <audio_file> [locale]
```

- `audio_file`: path to audio (ogg, m4a, mp3, wav, etc.)
- `locale`: optional, e.g. `zh_CN`, `en_US`, `ja_JP`. If omitted, uses system default.
- Outputs transcribed text to stdout.
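The CLI contract above (required audio file, optional locale, a `--locales` flag) can be sketched as a small argument parser. The function name and error message are illustrative, not taken from the actual `stt.mjs`:

```javascript
// Illustrative argument handling for a wrapper like stt.mjs:
// first positional arg is the audio file, second is an optional locale.
function parseSttArgs(argv) {
  if (argv[0] === "--locales") return { listLocales: true };
  const [audioFile, locale] = argv;
  if (!audioFile) throw new Error("usage: stt.mjs <audio_file> [locale]");
  return { audioFile, locale }; // locale undefined → system default
}
```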

### Supported STT locales

Use `node {baseDir}/scripts/stt.mjs --locales` to list all supported locales.

Key locales: `en_US`, `en_GB`, `zh_CN`, `zh_TW`, `zh_HK`, `ja_JP`, `ko_KR`, `fr_FR`, `de_DE`, `es_ES`, `pt_BR`, `ru_RU`, `vi_VN`, `th_TH`.

### Language detection tips

- If the user's recent messages are in Chinese → use `zh_CN`
- If in English → use `en_US`
- If mixed or unclear → try without locale (system default)
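These tips amount to a simple character-class heuristic. A minimal sketch (the function name and thresholds are illustrative, not part of this skill's scripts):

```javascript
// Hypothetical helper: guess an STT locale from recent message text.
// Returns undefined for mixed or unclear input, so the caller omits
// the locale and stt.mjs falls back to the system default.
function guessLocale(text) {
  const han = (text.match(/[\u4e00-\u9fff]/g) || []).length;   // CJK ideographs
  const latin = (text.match(/[a-zA-Z]/g) || []).length;
  const total = han + latin;
  if (total === 0) return undefined;
  if (han / total > 0.5) return "zh_CN";
  if (latin / total > 0.9) return "en_US";
  return undefined; // mixed or unclear
}
```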

## Text-to-Speech (TTS)

Convert text to an audio file using macOS native TTS.

```bash
node {baseDir}/scripts/tts.mjs "<text>" [voice_name] [output_path]
```

- `text`: the text to speak
- `voice_name`: optional, e.g. `Yue (Premium)`, `Tingting`, `Ava (Premium)`. If omitted, auto-selects the best available voice based on text language.
- `output_path`: optional, defaults to a timestamped file in `~/.openclaw/media/outbound/`
- Outputs the generated audio file path to stdout.
- If `ffmpeg` is available, output is ogg/opus (ideal for messaging platforms). Otherwise aiff.
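Assuming the behavior above, the default output path and format choice might look like this. The function names, timestamp format, and the 32k opus bitrate are assumptions for illustration, not values taken from `tts.mjs`:

```javascript
import os from "node:os";
import path from "node:path";

// Pick the container based on ffmpeg availability: ogg/opus for
// messaging platforms, aiff (say's native output) otherwise.
function defaultOutputPath(hasFfmpeg, now = new Date()) {
  const ext = hasFfmpeg ? "ogg" : "aiff";
  const stamp = now.toISOString().replace(/[:.]/g, "-");
  return path.join(
    os.homedir(), ".openclaw", "media", "outbound", `tts-${stamp}.${ext}`
  );
}

// Arguments for the aiff → ogg/opus conversion, as you would pass to
// spawn("ffmpeg", ...). 32k is a typical voice bitrate, not a value
// documented by this skill.
function ffmpegArgs(aiffPath, oggPath) {
  return ["-y", "-i", aiffPath, "-c:a", "libopus", "-b:a", "32k", oggPath];
}
```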

### Sending as voice note

After generating the audio file, send it using the `message` tool:

```
message action=send media=<path_from_tts.mjs> asVoice=true
```

## Voice Management

List available voices, check readiness, or find the best voice for a language:

```bash
node {baseDir}/scripts/voices.mjs list [locale]   # List voices, optionally filter by locale
node {baseDir}/scripts/voices.mjs check "<name>"  # Check if a specific voice is downloaded and ready
node {baseDir}/scripts/voices.mjs best <locale>   # Get the highest quality voice for a locale
```

### Quality levels

- 1 = compact (low quality, always available)
- 2 = enhanced (mid quality, may need download)
- 3 = premium (highest quality, needs download from System Settings)
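Premium and enhanced voices advertise their tier in the display name (e.g. `Ava (Premium)`), so one plausible way to derive the level is from the name suffix. This is an assumption about how detection could work, not necessarily what `voices.mjs` does:

```javascript
// Map a voice display name to the quality levels above, assuming
// enhanced/premium voices carry an "(Enhanced)"/"(Premium)" suffix.
function qualityFromName(name) {
  if (/\(Premium\)/i.test(name)) return 3;
  if (/\(Enhanced\)/i.test(name)) return 2;
  return 1; // compact: always available
}
```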

### If a voice is not available

Tell the user: "Voice X is not downloaded. Go to **System Settings → Accessibility → Spoken Content → System Voice → Manage Voices** to download it."

## Notes

- The `say` command silently falls back to a default voice if the requested voice is not available (exit code 0, no error). **Always** use `voices.mjs check` before calling `tts.mjs` with a specific voice name.
- Premium voices (e.g. `Yue (Premium)`, `Ava (Premium)`) sound significantly better but must be manually downloaded by the user.
- Siri voices are not accessible via the speech synthesis API.
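To honor the check-before-speak rule above, voice selection over an already-parsed voice list might look like the following sketch. The record shape `{ name, locale, quality, downloaded }` is assumed for illustration, not the actual `voices.mjs` data model:

```javascript
// Return the highest-quality *downloaded* voice for a locale, or null,
// so the caller never hands `say` a voice it would silently replace.
function pickBestVoice(voices, locale) {
  return (
    voices
      .filter((v) => v.locale === locale && v.downloaded)
      .sort((a, b) => b.quality - a.quality)[0] ?? null
  );
}
```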

Overview

This skill provides fully local speech-to-text (STT) and text-to-speech (TTS) on macOS using native Apple capabilities. It requires no network access or API keys and runs entirely on-device using yap (Speech.framework), say, and ffmpeg when available. It also includes voice quality detection and smart voice selection to pick the best available voice for a language.

How this skill works

STT runs via the yap CLI, which wraps Apple's Speech.framework to transcribe audio files in many locales. TTS uses the built-in say command to synthesize speech; when ffmpeg is installed, the output is converted to ogg/opus for messaging compatibility, otherwise aiff is produced. The skill inspects installed voices and their quality levels (compact, enhanced, premium), and checks readiness before selecting a voice or warning the user that it must be downloaded.

When to use it

  • Transcribing recorded audio locally without sending data to the cloud
  • Generating voice notes or synthesized replies for messaging platforms while keeping audio on-device
  • Selecting the highest-quality available voice for a specific language or locale
  • Producing ogg/opus output for chat apps when ffmpeg is installed
  • Checking whether a premium voice is downloaded before attempting TTS

Best practices

  • Run voices.mjs check before calling tts.mjs with a specific voice name to avoid silent fallbacks
  • Install yap and ffmpeg via Homebrew for best compatibility and output formats
  • Prefer premium voices for higher quality but instruct users to download them via System Settings if missing
  • Specify locale for STT when you know the language to improve accuracy; omit locale to use system default for mixed-language input
  • Store generated audio in ~/.openclaw/media/outbound/ or pass an explicit output path for easier management

Example use cases

  • Transcribe meeting recordings locally with node scripts to generate searchable text
  • Auto-generate voice notes from chat replies and send them as asVoice messages (ogg/opus)
  • Deploy an offline voice assistant on a macOS machine with language-aware voice selection
  • Batch convert text responses to high-quality audio using premium voices you have downloaded
  • Quickly check which voices are installed and identify the best voice for zh_CN or en_US

FAQ

Do I need an internet connection or API keys?

No. All transcription and synthesis run on-device using macOS frameworks; no network access or API keys are required.

Why did say produce a different voice than I requested?

say silently falls back to an available voice if the requested voice is not downloaded. Always run voices.mjs check "Voice Name" first and prompt the user to download premium voices in System Settings if needed.