
# speech-to-text skill

`/tools/audio/speech-to-text`

This skill transcribes audio to text using Whisper models via the inference.sh CLI, enabling fast, accurate transcripts with optional timestamps.

This is most likely a fork of the speech-to-text skill from openclaw.

`npx playbooks add skill inference-sh/skills --skill speech-to-text`

Review the files below or copy the command above to add this skill to your agents.

Files (1): `SKILL.md` (3.7 KB)
---
name: speech-to-text
description: "Transcribe audio to text with Whisper models via inference.sh CLI. Models: Fast Whisper Large V3, Whisper V3 Large. Capabilities: transcription, translation, multi-language, timestamps. Use for: meeting transcription, subtitles, podcast transcripts, voice notes. Triggers: speech to text, transcription, whisper, audio to text, transcribe audio, voice to text, stt, automatic transcription, subtitles generation, transcribe meeting, audio transcription, whisper ai"
allowed-tools: Bash(infsh *)
---

# Speech-to-Text

Transcribe audio to text via [inference.sh](https://inference.sh) CLI.

![Speech-to-Text](https://cloud.inference.sh/u/4mg21r6ta37mpaz6ktzwtt8krr/01jz025e88nkvw55at1rqtj5t8.png)

## Quick Start

> Requires inference.sh CLI (`infsh`). Get installation instructions: `npx skills add inference-sh/skills@agent-tools`

```bash
infsh login

infsh app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://audio.mp3"}'
```


## Available Models

| Model | App ID | Best For |
|-------|--------|----------|
| Fast Whisper Large V3 | `infsh/fast-whisper-large-v3` | Fast transcription |
| Whisper V3 Large | `infsh/whisper-v3-large` | Highest accuracy |

## Examples

### Basic Transcription

```bash
infsh app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://meeting.mp3"}'
```

### With Timestamps

```bash
infsh app sample infsh/fast-whisper-large-v3 --save input.json

# {
#   "audio_url": "https://podcast.mp3",
#   "timestamps": true
# }

infsh app run infsh/fast-whisper-large-v3 --input input.json
```

### Translation (to English)

```bash
infsh app run infsh/whisper-v3-large --input '{
  "audio_url": "https://french-audio.mp3",
  "task": "translate"
}'
```

### From Video

```bash
# Extract audio from video first
infsh app run infsh/video-audio-extractor --input '{"video_url": "https://video.mp4"}' > audio.json

# Transcribe the extracted audio
infsh app run infsh/fast-whisper-large-v3 --input '{"audio_url": "<audio-url>"}'
```
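The two steps above need glue to pass the extracted audio URL along. A minimal sketch, assuming the extractor writes a JSON object with an `audio_url` field (the field name is an assumption, not confirmed by the extractor's docs):

```shell
# Stand-in for the extractor's output; in practice this file comes from
# `infsh app run infsh/video-audio-extractor ... > audio.json`
cat > audio.json <<'EOF'
{"audio_url": "https://cdn.example.com/extracted.mp3"}
EOF

# Pull the URL out with python3's stdlib json module, then reuse it
AUDIO_URL=$(python3 -c 'import json; print(json.load(open("audio.json"))["audio_url"])')
echo "$AUDIO_URL"

# The transcription step would then be:
#   infsh app run infsh/fast-whisper-large-v3 --input "{\"audio_url\": \"$AUDIO_URL\"}"
```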

## Workflow: Video Subtitles

```bash
# 1. Transcribe video audio
infsh app run infsh/fast-whisper-large-v3 --input '{
  "audio_url": "https://video.mp4",
  "timestamps": true
}' > transcript.json

# 2. Use transcript for captions
infsh app run infsh/caption-videos --input '{
  "video_url": "https://video.mp4",
  "captions": "<transcript-from-step-1>"
}'
```
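Step 2 expects captions derived from the timestamped transcript. Here is a hedged sketch of converting a `segments` array to SRT, assuming each segment carries `start`/`end` times in seconds and a `text` field (the actual response shape may differ):

```shell
# Sample of the assumed transcript shape; in practice transcript.json
# comes from step 1 above
cat > transcript.json <<'EOF'
{"segments": [
  {"start": 0.0, "end": 2.5, "text": "Welcome to the show."},
  {"start": 2.5, "end": 5.0, "text": "Today: transcription with Whisper."}
]}
EOF

# Convert the segments to SRT cues with python3 (stdlib only)
python3 - <<'PY' > captions.srt
import json

def ts(seconds):
    # SRT timestamps look like 00:00:02,500
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    ms = int(round((s - int(s)) * 1000))
    return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{ms:03d}"

for i, seg in enumerate(json.load(open("transcript.json"))["segments"], 1):
    print(i)
    print(f"{ts(seg['start'])} --> {ts(seg['end'])}")
    print(seg["text"])
    print()
PY

head -2 captions.srt
```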

## Supported Languages

Whisper supports 99+ languages, including English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Russian, and many more.

## Use Cases

- **Meetings**: Transcribe recordings
- **Podcasts**: Generate transcripts
- **Subtitles**: Create captions for videos
- **Voice Notes**: Convert to searchable text
- **Interviews**: Transcription for research
- **Accessibility**: Make audio content accessible

## Output Format

Returns JSON with:
- `text`: Full transcription
- `segments`: Timestamped segments (if requested)
- `language`: Detected language

## Related Skills

```bash
# Full platform skill (all 150+ apps)
npx skills add inference-sh/skills@agent-tools

# Text-to-speech (reverse direction)
npx skills add inference-sh/skills@text-to-speech

# Video generation (add captions)
npx skills add inference-sh/skills@ai-video-generation

# AI avatars (lipsync with transcripts)
npx skills add inference-sh/skills@ai-avatar-video
```

Browse all audio apps: `infsh app list --category audio`

## Documentation

- [Running Apps](https://inference.sh/docs/apps/running) - How to run apps via CLI
- [Audio Transcription Example](https://inference.sh/docs/examples/audio-transcription) - Complete transcription guide
- [Apps Overview](https://inference.sh/docs/apps/overview) - Understanding the app ecosystem

## Overview

This skill transcribes audio to text using Whisper-family models via the inference.sh CLI. It supports fast or high-accuracy models, multi-language transcription, optional timestamps, and on-the-fly translation to English. Use it to convert meetings, podcasts, videos, and voice notes into searchable, editable text. Outputs are returned as JSON with the full text, optional segments, and the detected language.

## How this skill works

The skill invokes an inference.sh app that accepts an audio or video URL and runs a chosen Whisper model (Fast Whisper Large V3 or Whisper V3 Large). You can enable timestamps for segmented output or set `"task": "translate"` to produce English translations. The CLI returns JSON containing the text, segments (when requested), and the detected language, which you can pipe into downstream tools for captions or analysis.

## When to use it

- Transcribing meeting recordings to searchable text
- Generating podcast transcripts for publishing or SEO
- Creating subtitles or captions from video audio
- Translating spoken content to English
- Converting voice notes and interviews into text

## Best practices

- Choose Fast Whisper Large V3 for speed and Whisper V3 Large for the highest accuracy
- Provide direct audio file URLs, or extract audio from video first, for best results
- Enable timestamps when you need segment-level captions or time-aligned editing
- Use a sample input JSON file for multi-field requests (timestamps, task) to avoid CLI quoting issues
- Validate audio quality and prefer higher-bitrate sources for cleaner transcripts

## Example use cases

- Run a meeting recording through Fast Whisper Large V3 to create minutes and action items
- Transcribe a podcast episode and publish the transcript alongside show notes
- Extract audio from a recorded lecture, generate timestamped captions, and upload an SRT file
- Translate a non-English interview to English text using Whisper V3 Large
- Convert a collection of voice memos into searchable text for research

## FAQ

### Which model should I pick for real-time needs?

Use Fast Whisper Large V3 for faster responses; choose Whisper V3 Large when accuracy matters more than latency.

### How do I get timestamped segments?

Include `"timestamps": true` in the input JSON and run the app; the response will include a `segments` array with start/end times.

### Can the skill handle videos directly?

Yes. You can provide a video URL directly to the transcription app, or first extract audio with the platform's video-audio-extractor app for cleaner results.