This skill enables ultra-fast real-time LLM inference with GROQ for chat, vision, audio, tools, and low-latency voice agents.
Run `npx playbooks add skill scientiacapital/skills --skill groq-inference-skill` to add this skill to your agents.
---
name: "groq-inference"
description: "Fast LLM inference with Groq API - chat, vision, audio STT/TTS, tool use. Use when: groq, fast inference, low latency, whisper, PlayAI TTS, Llama, vision API, tool calling, voice agents, real-time AI."
---
<objective>
Enable ultra-fast LLM inference (10-100x faster than standard providers) using GROQ API for real-time applications including chat, vision, audio (STT/TTS), tool use, and reasoning models. Critical for voice agents and low-latency AI.
</objective>
<quick_start>
**Basic chat with GROQ:**
```python
import os

from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # Best all-around
    messages=[{"role": "user", "content": prompt}],
)
```
**Model selection:**
| Use Case | Model |
|----------|-------|
| General chat | `llama-3.3-70b-versatile` |
| Vision/OCR | `meta-llama/llama-4-scout-17b-16e-instruct` |
| STT | `whisper-large-v3` (GROQ-hosted, NOT OpenAI) |
| TTS | `playai-tts` |
</quick_start>
<success_criteria>
GROQ integration is successful when:
- Correct model selected for use case (see model table)
- API key in environment variable (`GROQ_API_KEY`)
- Retry logic with tenacity for rate limits
- Streaming enabled for real-time applications
- Async patterns used for parallel queries
- NOT using OpenAI (constraint: NO OPENAI)
</success_criteria>
<core_content>
Ultra-fast LLM inference for real-time applications. GROQ delivers 10-100x faster inference than standard providers.
## Quick Reference: Model Selection
| Use Case | Model ID | Context | Notes |
|----------|----------|---------|-------|
| **General Chat** | `llama-3.3-70b-versatile` | 128K | Best all-around |
| **Fast Chat** | `llama-3.1-8b-instant` | 128K | Simple tasks, fastest |
| **Vision/OCR** | `meta-llama/llama-4-scout-17b-16e-instruct` | 128K | Up to 5 images |
| **STT** | `whisper-large-v3` | 448 | GROQ-hosted (NOT OpenAI API) |
| **TTS** | `playai-tts` | - | Fritz-PlayAI voice |
| **Reasoning** | `meta-llama/llama-4-maverick-17b-128e-instruct` | 128K | Thinking models |
| **Tool Use** | `compound-beta` | - | Built-in web search, code exec |
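If you want the table above in code, a small lookup helper works well. This is only a sketch; the `MODELS` dict and `pick_model` names are illustrative and not part of the GROQ SDK.

```python
# Illustrative lookup mirroring the model table above (not part of the groq SDK).
MODELS = {
    "chat": "llama-3.3-70b-versatile",
    "fast_chat": "llama-3.1-8b-instant",
    "vision": "meta-llama/llama-4-scout-17b-16e-instruct",
    "stt": "whisper-large-v3",
    "tts": "playai-tts",
    "reasoning": "meta-llama/llama-4-maverick-17b-128e-instruct",
    "tool_use": "compound-beta",
}

def pick_model(use_case: str) -> str:
    """Return the recommended model ID, defaulting to the general chat model."""
    return MODELS.get(use_case, MODELS["chat"])
```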
## Core Patterns
### 1. Chat Completion (Basic + Streaming)
```python
import os

from groq import Groq, AsyncGroq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

def chat(prompt: str, system: str = "You are helpful.") -> str:
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
        max_completion_tokens=1024,
    )
    return response.choices[0].message.content

# Streaming
def stream_chat(prompt: str):
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```
### 2. Vision / Multimodal
```python
import base64

def analyze_image(image_path: str, prompt: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# URL-based: just pass {"url": "https://..."} instead of base64
```
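For remote images, the same call works with a plain URL in place of the base64 data URI. A minimal sketch, assuming the image is publicly reachable:

```python
def analyze_image_url(image_url: str, prompt: str) -> str:
    # Same vision model; pass the remote URL directly instead of a data URI.
    response = client.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```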
### 3. Audio: Speech-to-Text (GROQ-Hosted Whisper)
> **Note:** Whisper on GROQ runs on **GROQ hardware** - NOT calling OpenAI's API.
> Whisper is an open-source model that GROQ hosts for fast inference.
```python
def transcribe(audio_path: str, language: str = "en") -> str:
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",  # GROQ-hosted, not OpenAI API
            language=language,
            response_format="verbose_json",  # Includes timestamps
        )
    return result.text

def translate_to_english(audio_path: str) -> str:
    with open(audio_path, "rb") as f:
        result = client.audio.translations.create(file=f, model="whisper-large-v3")
    return result.text
```
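Because `response_format="verbose_json"` includes timestamps, segment-level output can be read off the result. A sketch assuming the verbose payload exposes a `segments` list with `start`, `end`, and `text` fields (field access may differ slightly by SDK version):

```python
def transcribe_with_timestamps(audio_path: str) -> list[dict]:
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",
            response_format="verbose_json",
        )
    # Assumes verbose_json returns dict-like segments with start/end/text keys.
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"]}
        for seg in result.segments
    ]
```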
**Alternative STT Providers** (if you prefer non-Whisper options):
- **Deepgram** - Real-time streaming, lowest latency (`pip install deepgram-sdk`)
- **AssemblyAI** - High accuracy, speaker diarization (`pip install assemblyai`)
- See `voice-ai-skill` for Deepgram/AssemblyAI integration patterns
### 4. Audio: Text-to-Speech (PlayAI)
```python
def text_to_speech(text: str, output_path: str = "output.wav"):
    response = client.audio.speech.create(
        model="playai-tts",
        voice="Fritz-PlayAI",  # Also: Arista-PlayAI
        input=text,
        response_format="wav",
    )
    response.write_to_file(output_path)

# Streaming TTS
def stream_tts(text: str):
    with client.audio.speech.with_streaming_response.create(
        model="playai-tts", voice="Fritz-PlayAI", input=text, response_format="wav"
    ) as response:
        for chunk in response.iter_bytes(1024):
            yield chunk
```
**Alternative TTS Providers** (beyond GROQ's PlayAI):
- **Cartesia** - Ultra-low latency, emotional control (`pip install cartesia`)
- **ElevenLabs** - Most natural voices, voice cloning (`pip install elevenlabs`)
- **Deepgram** - Fast, cost-effective (`pip install deepgram-sdk`)
- See `voice-ai-skill` for Cartesia/ElevenLabs/Deepgram TTS integration patterns
### 5. Tool Use / Function Calling
```python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

def chat_with_tools(prompt: str):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile", messages=messages, tools=tools, tool_choice="auto"
    )
    msg = response.choices[0].message
    if msg.tool_calls:
        messages.append(msg)  # Append the assistant turn once, then one tool result per call
        for tc in msg.tool_calls:
            # execute_function: your dispatcher from tool name to local implementation
            result = execute_function(tc.function.name, json.loads(tc.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)})
        return client.chat.completions.create(
            model="llama-3.3-70b-versatile", messages=messages, tools=tools
        ).choices[0].message.content
    return msg.content
```
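The `execute_function` helper used above is not part of the SDK. A minimal dispatcher sketch (the `get_weather` body is a placeholder):

```python
def get_weather(location: str) -> dict:
    # Placeholder implementation; wire up a real weather API here.
    return {"location": location, "forecast": "sunny", "temp_c": 22}

# Map tool names from the model's tool_calls to local implementations.
FUNCTION_REGISTRY = {"get_weather": get_weather}

def execute_function(name: str, arguments: dict) -> dict:
    """Dispatch a tool call to its local implementation."""
    fn = FUNCTION_REGISTRY.get(name)
    if fn is None:
        return {"error": f"unknown tool: {name}"}
    return fn(**arguments)
```

With this in place, `chat_with_tools("What's the weather in Lisbon?")` runs the full call -> tool -> follow-up loop.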
### 6. Compound Beta (Built-in Web Search + Code Exec)
```python
def compound_query(prompt: str):
    """Built-in tools: web_search, code_execution."""
    response = client.chat.completions.create(
        model="compound-beta",
        messages=[{"role": "user", "content": prompt}],
    )
    msg = response.choices[0].message
    # Access msg.executed_tools for tool results
    return msg.content
```
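To see what Compound Beta actually ran, the executed tools can be returned alongside the answer. A sketch; the exact shape of the `executed_tools` entries may vary:

```python
def compound_query_verbose(prompt: str) -> dict:
    response = client.chat.completions.create(
        model="compound-beta",
        messages=[{"role": "user", "content": prompt}],
    )
    msg = response.choices[0].message
    # executed_tools may be absent if no built-in tool was invoked.
    tools_used = getattr(msg, "executed_tools", None) or []
    return {"answer": msg.content, "executed_tools": tools_used}
```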
### 7. Reasoning Models
```python
def reasoning_query(prompt: str, format: str = "parsed"):
    """format: 'parsed' (structured), 'raw' (visible), 'hidden' (no thinking)"""
    response = client.chat.completions.create(
        model="meta-llama/llama-4-maverick-17b-128e-instruct",
        messages=[{"role": "user", "content": prompt}],
        reasoning_format=format,
    )
    msg = response.choices[0].message
    if format == "parsed" and hasattr(msg, "reasoning"):
        return {"thinking": msg.reasoning, "answer": msg.content}
    return msg.content
```
### 8. Async Patterns
```python
import asyncio

async_client = AsyncGroq(api_key=os.environ.get("GROQ_API_KEY"))

async def async_chat(prompt: str) -> str:
    response = await async_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def parallel_queries(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*[async_chat(p) for p in prompts])
```
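Given the rate limits below, an unbounded `asyncio.gather` over many prompts can hit the requests-per-minute cap. A sketch of bounding concurrency with a semaphore (the limit of 5 is an arbitrary assumption, not a GROQ requirement):

```python
import asyncio

async def bounded_parallel_queries(prompts: list[str], max_concurrent: int = 5) -> list[str]:
    # Cap in-flight requests so bursts stay under the requests/min limit.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def _one(prompt: str) -> str:
        async with semaphore:
            return await async_chat(prompt)

    return await asyncio.gather(*[_one(p) for p in prompts])
```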
## Rate Limits
| Tier | Requests/min | Tokens/min | Tokens/day |
|------|--------------|------------|------------|
| Free | 30 | 15,000 | 500,000 |
| Paid | 100+ | 100,000+ | Unlimited |
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def reliable_chat(prompt: str) -> str:
    return chat(prompt)
```
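To avoid retrying genuine request errors, retries can be scoped to rate-limit responses. A sketch assuming the SDK raises `groq.RateLimitError` on HTTP 429:

```python
import groq
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(groq.RateLimitError),  # assumes 429s surface as RateLimitError
    stop=stop_after_attempt(5),
    wait=wait_exponential(min=1, max=30),
)
def rate_limited_chat(prompt: str) -> str:
    return chat(prompt)
```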
## Integration Notes
- **Pairs with**: voice-ai-skill (Whisper STT + PlayAI TTS), langgraph-agents-skill
- **Complements**: trading-signals-skill (fast analysis), data-analysis-skill
- **Projects**: VozLux (voice agents), FieldVault-AI (document processing)
- **Constraint**: NO OPENAI - GROQ is the fast inference layer
## Environment Variables
```bash
GROQ_API_KEY=gsk_... # Required - get from console.groq.com
# Optional multi-provider
ANTHROPIC_API_KEY= # Claude for complex reasoning
GOOGLE_API_KEY= # Gemini fallback
```
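A quick fail-fast check before constructing the client, as a minimal sketch:

```python
import os

from groq import Groq

api_key = os.environ.get("GROQ_API_KEY")
if not api_key:
    raise RuntimeError("GROQ_API_KEY is not set; get a key from console.groq.com")
client = Groq(api_key=api_key)
```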
## Reference Files
- `reference/models-catalog.md` - Complete model catalog with specs
- `reference/audio-speech.md` - Whisper STT and PlayAI TTS deep dive
- `reference/vision-multimodal.md` - Multimodal and image processing
- `reference/tool-use-patterns.md` - Function calling and Compound Beta
- `reference/reasoning-models.md` - Thinking models and reasoning_format
- `reference/cost-optimization.md` - Batch API, caching, provider routing
This skill enables ultra-fast LLM inference using the GROQ API for real-time chat, vision, speech-to-text, text-to-speech, tool use, and reasoning. It is designed for voice agents, low-latency pipelines, and parallel/streaming workloads that require 10–100x faster inference than standard providers. The skill exposes synchronous and async patterns, streaming, model selection guidance, and examples for tool calling and multimodal tasks.
The skill wraps GROQ client patterns to call specific model endpoints (chat, audio.transcriptions, audio.speech, vision-capable chat models, and compound/tool models). It supports streaming responses for low latency, async clients for parallel queries, and function-calling/tool integration for safe external actions. Audio STT/TTS use GROQ-hosted Whisper and PlayAI models (explicitly not OpenAI). It includes retry logic, rate-limit handling, and model recommendations per task.
**Is GROQ calling OpenAI models?** No. The STT/TTS and LLM models on GROQ run on GROQ hardware as GROQ-hosted models; nothing routes through OpenAI.
**Which model for low-latency chat?** Use `llama-3.1-8b-instant` for the fastest short-chat throughput; use `llama-3.3-70b-versatile` for higher quality with a modest latency tradeoff.