This skill enables ultra-fast real-time LLM inference with GROQ for chat, vision, audio, tools, and low-latency voice agents.
Run `npx playbooks add skill scientiacapital/skills --skill groq-inference-skill` to add this skill to your agents.
---
name: "groq-inference"
description: "Fast LLM inference with Groq API - chat, vision, audio STT/TTS, tool use. Use when: groq, fast inference, low latency, whisper, PlayAI TTS, Llama, vision API, tool calling, voice agents, real-time AI."
---
<objective>
Enable ultra-fast LLM inference (10-100x faster than standard providers) using GROQ API for real-time applications including chat, vision, audio (STT/TTS), tool use, and reasoning models. Critical for voice agents and low-latency AI.
</objective>
<quick_start>
**Basic chat with GROQ:**
```python
import os

from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # Best all-around
    messages=[{"role": "user", "content": prompt}],
)
```
**Model selection:**
| Use Case | Model |
|----------|-------|
| General chat | `llama-3.3-70b-versatile` |
| Vision/OCR | `meta-llama/llama-4-scout-17b-16e-instruct` |
| STT | `whisper-large-v3` (GROQ-hosted, NOT OpenAI) |
| TTS | `playai-tts` |
</quick_start>
<success_criteria>
GROQ integration is successful when:
- Correct model selected for use case (see model table)
- API key in environment variable (`GROQ_API_KEY`)
- Retry logic with tenacity for rate limits
- Streaming enabled for real-time applications
- Async patterns used for parallel queries
- NOT using OpenAI (constraint: NO OPENAI)
</success_criteria>
<core_content>
Ultra-fast LLM inference for real-time applications. GROQ delivers 10-100x faster inference than standard providers.
## Quick Reference: Model Selection
| Use Case | Model ID | Context | Notes |
|----------|----------|---------|-------|
| **General Chat** | `llama-3.3-70b-versatile` | 128K | Best all-around |
| **Fast Chat** | `llama-3.1-8b-instant` | 128K | Simple tasks, fastest |
| **Vision/OCR** | `meta-llama/llama-4-scout-17b-16e-instruct` | 128K | Up to 5 images |
| **STT** | `whisper-large-v3` | 448 | GROQ-hosted (NOT OpenAI API) |
| **TTS** | `playai-tts` | - | Fritz-PlayAI voice |
| **Reasoning** | `meta-llama/llama-4-maverick-17b-128e-instruct` | 128K | Thinking models |
| **Tool Use** | `compound-beta` | - | Built-in web search, code exec |
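If you want the table above in code, a small lookup helper works well. This is only a sketch; the `MODELS` dict and `pick_model` names are illustrative and not part of the GROQ SDK.

```python
# Illustrative lookup mirroring the model table above (not part of the groq SDK).
MODELS = {
    "chat": "llama-3.3-70b-versatile",
    "fast_chat": "llama-3.1-8b-instant",
    "vision": "meta-llama/llama-4-scout-17b-16e-instruct",
    "stt": "whisper-large-v3",
    "tts": "playai-tts",
    "reasoning": "meta-llama/llama-4-maverick-17b-128e-instruct",
    "tool_use": "compound-beta",
}

def pick_model(use_case: str) -> str:
    """Return the recommended model ID, defaulting to the general chat model."""
    return MODELS.get(use_case, MODELS["chat"])
```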
## Core Patterns
### 1. Chat Completion (Basic + Streaming)
```python
import os

from groq import Groq, AsyncGroq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

def chat(prompt: str, system: str = "You are helpful.") -> str:
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
        max_completion_tokens=1024,
    )
    return response.choices[0].message.content

# Streaming
def stream_chat(prompt: str):
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```
### 2. Vision / Multimodal
```python
import base64

def analyze_image(image_path: str, prompt: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# URL-based: just pass {"url": "https://..."} instead of base64
```
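For remote images, the same call works with a plain URL in place of the base64 data URI. A minimal sketch, assuming the image is publicly reachable:

```python
def analyze_image_url(image_url: str, prompt: str) -> str:
    # Same vision model; pass the remote URL directly instead of a data URI.
    response = client.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```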
### 3. Audio: Speech-to-Text (GROQ-Hosted Whisper)
> **Note:** Whisper on GROQ runs on **GROQ hardware** - NOT calling OpenAI's API.
> Whisper is an open-source model that GROQ hosts for fast inference.
```python
def transcribe(audio_path: str, language: str = "en") -> str:
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",  # GROQ-hosted, not OpenAI API
            language=language,
            response_format="verbose_json",  # Includes timestamps
        )
    return result.text

def translate_to_english(audio_path: str) -> str:
    with open(audio_path, "rb") as f:
        result = client.audio.translations.create(file=f, model="whisper-large-v3")
    return result.text
```
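Because `response_format="verbose_json"` includes timestamps, segment-level output can be read off the result. A sketch assuming the verbose payload exposes a `segments` list with `start`, `end`, and `text` fields (field access may differ slightly by SDK version):

```python
def transcribe_with_timestamps(audio_path: str) -> list[dict]:
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",
            response_format="verbose_json",
        )
    # Assumes verbose_json returns dict-like segments with start/end/text keys.
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"]}
        for seg in result.segments
    ]
```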
**Alternative STT Providers** (if you prefer non-Whisper options):
- **Deepgram** - Real-time streaming, lowest latency (`pip install deepgram-sdk`)
- **AssemblyAI** - High accuracy, speaker diarization (`pip install assemblyai`)
- See `voice-ai-skill` for Deepgram/AssemblyAI integration patterns
### 4. Audio: Text-to-Speech (PlayAI)
```python
def text_to_speech(text: str, output_path: str = "output.wav"):
    response = client.audio.speech.create(
        model="playai-tts",
        voice="Fritz-PlayAI",  # Also: Arista-PlayAI
        input=text,
        response_format="wav",
    )
    response.write_to_file(output_path)

# Streaming TTS
def stream_tts(text: str):
    with client.audio.speech.with_streaming_response.create(
        model="playai-tts", voice="Fritz-PlayAI", input=text, response_format="wav"
    ) as response:
        for chunk in response.iter_bytes(1024):
            yield chunk
```
**Alternative TTS Providers** (beyond GROQ's PlayAI):
- **Cartesia** - Ultra-low latency, emotional control (`pip install cartesia`)
- **ElevenLabs** - Most natural voices, voice cloning (`pip install elevenlabs`)
- **Deepgram** - Fast, cost-effective (`pip install deepgram-sdk`)
- See `voice-ai-skill` for Cartesia/ElevenLabs/Deepgram TTS integration patterns
### 5. Tool Use / Function Calling
```python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

def chat_with_tools(prompt: str):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile", messages=messages, tools=tools, tool_choice="auto"
    )
    msg = response.choices[0].message
    if msg.tool_calls:
        messages.append(msg)  # Append the assistant turn once, then one tool result per call
        for tc in msg.tool_calls:
            # execute_function: your dispatcher from tool name to local implementation
            result = execute_function(tc.function.name, json.loads(tc.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)})
        return client.chat.completions.create(
            model="llama-3.3-70b-versatile", messages=messages, tools=tools
        ).choices[0].message.content
    return msg.content
```
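The `execute_function` helper used above is not part of the SDK. A minimal dispatcher sketch (the `get_weather` body is a placeholder):

```python
def get_weather(location: str) -> dict:
    # Placeholder implementation; wire up a real weather API here.
    return {"location": location, "forecast": "sunny", "temp_c": 22}

# Map tool names from the model's tool_calls to local implementations.
FUNCTION_REGISTRY = {"get_weather": get_weather}

def execute_function(name: str, arguments: dict) -> dict:
    """Dispatch a tool call to its local implementation."""
    fn = FUNCTION_REGISTRY.get(name)
    if fn is None:
        return {"error": f"unknown tool: {name}"}
    return fn(**arguments)
```

With this in place, `chat_with_tools("What's the weather in Lisbon?")` runs the full call -> tool -> follow-up loop.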
### 6. Compound Beta (Built-in Web Search + Code Exec)
```python
def compound_query(prompt: str):
    """Built-in tools: web_search, code_execution."""
    response = client.chat.completions.create(
        model="compound-beta",
        messages=[{"role": "user", "content": prompt}],
    )
    msg = response.choices[0].message
    # Access msg.executed_tools for tool results
    return msg.content
```
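To see what Compound Beta actually ran, the executed tools can be returned alongside the answer. A sketch; the exact shape of the `executed_tools` entries may vary:

```python
def compound_query_verbose(prompt: str) -> dict:
    response = client.chat.completions.create(
        model="compound-beta",
        messages=[{"role": "user", "content": prompt}],
    )
    msg = response.choices[0].message
    # executed_tools may be absent if no built-in tool was invoked.
    tools_used = getattr(msg, "executed_tools", None) or []
    return {"answer": msg.content, "executed_tools": tools_used}
```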
### 7. Reasoning Models
```python
def reasoning_query(prompt: str, format: str = "parsed"):
    """format: 'parsed' (structured), 'raw' (visible), 'hidden' (no thinking)"""
    response = client.chat.completions.create(
        model="meta-llama/llama-4-maverick-17b-128e-instruct",
        messages=[{"role": "user", "content": prompt}],
        reasoning_format=format,
    )
    msg = response.choices[0].message
    if format == "parsed" and hasattr(msg, "reasoning"):
        return {"thinking": msg.reasoning, "answer": msg.content}
    return msg.content
```
### 8. Async Patterns
```python
import asyncio

async_client = AsyncGroq(api_key=os.environ.get("GROQ_API_KEY"))

async def async_chat(prompt: str) -> str:
    response = await async_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def parallel_queries(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*[async_chat(p) for p in prompts])
```
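Given the rate limits below, an unbounded `asyncio.gather` over many prompts can hit the requests-per-minute cap. A sketch of bounding concurrency with a semaphore (the limit of 5 is an arbitrary assumption, not a GROQ requirement):

```python
import asyncio

async def bounded_parallel_queries(prompts: list[str], max_concurrent: int = 5) -> list[str]:
    # Cap in-flight requests so bursts stay under the requests/min limit.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def _one(prompt: str) -> str:
        async with semaphore:
            return await async_chat(prompt)

    return await asyncio.gather(*[_one(p) for p in prompts])
```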
## Rate Limits
| Tier | Requests/min | Tokens/min | Tokens/day |
|------|--------------|------------|------------|
| Free | 30 | 15,000 | 500,000 |
| Paid | 100+ | 100,000+ | Unlimited |
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def reliable_chat(prompt: str) -> str:
    return chat(prompt)
```
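To avoid retrying genuine request errors, retries can be scoped to rate-limit responses. A sketch assuming the SDK raises `groq.RateLimitError` on HTTP 429:

```python
import groq
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(groq.RateLimitError),  # assumes 429s surface as RateLimitError
    stop=stop_after_attempt(5),
    wait=wait_exponential(min=1, max=30),
)
def rate_limited_chat(prompt: str) -> str:
    return chat(prompt)
```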
## Integration Notes
- **Pairs with**: voice-ai-skill (Whisper STT + PlayAI TTS), langgraph-agents-skill
- **Complements**: trading-signals-skill (fast analysis), data-analysis-skill
- **Projects**: VozLux (voice agents), FieldVault-AI (document processing)
- **Constraint**: NO OPENAI - GROQ is the fast inference layer
## Environment Variables
```bash
GROQ_API_KEY=gsk_... # Required - get from console.groq.com
# Optional multi-provider
ANTHROPIC_API_KEY= # Claude for complex reasoning
GOOGLE_API_KEY= # Gemini fallback
```
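A quick fail-fast check before constructing the client, as a minimal sketch:

```python
import os

from groq import Groq

api_key = os.environ.get("GROQ_API_KEY")
if not api_key:
    raise RuntimeError("GROQ_API_KEY is not set; get a key from console.groq.com")
client = Groq(api_key=api_key)
```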
## Reference Files
- `reference/models-catalog.md` - Complete model catalog with specs
- `reference/audio-speech.md` - Whisper STT and PlayAI TTS deep dive
- `reference/vision-multimodal.md` - Multimodal and image processing
- `reference/tool-use-patterns.md` - Function calling and Compound Beta
- `reference/reasoning-models.md` - Thinking models and reasoning_format
- `reference/cost-optimization.md` - Batch API, caching, provider routing
This skill enables ultra-fast LLM inference using the GROQ API for real-time chat, vision, speech-to-text, text-to-speech, tool use, and reasoning. It is designed for voice agents, low-latency pipelines, and parallel/streaming workloads that require 10–100x faster inference than standard providers. The skill exposes synchronous and async patterns, streaming, model selection guidance, and examples for tool calling and multimodal tasks.
The skill wraps GROQ client patterns to call specific model endpoints (chat, audio.transcriptions, audio.speech, vision-capable chat models, and compound/tool models). It supports streaming responses for low latency, async clients for parallel queries, and function-calling/tool integration for safe external actions. Audio STT/TTS use GROQ-hosted Whisper and PlayAI models (explicitly not OpenAI). It includes retry logic, rate-limit handling, and model recommendations per task.
**Is GROQ calling OpenAI models?** No. The STT/TTS and LLM models on GROQ run on GROQ hardware as GROQ-hosted models; nothing routes through OpenAI.
**Which model for low-latency chat?** Use `llama-3.1-8b-instant` for the fastest short-chat throughput; use `llama-3.3-70b-versatile` for higher quality with a modest latency tradeoff.