cohere-best-practices skill

This skill guides Cohere API usage by outlining model selection, configuration, error handling, cost optimization, and patterns for chat and RAG.

npx playbooks add skill rshvr/unofficial-cohere-best-practices --skill cohere-best-practices

Files: SKILL.md

---
name: cohere-best-practices
description: Production best practices for Cohere AI APIs. Covers model selection, API configuration, error handling, cost optimization, and architectural patterns for chat, RAG, and agentic applications.
---

# Cohere Best Practices Reference

## Official Resources

- **Docs & Cookbooks**: https://github.com/cohere-ai/cohere-developer-experience
- **API Reference**: https://docs.cohere.com/reference/about

## Model Selection Guide

| Use Case | Model | Notes |
|----------|-------|-------|
| General chat/reasoning | `command-a-03-2025` | Latest Command A model |
| RAG with citations | `command-r-plus-08-2024` | Excellent grounded generation |
| Cost-sensitive tasks | `command-r-08-2024` | Good balance of quality/cost |
| Embeddings (English) | `embed-english-v3.0` | Best for English-only |
| Embeddings (Multilingual) | `embed-multilingual-v3.0` | 100+ languages |
| Reranking | `rerank-v3.5` | Balanced quality and latency |
| Reranking (Quality) | `rerank-v4.0-pro` | Best quality, slower |
| Reranking (Speed) | `rerank-v4.0-fast` | Optimized for latency |

## API Configuration Best Practices

### Use Client V2
```python
import cohere

# Correct: Use ClientV2 for all new projects
co = cohere.ClientV2()

# Deprecated: Don't use the old client
# co = cohere.Client()  # Avoid
```
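
If you prefer explicit configuration, a minimal sketch (the current Python SDK reads the `CO_API_KEY` environment variable by default; verify the variable name for your SDK version):

```python
import os

import cohere

# Pass the key explicitly, or omit it and let the SDK fall back to
# the CO_API_KEY environment variable.
co = cohere.ClientV2(api_key=os.environ["CO_API_KEY"])
```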

### Temperature Settings
```python
# `messages` stands in for your chat history, e.g.:
messages = [{"role": "user", "content": "Draft a launch plan."}]

# For agents/tool calling - lower temperature for reliability
co.chat(model="command-a-03-2025", messages=messages, temperature=0.3)

# For creative tasks - higher temperature
co.chat(model="command-a-03-2025", messages=messages, temperature=0.7)

# For near-deterministic outputs - zero temperature (minimizes, though
# does not fully guarantee, run-to-run variation)
co.chat(model="command-a-03-2025", messages=messages, temperature=0)
```

## Embedding Best Practices

### Always Specify input_type
```python
# For documents being indexed
doc_embeddings = co.embed(
    texts=documents,
    model="embed-english-v3.0",
    input_type="search_document",  # Critical!
    embedding_types=["float"]
)

# For search queries
query_embedding = co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query",  # Must match at query time
    embedding_types=["float"]
)
```

> **Critical**: Mismatched `input_type` between indexing and querying will degrade search quality significantly.
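
For small corpora you can score similarity directly instead of standing up a vector database. A minimal numpy sketch, assuming the `doc_embeddings` and `query_embedding` responses above (the `float_` attribute name follows the current Python SDK; older versions may expose it as `float`):

```python
import numpy as np

# Stack document vectors and take the single query vector
doc_vecs = np.array(doc_embeddings.embeddings.float_)       # (n_docs, dim)
query_vec = np.array(query_embedding.embeddings.float_[0])  # (dim,)

# Cosine similarity: normalize document rows and the query, then dot
doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
scores = doc_unit @ (query_vec / np.linalg.norm(query_vec))

# Indices of the 5 most similar documents, best match first
top_k = np.argsort(-scores)[:5]
```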

## RAG Best Practices

### Two-Stage Retrieval Pattern
```python
# Stage 1: broad recall from your vector store (LangChain-style
# similarity_search shown; substitute your own store's query call)
candidates = vectorstore.similarity_search(query, k=30)

# Stage 2: Precise reranking
reranked = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[doc.page_content for doc in candidates],
    top_n=5
)

# Keep the reranked order and extract plain text for the generation step
final_docs = [candidates[r.index].page_content for r in reranked.results]
```
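
Rerank results also expose a `relevance_score`, so instead of always keeping a fixed top 5 you can drop weak matches entirely. A variant of the last step (the 0.4 cutoff is illustrative, not an official recommendation):

```python
# Variant: keep only candidates the reranker scored above a threshold
confident = [r for r in reranked.results if r.relevance_score > 0.4]
final_docs = [candidates[r.index].page_content for r in confident]
```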

### Grounded Generation with Citations
```python
response = co.chat(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": question}],
    documents=[
        {"id": f"doc_{i}", "data": {"text": doc}}
        for i, doc in enumerate(final_docs)
    ]
)

# Access citations (may be None when the model emitted none)
for citation in response.message.citations or []:
    print(f"'{citation.text}' from {citation.sources}")
```
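
Citations include character offsets into the generated text, which makes grounded spans easy to render. A hedged sketch (field names per the current V2 SDK):

```python
# Wrap each cited span in brackets, working backwards through the text
# so earlier offsets stay valid as the string grows.
text = response.message.content[0].text
for c in sorted(response.message.citations or [], key=lambda c: c.start, reverse=True):
    text = f"{text[:c.start]}[{text[c.start:c.end]}]{text[c.end:]}"
print(text)
```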

## Error Handling

```python
import time

from cohere.core import ApiError

# `co` is the ClientV2 instance created earlier
def safe_chat(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return co.chat(
                model="command-a-03-2025",
                messages=messages
            )
        except ApiError as e:
            if e.status_code == 429:  # Rate limit
                time.sleep(2 ** attempt)
                continue
            elif e.status_code >= 500:  # Server error
                time.sleep(1)
                continue
            else:
                raise
    raise RuntimeError("Max retries exceeded")
```
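
Example usage of the helper above; in high-concurrency settings, consider adding random jitter to the backoff so retries do not synchronize:

```python
messages = [{"role": "user", "content": "Summarize the attached report."}]
response = safe_chat(messages)
print(response.message.content[0].text)
```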

## Cost Optimization

1. **Use appropriate models**: Don't use Command A for simple tasks
2. **Batch embeddings**: Embed multiple texts in one call (up to 96 texts; see the batching sketch after this list)
3. **Cache embeddings**: Store computed embeddings in a vector database
4. **Use reranking wisely**: Only rerank when quality matters
5. **Stream for UX**: Streaming doesn't cost more but improves perceived latency
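
For item 2, a minimal batching sketch, assuming the ClientV2 instance `co` from earlier (the embed endpoint accepts at most 96 texts per call; the `float_` attribute name follows the current Python SDK):

```python
def embed_in_batches(texts, batch_size=96):
    """Embed a large corpus in chunks that respect the 96-text limit."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        response = co.embed(
            texts=texts[start:start + batch_size],
            model="embed-english-v3.0",
            input_type="search_document",
            embedding_types=["float"],
        )
        vectors.extend(response.embeddings.float_)
    return vectors
```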

## Production Checklist

- [ ] Use `ClientV2` for all API calls
- [ ] Set appropriate `temperature` for your use case
- [ ] Always specify `input_type` for embeddings
- [ ] Implement retry logic with exponential backoff
- [ ] Use two-stage retrieval for RAG
- [ ] Cache embeddings to reduce API calls
- [ ] Monitor token usage and costs (see the usage sketch after this checklist)
- [ ] Handle rate limits gracefully
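
For the monitoring item, chat responses expose billed token counts. A hedged sketch (check `response.usage` in your SDK version):

```python
# Log billed tokens after each call to track spend per request
usage = response.usage
if usage and usage.billed_units:
    print(
        f"input={usage.billed_units.input_tokens} "
        f"output={usage.billed_units.output_tokens} billed tokens"
    )
```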

Overview

This skill documents production best practices for using Cohere AI APIs across chat, embeddings, reranking, streaming, RAG, and agent workflows. It focuses on model selection, API configuration, error handling, cost optimization, and architectural patterns to build reliable, efficient systems. Use it as a compact reference when designing or auditing Cohere-powered applications.

How this skill works

The skill summarizes recommended models for common tasks and explains concrete API configuration choices (ClientV2, temperature ranges, embedding input_type). It describes retrieval pipelines (two-stage retrieval + reranking), grounded generation with citations, and pragmatic error-handling strategies like retries with exponential backoff. It also provides operational guidance on batching, caching, and monitoring to control cost and latency.

When to use it

  • Designing chat or agent systems that need reliable reasoning and tool use
  • Building RAG systems that require faithful, cited answers
  • Indexing and searching large document collections with embeddings
  • Optimizing costs and latency in production ML pipelines
  • Implementing robust API error handling and rate-limit recovery

Best practices

  • Always use ClientV2 for new projects to ensure compatibility and feature support
  • Pick models by task: Command A for general chat, Command R variants for grounded or cost-sensitive RAG, specialized embedding models for language needs
  • Set temperature by intent: low (0.3 or below) for reliable agent and tool use, around 0.7 for creative tasks, and 0 when outputs should be as reproducible as possible
  • Specify input_type when creating and querying embeddings; mismatches degrade search quality
  • Adopt two-stage retrieval: wide embedding recall then precise reranking before generation
  • Implement retries with exponential backoff for 429/5xx errors and fail fast on client errors

Example use cases

  • Customer support agent that uses low-temperature chat plus tool calls and grounding via RAG
  • Knowledge base search: batch-embed documents (embed-english-v3.0) with input_type=search_document and rerank results for top-K answers
  • Multilingual semantic search using embed-multilingual-v3.0 and cached embeddings in a vector DB
  • Cost-conscious analytics pipeline: use cheaper reranking models or limit rerank usage to high-value queries
  • Streaming chat UI that improves perceived latency while reusing the same model choices and safety settings (see the sketch below)
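
A minimal streaming sketch for the last use case, assuming the ClientV2 instance `co` from the reference above (event and field names follow the current V2 SDK):

```python
# Print tokens as they arrive instead of waiting for the full reply
stream = co.chat_stream(
    model="command-a-03-2025",
    messages=[{"role": "user", "content": "Draft a short status update."}],
)
for event in stream:
    if event.type == "content-delta":
        print(event.delta.message.content.text, end="", flush=True)
```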

FAQ

What model should I use for grounded generation with citations?

Use a Command R family model optimized for grounded outputs (e.g., command-r-plus-08-2024) and pass retrieved documents as the documents parameter for citation support.

How do I avoid poor search results after indexing?

Ensure the embedding input_type used at indexing matches the input_type at query time (e.g., search_document vs search_query); mismatches significantly reduce quality.