This skill guides Cohere API usage by outlining model selection, configuration, error handling, cost optimization, and patterns for chat and RAG.
To add this skill to your agents, run `npx playbooks add skill rshvr/unofficial-cohere-best-practices --skill cohere-best-practices`.
---
name: cohere-best-practices
description: Production best practices for Cohere AI APIs. Covers model selection, API configuration, error handling, cost optimization, and architectural patterns for chat, RAG, and agentic applications.
---
# Cohere Best Practices Reference
## Official Resources
- **Docs & Cookbooks**: https://github.com/cohere-ai/cohere-developer-experience
- **API Reference**: https://docs.cohere.com/reference/about
## Model Selection Guide
| Use Case | Model | Notes |
|----------|-------|-------|
| General chat/reasoning | `command-a-03-2025` | Latest Command A model |
| RAG with citations | `command-r-plus-08-2024` | Excellent grounded generation |
| Cost-sensitive tasks | `command-r-08-2024` | Good balance of quality/cost |
| Embeddings (English) | `embed-english-v3.0` | Best for English-only |
| Embeddings (Multilingual) | `embed-multilingual-v3.0` | 100+ languages |
| Reranking | `rerank-v3.5` | Good balance of quality and latency |
| Reranking (Quality) | `rerank-v4.0-pro` | Best quality, slower |
| Reranking (Speed) | `rerank-v4.0-fast` | Optimized for latency |
## API Configuration Best Practices
### Use Client V2
```python
import cohere
# Correct: use ClientV2 for all new projects.
# With no arguments, the client reads the API key from the CO_API_KEY environment variable.
co = cohere.ClientV2()
# Deprecated: don't use the old client
# co = cohere.Client()  # Avoid
```
### Temperature Settings
```python
# `messages` is your list of chat messages, e.g. [{"role": "user", "content": "..."}]
# For agents/tool calling: lower temperature for reliability
co.chat(model="command-a-03-2025", messages=messages, temperature=0.3)
# For creative tasks: higher temperature
co.chat(model="command-a-03-2025", messages=messages, temperature=0.7)
# For deterministic outputs: zero temperature
co.chat(model="command-a-03-2025", messages=messages, temperature=0)
```
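One way to keep these settings consistent across a codebase is to centralize them. The sketch below is illustrative only: the task labels and the `chat_kwargs` helper are hypothetical names, and the values simply mirror the temperatures above.

```python
# Hypothetical task labels mapped to the temperatures suggested above.
TEMPERATURE_PRESETS = {
    "agent": 0.3,          # tool calling: favor reliability
    "creative": 0.7,       # creative tasks: favor variety
    "deterministic": 0.0,  # repeatable outputs
}


def chat_kwargs(task, messages, model="command-a-03-2025"):
    """Build keyword arguments for co.chat() from a task label."""
    if task not in TEMPERATURE_PRESETS:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(TEMPERATURE_PRESETS)}"
        )
    return {
        "model": model,
        "messages": messages,
        "temperature": TEMPERATURE_PRESETS[task],
    }


# Usage, assuming `co` is a cohere.ClientV2 instance:
# response = co.chat(**chat_kwargs("agent", [{"role": "user", "content": "Hi"}]))
```

Rejecting unknown labels up front avoids silently falling back to the API's default temperature.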
## Embedding Best Practices
### Always Specify input_type
```python
# For documents being indexed
doc_embeddings = co.embed(
    texts=documents,
    model="embed-english-v3.0",
    input_type="search_document",  # Critical!
    embedding_types=["float"]
)
# For search queries
query_embedding = co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query",  # Pairs with search_document used at index time
    embedding_types=["float"]
)
```
> **Critical**: Use `search_document` when indexing and `search_query` when querying. Using the wrong `input_type` on either side will degrade search quality significantly.
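A simple guard against using the wrong `input_type` is to hide it behind two purpose-specific helpers, so callers never pass it by hand. This is a minimal sketch: `embed_documents` and `embed_query` are hypothetical names, `co` is assumed to be a `cohere.ClientV2` instance, and reading the vectors from `response.embeddings.float_` is an assumption about the SDK's response object that you should verify against your installed version.

```python
EMBED_MODEL = "embed-english-v3.0"


def embed_documents(co, texts, model=EMBED_MODEL):
    """Embed texts for indexing; always uses input_type='search_document'."""
    response = co.embed(
        texts=texts,
        model=model,
        input_type="search_document",
        embedding_types=["float"],
    )
    # Assumption: float vectors are exposed as `embeddings.float_`.
    return response.embeddings.float_


def embed_query(co, query, model=EMBED_MODEL):
    """Embed a single search query; always uses input_type='search_query'."""
    response = co.embed(
        texts=[query],
        model=model,
        input_type="search_query",
        embedding_types=["float"],
    )
    # One input text, so return the single vector rather than a list of one.
    return response.embeddings.float_[0]
```

Passing the client as an argument also makes the helpers easy to exercise with a stub in tests.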
## RAG Best Practices
### Two-Stage Retrieval Pattern
```python
# Stage 1: Broad retrieval with embeddings
candidates = vectorstore.similarity_search(query, k=30)
# Stage 2: Precise reranking
reranked = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[doc.page_content for doc in candidates],
    top_n=5
)
# Use reranked results for generation
final_docs = [candidates[r.index] for r in reranked.results]
```
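Reranking also returns a relevance score per result, which can be used to drop weak matches before generation. The sketch below assumes each item in `reranked.results` carries `index` and `relevance_score`; the helper name `select_reranked` and the `min_score` cutoff are hypothetical, and a suitable threshold has to be tuned on your own data.

```python
def select_reranked(candidates, results, min_score=0.0):
    """Map rerank results back to the original candidates.

    `results` is expected to look like `reranked.results`: items with an
    `index` into `candidates` and a `relevance_score`.
    Returns (candidate, score) pairs in the order given, dropping any
    result whose score is below `min_score`.
    """
    selected = []
    for r in results:
        if r.relevance_score >= min_score:
            selected.append((candidates[r.index], r.relevance_score))
    return selected


# Usage with the pipeline above (0.3 is an arbitrary illustrative cutoff):
# final_docs = [doc for doc, _ in select_reranked(candidates, reranked.results, 0.3)]
```

Filtering by score can leave fewer than `top_n` documents, or none, so the generation step should handle an empty list.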
### Grounded Generation with Citations
```python
response = co.chat(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": question}],
    documents=[
        # final_docs holds retrieved document objects; pass their text, not the object
        {"id": f"doc_{i}", "data": {"text": doc.page_content}}
        for i, doc in enumerate(final_docs)
    ]
)
# Access citations (may be empty if the answer cites nothing)
for citation in response.message.citations or []:
    print(f"'{citation.text}' from {citation.sources}")
```
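To show citations to end users, the cited spans can be marked inline in the answer text. This is a sketch under stated assumptions: each citation is assumed to expose `start` and `end` character offsets into the answer text plus a `sources` list whose items have an `id`, and `annotate_with_citations` is a hypothetical helper name.

```python
def annotate_with_citations(text, citations):
    """Insert [source_id, ...] markers after each cited span of `text`.

    Assumes each citation has `start`/`end` character offsets into `text`
    and a `sources` list whose items have an `id`.
    """
    annotated = text
    # Work right to left so earlier offsets stay valid after each insertion.
    for c in sorted(citations, key=lambda c: c.end, reverse=True):
        marker = "[" + ", ".join(s.id for s in c.sources) + "]"
        annotated = annotated[:c.end] + marker + annotated[c.end:]
    return annotated


# Usage, assuming the answer text lives at response.message.content[0].text:
# print(annotate_with_citations(response.message.content[0].text,
#                               response.message.citations or []))
```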
## Error Handling
```python
import time

from cohere.core import ApiError

def safe_chat(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return co.chat(
                model="command-a-03-2025",
                messages=messages
            )
        except ApiError as e:
            status = e.status_code or 0  # status_code may be None
            if status == 429:  # Rate limit: exponential backoff
                time.sleep(2 ** attempt)
                continue
            elif status >= 500:  # Server error: short fixed pause
                time.sleep(1)
                continue
            else:
                raise
    raise RuntimeError("Max retries exceeded")
```
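When many workers hit a rate limit at the same moment, fixed backoff intervals make them all retry together. A common refinement is to add random jitter and cap the delay. The sketch below is a generic variant, not part of the Cohere SDK: `retry_with_backoff` and `is_retryable_status` are hypothetical helpers, and the `sleep` parameter exists only so the delay can be stubbed in tests.

```python
import random
import time


def is_retryable_status(exc):
    """Treat HTTP 429 and 5xx (read from `exc.status_code`) as retryable."""
    status = getattr(exc, "status_code", None)
    return status is not None and (status == 429 or status >= 500)


def retry_with_backoff(call, is_retryable=is_retryable_status, max_retries=3,
                       base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run `call()`, retrying retryable failures with jittered exponential backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            # Re-raise non-retryable errors, and the final failed attempt, unchanged.
            if not is_retryable(exc) or attempt == max_retries - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter: wait between 0 and `delay`


# Usage, assuming `co` is a cohere.ClientV2 instance:
# response = retry_with_backoff(
#     lambda: co.chat(model="command-a-03-2025", messages=messages)
# )
```

Unlike raising a generic "max retries" error, this version surfaces the last underlying exception, which keeps the status code available to the caller.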
## Cost Optimization
1. **Use appropriate models**: Reserve Command A for tasks that need it; use a cheaper model such as `command-r-08-2024` for simple tasks
2. **Batch embeddings**: Embed multiple texts in one call (up to 96 texts)
3. **Cache embeddings**: Store computed embeddings in a vector database
4. **Use reranking wisely**: Only rerank when quality matters
5. **Stream for UX**: Streaming doesn't cost more but improves perceived latency
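Points 2 and 3 combine naturally: embed only texts that are not already cached, and send those in batches of at most 96. The sketch below uses a plain dict as the cache and a hypothetical `embed_batch` callable (for example, a thin wrapper around `co.embed` that returns one vector per text); in production the cache would typically be a vector database or key-value store.

```python
import hashlib

MAX_BATCH = 96  # per-call text limit noted above


def batched(items, size=MAX_BATCH):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_with_cache(texts, embed_batch, cache):
    """Return one embedding per text, calling `embed_batch` only for cache misses.

    `embed_batch(list_of_texts)` must return one vector per input text, in order.
    `cache` is any dict-like mapping from a text hash to its vector.
    """
    def key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    # Deduplicate the misses while preserving their first-seen order.
    missing = list(dict.fromkeys(t for t in texts if key(t) not in cache))
    for batch in batched(missing):
        for text, vector in zip(batch, embed_batch(batch)):
            cache[key(text)] = vector
    return [cache[key(t)] for t in texts]
```

If the same cache is shared across models or `input_type` values, include both in the cache key, since the same text yields different vectors under each.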
## Production Checklist
- [ ] Use `ClientV2` for all API calls
- [ ] Set appropriate `temperature` for your use case
- [ ] Always specify `input_type` for embeddings
- [ ] Implement retry logic with exponential backoff
- [ ] Use two-stage retrieval for RAG
- [ ] Cache embeddings to reduce API calls
- [ ] Monitor token usage and costs
- [ ] Handle rate limits gracefully
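For the token-monitoring item, a small in-process accumulator is often enough to start with. This sketch assumes chat responses expose billed token counts at `response.usage.billed_units.input_tokens` and `output_tokens`; `UsageTracker` is a hypothetical helper, and missing fields are counted as zero rather than raising.

```python
class UsageTracker:
    """Accumulate billed token counts across chat calls."""

    def __init__(self):
        self.calls = 0
        self.input_tokens = 0
        self.output_tokens = 0

    def record(self, response):
        """Add one response's billed tokens to the totals and return the response."""
        # Assumption: usage lives at response.usage.billed_units; absent fields count as 0.
        usage = getattr(response, "usage", None)
        billed = getattr(usage, "billed_units", None)
        self.calls += 1
        self.input_tokens += int(getattr(billed, "input_tokens", 0) or 0)
        self.output_tokens += int(getattr(billed, "output_tokens", 0) or 0)
        return response

    def summary(self):
        """Return the running totals as a plain dict, e.g. for logging."""
        return {
            "calls": self.calls,
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
        }


# Usage, assuming `co` is a cohere.ClientV2 instance:
# tracker = UsageTracker()
# response = tracker.record(co.chat(model="command-a-03-2025", messages=messages))
# print(tracker.summary())
```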
This skill documents production best practices for using Cohere AI APIs across chat, embeddings, reranking, streaming, RAG, and agent workflows. It focuses on model selection, API configuration, error handling, cost optimization, and architectural patterns to build reliable, efficient systems. Use it as a compact reference when designing or auditing Cohere-powered applications.
The skill summarizes recommended models for common tasks and explains concrete API configuration choices (ClientV2, temperature ranges, embedding input_type). It describes retrieval pipelines (two-stage retrieval + reranking), grounded generation with citations, and pragmatic error-handling strategies like retries with exponential backoff. It also provides operational guidance on batching, caching, and monitoring to control cost and latency.
**What model should I use for grounded generation with citations?**
Use a Command R family model optimized for grounded outputs (e.g., `command-r-plus-08-2024`) and pass the retrieved documents in the `documents` parameter to get citations.
**How do I avoid poor search results after indexing?**
Embed indexed documents with `input_type="search_document"` and queries with `input_type="search_query"`. Using the wrong `input_type` on either side significantly reduces retrieval quality.