This skill profiles and benchmarks qwen3-tts-rs inference inside a CUDA Docker container, producing Chrome traces, timing breakdowns, and bottleneck insights. Install it with `npx playbooks add skill trevors/dot-claude --skill qwen3-tts-profile`.
---
name: qwen3-tts-profile
description: |
  Profile and benchmark qwen3-tts-rs inference. Runs e2e_bench with Chrome
  tracing, flamegraph, or Nsight Systems inside a CUDA Docker container.
  Triggers on: "profile", "benchmark", "run profiling", "chrome trace",
  "flamegraph", "nsys", "perf trace", "how fast", "what's slow",
  "performance", "bottleneck"
---
# qwen3-tts-rs Profiling & Benchmarking
Run performance profiling and benchmarks for the qwen3-tts Rust TTS engine.
## Prerequisites
- Docker with `--gpus all` support
- `qwen3-tts:latest` Docker image (has Rust toolchain + CUDA)
- Model weights in `test_data/models/` (1.7B-CustomVoice is the default)
- `tokenizer.json` must be in the model directory
## Docker Execution Pattern
The CUDA toolchain lives inside the Docker container. All cargo commands must
run there. The workspace is bind-mounted at `/workspace`:
```bash
docker run --rm --gpus all --entrypoint /bin/bash \
-v "$(pwd):/workspace" -w /workspace \
qwen3-tts:latest \
-c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && <COMMAND>'
```
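If a run fails immediately, first confirm the container can actually see the GPU. A minimal sanity check, assuming the image includes `nvidia-smi` (standard in CUDA base images):

```bash
# Verify the GPU is visible inside the container before profiling.
docker run --rm --gpus all --entrypoint /bin/bash \
  qwen3-tts:latest \
  -c 'nvidia-smi'
```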
## Profiling Modes
### 1. Chrome Trace (default — best for span hierarchy)
Produces `trace.json` for viewing in `chrome://tracing` or <https://ui.perfetto.dev>.
```bash
docker run --rm --gpus all --entrypoint /bin/bash \
-v "$(pwd):/workspace" -w /workspace \
qwen3-tts:latest \
-c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
cargo run --profile=profiling --features=profiling,cuda,cli --bin e2e_bench -- \
--model-dir test_data/models/1.7B-CustomVoice --iterations 1 --warmup 1'
```
Output: `trace.json` (~12MB for 3 sentences). Contains spans:
- `generate_frames` — full generation loop
- `code_predictor` / `code_predictor_inner` — per-frame acoustic code generation
- `talker_step` — per-frame transformer forward pass
- `sampling` / `top_k` / `top_p` — per-frame token sampling
- `gpu_sync` trace events — marks every `to_vec1()` GPU→CPU sync
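For a quick per-span summary without opening a UI, `jq` can aggregate the trace. This is a sketch that assumes Chrome Trace "complete" events (`"ph": "X"` with microsecond `dur` fields); if the trace writer emits begin/end (`B`/`E`) pairs instead, they need to be matched up first:

```bash
# Total time and call count per span name, sorted by total time.
# Handles both a {"traceEvents": [...]} wrapper and a bare event array.
jq -r '[(.traceEvents? // .)[] | select(.ph == "X")]
  | group_by(.name)
  | map({name: .[0].name, count: length, ms: (map(.dur) | add / 1000)})
  | sort_by(-.ms)[]
  | "\(.name)\t\(.count)\t\(.ms | round) ms"' trace.json
```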
### 2. Per-Stage Timing (no profiling feature needed)
The e2e_bench binary reports stage breakdowns (prefill / generation / decode)
even without the `profiling` feature:
```bash
docker run --rm --gpus all --entrypoint /bin/bash \
-v "$(pwd):/workspace" -w /workspace \
qwen3-tts:latest \
-c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
cargo run --release --features=cuda,cli --bin e2e_bench -- \
--model-dir test_data/models/1.7B-CustomVoice --iterations 3 --warmup 1'
```
### 3. Streaming TTFA (Time to First Audio)
```bash
# Add --streaming flag
... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
--iterations 3 --warmup 1 --streaming
```
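For copy-paste convenience, here is the full streaming command assembled from the Docker wrapper and the mode 2 invocation above (swap in the profiling build flags from mode 1 if you also want a trace):

```bash
docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
  cargo run --release --features=cuda,cli --bin e2e_bench -- \
  --model-dir test_data/models/1.7B-CustomVoice --iterations 3 --warmup 1 --streaming'
```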
### 4. JSON Output
```bash
... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
--json-output results.json --iterations 3
```
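One way to compare two benchmark runs without assuming anything about the JSON schema is a normalized diff (`baseline.json` here is just a hypothetical saved copy of an earlier `results.json`):

```bash
# jq -S sorts keys, so the diff shows only value changes between runs.
diff <(jq -S . baseline.json) <(jq -S . results.json)
```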
## GPU Sync Audit
List all `to_vec1()` GPU→CPU synchronization points:
```bash
bash scripts/audit-gpu-syncs.sh
```
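If the script is not available in your checkout, a rough equivalent is to grep the source tree directly (a sketch, assuming the calls live under `src/`):

```bash
# Find every to_vec1() call site, with file and line number.
grep -rn --include='*.rs' 'to_vec1' src/
```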
## Interpreting Results
### Stage Breakdown Table
```text
Label Words Wall (ms) Audio (s) RTF Tok/s Mem (MB) Prefill Generate Decode
short 13 5235.2 3.68 1.423 8.8 858 21ms (1%) 2724ms (71%) 1109ms (29%)
medium 53 23786.3 34.00 0.700 17.9 859 20ms (0%) 22694ms (95%) 1057ms (4%)
long 115 43797.4 60.96 0.718 17.4 864 19ms (0%) 41861ms (96%) 1886ms (4%)
```
Key metrics:
- **RTF** (real-time factor) = wall time / audio duration. RTF < 1.0 means faster than real-time; e.g. medium: 23.79 s wall / 34.00 s audio ≈ 0.70.
- **Prefill**: should be <50ms on GPU. If high, check embedding/attention.
- **Generation**: dominates wall time, with ~18 GPU→CPU syncs per frame (16 in code_predictor + 2 in sampling).
- **Decode**: ConvNeXt decoder; scales with frame count (~4% of wall time for long text).
- **Tok/s**: semantic tokens per second. Higher is better.
### Chrome Trace Analysis
In Perfetto/chrome://tracing:
1. Look for gaps between `talker_step` and `code_predictor` — that's CPU overhead
2. Check if `sampling` (top_k + top_p) is significant vs model forward passes
3. The `gpu_sync` events mark where GPU stalls waiting for CPU
### Optimization Targets
The ~18 `to_vec1()` calls per frame are the main bottleneck:
- 16 in code_predictor (argmax per acoustic code group)
- 2 in sampling (read sampled token)
Batching these (for example, stacking the 16 code-group logits into one tensor and doing a single argmax plus one copy) would cut the GPU→CPU round-trips per frame.
## Model Variants
| Model | Dir | Notes |
| ---------------- | ----------------------------------- | ------------------------------- |
| 1.7B-CustomVoice | `test_data/models/1.7B-CustomVoice` | Default benchmark target |
| 1.7B-Base | `test_data/models/1.7B-Base` | Voice cloning (needs ref audio) |
| 1.7B-VoiceDesign | `test_data/models/1.7B-VoiceDesign` | Text-described voices |
## Reference Baseline (1.7B-CustomVoice, CUDA)
From January 2025 on DGX (A100):
- Short (13 words): RTF 1.42, 8.8 tok/s
- Medium (53 words): RTF 0.70, 17.9 tok/s
- Long (115 words): RTF 0.72, 17.4 tok/s
- Prefill: ~20 ms; Decode: ~1-2 s; Generation: 71-96% of wall time