qwen3-tts-profile skill

This skill profiles and benchmarks qwen3-tts-rs inference inside a CUDA Docker container, producing Chrome traces, per-stage timing breakdowns, and bottleneck insights.

npx playbooks add skill trevors/dot-claude --skill qwen3-tts-profile

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md
---
name: qwen3-tts-profile
description: |
  Profile and benchmark qwen3-tts-rs inference. Runs e2e_bench with chrome
  tracing, flamegraph, or Nsight Systems inside a CUDA Docker container.
  Triggers on: "profile", "benchmark", "run profiling", "chrome trace",
  "flamegraph", "nsys", "perf trace", "how fast", "what's slow",
  "performance", "bottleneck"
---

# qwen3-tts-rs Profiling & Benchmarking

Run performance profiling and benchmarks for the qwen3-tts Rust TTS engine.

## Prerequisites

- Docker with `--gpus all` support
- `qwen3-tts:latest` Docker image (has Rust toolchain + CUDA)
- Model weights in `test_data/models/` (1.7B-CustomVoice is the default)
- `tokenizer.json` must be in the model directory
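
Before running anything, a quick sanity check helps; this is a minimal sketch using the paths listed above (the checks themselves are not part of the skill):

```bash
# Verify the image and default model assets exist before benchmarking.
docker image inspect qwen3-tts:latest >/dev/null 2>&1 \
  || echo "missing qwen3-tts:latest image"
test -d test_data/models/1.7B-CustomVoice \
  || echo "missing model weights"
test -f test_data/models/1.7B-CustomVoice/tokenizer.json \
  || echo "missing tokenizer.json"
```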

## Docker Execution Pattern

The CUDA toolchain lives inside the Docker container. All cargo commands must
run there. The workspace is bind-mounted at `/workspace`:

```bash
docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && <COMMAND>'
```
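
The commands below repeat this preamble. A small wrapper can cut the noise; `in_tts` is a hypothetical helper name, not part of the repo, and assumes the image and bind mount shown above:

```bash
# Hypothetical helper (not in the repo): wraps the docker run preamble so
# profiling commands can be passed as arguments.
in_tts() {
  docker run --rm --gpus all --entrypoint /bin/bash \
    -v "$(pwd):/workspace" -w /workspace \
    qwen3-tts:latest \
    -c "export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:\$PATH && $*"
}

# Usage:
# in_tts cargo build --release --features=cuda,cli
```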

## Profiling Modes

### 1. Chrome Trace (default — best for span hierarchy)

Produces `trace.json` for viewing in `chrome://tracing` or <https://ui.perfetto.dev>.

```bash
docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
      cargo run --profile=profiling --features=profiling,cuda,cli --bin e2e_bench -- \
        --model-dir test_data/models/1.7B-CustomVoice --iterations 1 --warmup 1'
```

Output: `trace.json` (~12 MB for 3 sentences). Contains spans:

- `generate_frames` — full generation loop
- `code_predictor` / `code_predictor_inner` — per-frame acoustic code generation
- `talker_step` — per-frame transformer forward pass
- `sampling` / `top_k` / `top_p` — per-frame token sampling
- `gpu_sync` trace events — marks every `to_vec1()` GPU→CPU sync
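
As a quick sanity check, the `gpu_sync` events can be counted straight from the JSON; a sketch assuming standard Chrome Trace Event format (either a bare event array or an object with a `traceEvents` key):

```bash
# Count gpu_sync events. Chrome Trace JSON is either a bare array of
# events or {"traceEvents": [...]}; handle both.
jq 'if type == "object" then .traceEvents else . end
    | map(select(.name == "gpu_sync")) | length' trace.json
```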

### 2. Per-Stage Timing (no profiling feature needed)

The e2e_bench binary reports stage breakdowns (prefill / generation / decode)
even without the `profiling` feature:

```bash
docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
      cargo run --release --features=cuda,cli --bin e2e_bench -- \
        --model-dir test_data/models/1.7B-CustomVoice --iterations 3 --warmup 1'
```

### 3. Streaming TTFA (Time to First Audio)

```bash
# Add --streaming flag
... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
    --iterations 3 --warmup 1 --streaming
```
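
Using the hypothetical `in_tts` wrapper sketched earlier, the full invocation would mirror mode 2 with the flag added:

```bash
# Sketch: per-stage timing run (mode 2) plus --streaming for TTFA.
in_tts cargo run --release --features=cuda,cli --bin e2e_bench -- \
  --model-dir test_data/models/1.7B-CustomVoice \
  --iterations 3 --warmup 1 --streaming
```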

### 4. JSON Output

```bash
... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
    --json-output results.json --iterations 3
```
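
The schema of `results.json` is whatever `e2e_bench` emits and isn't documented here, so inspect it before scripting against it; a minimal sketch for eyeballing one run and diffing two:

```bash
# Pretty-print the output, then diff two runs key-by-key (sorted keys).
jq . results.json
diff <(jq -S . baseline.json) <(jq -S . results.json)
```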

## GPU Sync Audit

List all `to_vec1()` GPU→CPU synchronization points:

```bash
bash scripts/audit-gpu-syncs.sh
```
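
If the script is unavailable, a rough stand-in (an assumption based on the script's name, not its actual contents) is to grep the sources for the sync call:

```bash
# Approximate fallback: list call sites of to_vec1 in the Rust sources.
grep -rn 'to_vec1' --include='*.rs' .
```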

## Interpreting Results

### Stage Breakdown Table

```text
Label  Words  Wall (ms)  Audio (s)  RTF    Tok/s  Mem (MB)  Prefill     Generate      Decode
short     13    5235.2      3.68   1.423    8.8      858   21ms (1%)  2724ms (71%)  1109ms (29%)
medium    53   23786.3     34.00   0.700   17.9      859   20ms (0%)  22694ms (95%)  1057ms (4%)
long     115   43797.4     60.96   0.718   17.4      864   19ms (0%)  41861ms (96%)  1886ms (4%)
```

Key metrics:

- **RTF < 1.0** = faster than real-time (see the worked check after this list)
- **Prefill**: Should be <50ms on GPU. If high, check embedding/attention.
- **Generation**: Dominates. ~18 GPU→CPU syncs per frame (16 code_predictor + 2 sampling).
- **Decode**: ConvNeXt decoder. Scales with frame count. ~4% for long text.
- **Tok/s**: Semantic tokens per second. Higher = better.
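
As a worked check, RTF here is wall-clock time divided by generated audio duration: for the short row, 5235.2 ms / 3.68 s ≈ 1.42 (slower than real-time), while the medium row gives 23786.3 ms / 34.00 s ≈ 0.70.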

### Chrome Trace Analysis

In Perfetto/chrome://tracing:

1. Look for gaps between `talker_step` and `code_predictor` — that's CPU overhead
2. Check if `sampling` (top_k + top_p) is significant vs model forward passes
3. The `gpu_sync` events mark where GPU stalls waiting for CPU
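
To rank spans by total time without opening a UI, the trace can be aggregated from the shell; a sketch assuming complete ("X"-phase) events carrying a microsecond `dur` field, which needs adjusting if the trace emits begin/end ("B"/"E") pairs instead:

```bash
# Total duration per span name, descending. Assumes "X" (complete) events
# with a microsecond "dur" field; B/E-pair traces need pairing first.
jq 'if type == "object" then .traceEvents else . end
    | map(select(.ph == "X" and .dur != null))
    | group_by(.name)
    | map({name: .[0].name, count: length,
           total_ms: ((map(.dur) | add) / 1000)})
    | sort_by(-.total_ms)' trace.json
```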

### Optimization Targets

The ~18 `to_vec1()` calls per frame are the main bottleneck:

- 16 in code_predictor (argmax per acoustic code group)
- 2 in sampling (read sampled token)

Batch these to reduce GPU→CPU round-trips.

## Model Variants

| Model            | Dir                                 | Notes                           |
| ---------------- | ----------------------------------- | ------------------------------- |
| 1.7B-CustomVoice | `test_data/models/1.7B-CustomVoice` | Default benchmark target        |
| 1.7B-Base        | `test_data/models/1.7B-Base`        | Voice cloning (needs ref audio) |
| 1.7B-VoiceDesign | `test_data/models/1.7B-VoiceDesign` | Text-described voices           |
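
To compare variants on the same hardware, the JSON output mode loops naturally over model directories. A sketch using the hypothetical `in_tts` wrapper from the Docker section; 1.7B-Base is skipped because its reference-audio flags aren't covered here:

```bash
# Benchmark two variants back to back, one results file per model.
for m in 1.7B-CustomVoice 1.7B-VoiceDesign; do
  in_tts cargo run --release --features=cuda,cli --bin e2e_bench -- \
    --model-dir "test_data/models/$m" \
    --json-output "results-$m.json" --iterations 3 --warmup 1
done
```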

## Reference Baseline (1.7B-CustomVoice, CUDA)

From January 2025 on DGX (A100):

- Short (13 words): RTF 1.42, 8.8 tok/s
- Medium (53 words): RTF 0.70, 17.9 tok/s
- Long (115 words): RTF 0.72, 17.4 tok/s
- Prefill: ~20ms, Decode: ~1-2s, Generation: 71-96%

Overview

This skill profiles and benchmarks the qwen3-tts-rs inference pipeline inside a CUDA-enabled Docker container. It runs e2e_bench with Chrome tracing, flamegraph, or Nsight Systems to produce trace.json, stage timings, and JSON results for performance analysis. Use it to identify GPU→CPU syncs and per-stage hotspots, and to measure Time-to-First-Audio (TTFA).

How this skill works

The skill launches the qwen3-tts runtime inside a Docker image with the CUDA toolchain and runs the e2e_bench binary under different profiling modes. It can emit a Chrome trace (trace.json) showing span hierarchies, produce per-stage timing reports without the profiling feature, measure TTFA in streaming mode, and export JSON results. It also includes a GPU sync audit that lists all to_vec1() GPU→CPU synchronization points.

When to use it

  • You need actionable performance data for qwen3-tts inference on CUDA GPUs
  • Investigating why generation dominates end-to-end latency
  • Measuring Time-to-First-Audio for streaming playback
  • Validating optimization changes (reduced GPU→CPU syncs, batching)
  • Collecting baseline metrics across model variants or hardware

Best practices

  • Run inside the provided CUDA Docker image to ensure the correct Rust and CUDA toolchain are available
  • Start with Chrome Trace mode to inspect span hierarchies and GPU sync events
  • Use per-stage timing for quick regression checks without profiling overhead
  • Audit GPU→CPU syncs (to_vec1) and aim to batch or eliminate them to reduce round-trips
  • Collect results.json for automated comparisons and store traces for perf visualizations

Example use cases

  • Generate trace.json via chrome tracing to inspect gpu_sync events and identify CPU stalls between talker_step and code_predictor
  • Run per-stage timing to collect RTF, Tok/s, and memory for short/medium/long text workloads
  • Enable streaming to measure TTFA and validate interactive playback constraints
  • Run the GPU sync audit script to enumerate and prioritize to_vec1() calls for batching optimizations
  • Compare baseline numbers across model variants (1.7B-CustomVoice, 1.7B-Base, 1.7B-VoiceDesign) on target hardware

FAQ

What prerequisites are required to run profiling?

Docker with --gpus all support and the qwen3-tts Docker image containing the Rust toolchain and CUDA. Place model weights and tokenizer.json in the model directory.

Which profiling mode should I start with?

Begin with Chrome Trace to get a detailed span hierarchy and gpu_sync markers. Use per-stage timing for fast iteration and JSON output for automated comparisons.