qwen3-tts-profile skill

This skill profiles and benchmarks qwen3-tts-rs inference inside a CUDA Docker container, producing Chrome traces, per-stage timing breakdowns, and bottleneck insights.

npx playbooks add skill trevors/dot-claude --skill qwen3-tts-profile

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md
---
name: qwen3-tts-profile
description: |
  Profile and benchmark qwen3-tts-rs inference. Runs e2e_bench with chrome
  tracing, flamegraph, or Nsight Systems inside a CUDA Docker container.
  Triggers on: "profile", "benchmark", "run profiling", "chrome trace",
  "flamegraph", "nsys", "perf trace", "how fast", "what's slow",
  "performance", "bottleneck"
---

# qwen3-tts-rs Profiling & Benchmarking

Run performance profiling and benchmarks for the qwen3-tts Rust TTS engine.

## Prerequisites

- Docker with `--gpus all` support
- `qwen3-tts:latest` Docker image (has Rust toolchain + CUDA)
- Model weights in `test_data/models/` (1.7B-CustomVoice is the default)
- `tokenizer.json` must be in the model directory
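
Before running anything, a quick sanity check helps; this is a minimal sketch using the paths listed above (the checks themselves are not part of the skill):

```bash
# Verify the image and default model assets exist before benchmarking.
docker image inspect qwen3-tts:latest >/dev/null 2>&1 \
  || echo "missing qwen3-tts:latest image"
test -d test_data/models/1.7B-CustomVoice \
  || echo "missing model weights"
test -f test_data/models/1.7B-CustomVoice/tokenizer.json \
  || echo "missing tokenizer.json"
```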

## Docker Execution Pattern

The CUDA toolchain lives inside the Docker container. All cargo commands must
run there. The workspace is bind-mounted at `/workspace`:

```bash
docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && <COMMAND>'
```
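
The commands below repeat this preamble. A small wrapper can cut the noise; `in_tts` is a hypothetical helper name, not part of the repo, and assumes the image and bind mount shown above:

```bash
# Hypothetical helper (not in the repo): wraps the docker run preamble so
# profiling commands can be passed as arguments.
in_tts() {
  docker run --rm --gpus all --entrypoint /bin/bash \
    -v "$(pwd):/workspace" -w /workspace \
    qwen3-tts:latest \
    -c "export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:\$PATH && $*"
}

# Usage:
# in_tts cargo build --release --features=cuda,cli
```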

## Profiling Modes

### 1. Chrome Trace (default — best for span hierarchy)

Produces `trace.json` for viewing in `chrome://tracing` or <https://ui.perfetto.dev>.

```bash
docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
      cargo run --profile=profiling --features=profiling,cuda,cli --bin e2e_bench -- \
        --model-dir test_data/models/1.7B-CustomVoice --iterations 1 --warmup 1'
```

Output: `trace.json` (~12 MB for 3 sentences). Contains spans:

- `generate_frames` — full generation loop
- `code_predictor` / `code_predictor_inner` — per-frame acoustic code generation
- `talker_step` — per-frame transformer forward pass
- `sampling` / `top_k` / `top_p` — per-frame token sampling
- `gpu_sync` trace events — marks every `to_vec1()` GPU→CPU sync
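
As a quick sanity check, the `gpu_sync` events can be counted straight from the JSON; a sketch assuming standard Chrome Trace Event format (either a bare event array or an object with a `traceEvents` key):

```bash
# Count gpu_sync events. Chrome Trace JSON is either a bare array of
# events or {"traceEvents": [...]}; handle both.
jq 'if type == "object" then .traceEvents else . end
    | map(select(.name == "gpu_sync")) | length' trace.json
```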

### 2. Per-Stage Timing (no profiling feature needed)

The e2e_bench binary reports stage breakdowns (prefill / generation / decode)
even without the `profiling` feature:

```bash
docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
      cargo run --release --features=cuda,cli --bin e2e_bench -- \
        --model-dir test_data/models/1.7B-CustomVoice --iterations 3 --warmup 1'
```

### 3. Streaming TTFA (Time to First Audio)

```bash
# Add --streaming flag
... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
    --iterations 3 --warmup 1 --streaming
```
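
Using the hypothetical `in_tts` wrapper sketched earlier, the full invocation would mirror mode 2 with the flag added:

```bash
# Sketch: per-stage timing run (mode 2) plus --streaming for TTFA.
in_tts cargo run --release --features=cuda,cli --bin e2e_bench -- \
  --model-dir test_data/models/1.7B-CustomVoice \
  --iterations 3 --warmup 1 --streaming
```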

### 4. JSON Output

```bash
... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
    --json-output results.json --iterations 3
```
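
The schema of `results.json` is whatever `e2e_bench` emits and isn't documented here, so inspect it before scripting against it; a minimal sketch for eyeballing one run and diffing two:

```bash
# Pretty-print the output, then diff two runs key-by-key (sorted keys).
jq . results.json
diff <(jq -S . baseline.json) <(jq -S . results.json)
```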

## GPU Sync Audit

List all `to_vec1()` GPU→CPU synchronization points:

```bash
bash scripts/audit-gpu-syncs.sh
```
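
If the script is unavailable, a rough stand-in (an assumption based on the script's name, not its actual contents) is to grep the sources for the sync call:

```bash
# Approximate fallback: list call sites of to_vec1 in the Rust sources.
grep -rn 'to_vec1' --include='*.rs' .
```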

## Interpreting Results

### Stage Breakdown Table

```text
Label  Words  Wall (ms)  Audio (s)  RTF    Tok/s  Mem (MB)  Prefill     Generate      Decode
short     13    5235.2      3.68   1.423    8.8      858   21ms (1%)  2724ms (71%)  1109ms (29%)
medium    53   23786.3     34.00   0.700   17.9      859   20ms (0%)  22694ms (95%)  1057ms (4%)
long     115   43797.4     60.96   0.718   17.4      864   19ms (0%)  41861ms (96%)  1886ms (4%)
```

Key metrics:

- **RTF < 1.0** = faster than real-time (see the worked check after this list)
- **Prefill**: Should be <50ms on GPU. If high, check embedding/attention.
- **Generation**: Dominates. ~18 GPU→CPU syncs per frame (16 code_predictor + 2 sampling).
- **Decode**: ConvNeXt decoder. Scales with frame count. ~4% for long text.
- **Tok/s**: Semantic tokens per second. Higher = better.
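
As a worked check, RTF here is wall-clock time divided by generated audio duration: for the short row, 5235.2 ms / 3.68 s ≈ 1.42 (slower than real-time), while the medium row gives 23786.3 ms / 34.00 s ≈ 0.70.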

### Chrome Trace Analysis

In Perfetto/chrome://tracing:

1. Look for gaps between `talker_step` and `code_predictor` — that's CPU overhead
2. Check if `sampling` (top_k + top_p) is significant vs model forward passes
3. The `gpu_sync` events mark where GPU stalls waiting for CPU
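
To rank spans by total time without opening a UI, the trace can be aggregated from the shell; a sketch assuming complete ("X"-phase) events carrying a microsecond `dur` field, which needs adjusting if the trace emits begin/end ("B"/"E") pairs instead:

```bash
# Total duration per span name, descending. Assumes "X" (complete) events
# with a microsecond "dur" field; B/E-pair traces need pairing first.
jq 'if type == "object" then .traceEvents else . end
    | map(select(.ph == "X" and .dur != null))
    | group_by(.name)
    | map({name: .[0].name, count: length,
           total_ms: ((map(.dur) | add) / 1000)})
    | sort_by(-.total_ms)' trace.json
```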

### Optimization Targets

The ~18 `to_vec1()` calls per frame are the main bottleneck:

- 16 in code_predictor (argmax per acoustic code group)
- 2 in sampling (read sampled token)

Batch these to reduce GPU→CPU round-trips.

## Model Variants

| Model            | Dir                                 | Notes                           |
| ---------------- | ----------------------------------- | ------------------------------- |
| 1.7B-CustomVoice | `test_data/models/1.7B-CustomVoice` | Default benchmark target        |
| 1.7B-Base        | `test_data/models/1.7B-Base`        | Voice cloning (needs ref audio) |
| 1.7B-VoiceDesign | `test_data/models/1.7B-VoiceDesign` | Text-described voices           |
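
To compare variants on the same hardware, the JSON output mode loops naturally over model directories. A sketch using the hypothetical `in_tts` wrapper from the Docker section; 1.7B-Base is skipped because its reference-audio flags aren't covered here:

```bash
# Benchmark two variants back to back, one results file per model.
for m in 1.7B-CustomVoice 1.7B-VoiceDesign; do
  in_tts cargo run --release --features=cuda,cli --bin e2e_bench -- \
    --model-dir "test_data/models/$m" \
    --json-output "results-$m.json" --iterations 3 --warmup 1
done
```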

## Reference Baseline (1.7B-CustomVoice, CUDA)

From January 2025 on DGX (A100):

- Short (13 words): RTF 1.42, 8.8 tok/s
- Medium (53 words): RTF 0.70, 17.9 tok/s
- Long (115 words): RTF 0.72, 17.4 tok/s
- Prefill: ~20ms, Decode: ~1-2s, Generation: 71-96%

Overview

This skill profiles and benchmarks the qwen3-tts-rs inference pipeline inside a CUDA-enabled Docker container. It runs e2e_bench with Chrome tracing, flamegraph, or Nsight Systems to produce trace.json, stage timings, and JSON results for performance analysis. Use it to identify GPU→CPU syncs and per-stage hotspots, and to measure Time-to-First-Audio (TTFA).

How this skill works

The skill launches the qwen3-tts runtime inside a Docker image with the CUDA toolchain and runs the e2e_bench binary under different profiling modes. It can emit a Chrome trace (trace.json) showing span hierarchies, produce per-stage timing reports without the profiling feature, measure TTFA in streaming mode, and export JSON results. It also includes a GPU sync audit that lists all to_vec1() GPU→CPU synchronization points.

When to use it

  • You need actionable performance data for qwen3-tts inference on CUDA GPUs
  • Investigating why generation dominates end-to-end latency
  • Measuring Time-to-First-Audio for streaming playback
  • Validating optimization changes (reduced GPU→CPU syncs, batching)
  • Collecting baseline metrics across model variants or hardware

Best practices

  • Run inside the provided CUDA Docker image to ensure the correct Rust and CUDA toolchain are available
  • Start with Chrome Trace mode to inspect span hierarchies and GPU sync events
  • Use per-stage timing for quick regression checks without profiling overhead
  • Audit GPU→CPU syncs (to_vec1) and aim to batch or eliminate them to reduce round-trips
  • Collect results.json for automated comparisons and store traces for perf visualizations

Example use cases

  • Generate trace.json via chrome tracing to inspect gpu_sync events and identify CPU stalls between talker_step and code_predictor
  • Run per-stage timing to collect RTF, Tok/s, and memory for short/medium/long text workloads
  • Enable streaming to measure TTFA and validate interactive playback constraints
  • Run the GPU sync audit script to enumerate and prioritize to_vec1() calls for batching optimizations
  • Compare baseline numbers across model variants (1.7B-CustomVoice, 1.7B-Base, 1.7B-VoiceDesign) on target hardware

FAQ

What prerequisites are required to run profiling?

Docker with --gpus all support and the qwen3-tts Docker image containing the Rust toolchain and CUDA. Place model weights and tokenizer.json in the model directory.

Which profiling mode should I start with?

Begin with Chrome Trace to get a detailed span hierarchy and gpu_sync markers. Use per-stage timing for fast iteration and JSON output for automated comparisons.