
Qwen3.6 delivered where it shouldn’t have

Or how a “plain A/B benchmark” became today’s most surprising result

Two weeks ago I wrote an article about how 6 hours of Triton MoE tuning left my production config 10% worse. That was the pessimistic ending. Now Qwen3.6-35B-A3B-FP8 has landed, and I figured I’d run the same benchmark suite — maybe there’d be something to write about.

There is. Just not where I expected.

If you just want the punchline: Qwen3.6 is just as fast as 3.5. With one exception. Turning on MTP (multi-token prediction) on the 16-concurrent stress test delivered +24% throughput and −56% TTFT — exactly where spec decoding should have been NEGATIVE in theory. The remaining 5,000 words are optional.

The setup, slightly revised

Same DGX Spark as before, same GB10 (Blackwell, SM 12.1, 128 GB unified LPDDR5x), just with the new FP8 model this time. Goal: run the same 4-scenario benchmark suite (single decode, prefill-bound, concurrent stress, chat profile) on 3.6 and see whether switching to production is worth it.

The tests are identical to the previous article:

  • A: single decode — 512 input, 512 output, batch=1
  • B: prefill-bound — 8192 input, 256 output, 4 concurrent
  • C: concurrent contention — 2048 input, 512 output, 16 concurrent
  • D: chat profile — 2048 input, 256 output, 2 concurrent

Plus this time I ran every non-A test with two seeds (42 and 123), because in the phase7 article one seed of test C returned an outlier. Controlled variance, as any proper benchmark demands.
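For reproducibility, this is roughly how that matrix turns into vllm bench serve invocations. A minimal sketch: the flag names mirror the random-dataset mode of vLLM's bundled benchmark script, while the port, the request counts for B/C/D, and the result-file naming are my own placeholders, so check vllm bench serve --help on your build.

import subprocess

# Scenario matrix from above: (name, input_len, output_len, concurrency).
SCENARIOS = [
    ("A_single_decode", 512, 512, 1),
    ("B_prefill_bound", 8192, 256, 4),
    ("C_concurrent",    2048, 512, 16),
    ("D_chat",          2048, 256, 2),
]
SEEDS = [42, 123]

for name, in_len, out_len, conc in SCENARIOS:
    for seed in SEEDS:
        subprocess.run([
            "vllm", "bench", "serve",
            "--base-url", "http://localhost:8000",        # placeholder port
            "--model", "Qwen/Qwen3.6-35B-A3B-FP8",
            "--dataset-name", "random",
            "--random-input-len", str(in_len),
            "--random-output-len", str(out_len),
            "--max-concurrency", str(conc),
            "--num-prompts", "20",                        # test A used 20 requests; B/C/D counts are a guess
            "--seed", str(seed),
            "--save-result",
            "--result-filename", f"{name}_seed{seed}.json",
        ], check=True)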

But the stack did change over the two weeks, which is worth logging before comparing anything:

Component      | Two weeks ago               | Now
NVIDIA driver  | 580.126.09                  | 580.142
CUDA           | 13.0                        | 13.0
vLLM version   | 0.19.1rc1.dev231            | 0.19.1rc1.dev328
Image SHA      | cu130-nightly (2 weeks old) | cu130-nightly (fresh pull)

Driver is a minor patch, vLLM is ~100 commits ahead. These moved together and I can’t separate them. But before investigating 3.6, I first ran 3.5 on the fresh stack to get a clean baseline.

Phase 7-fresh: control experiment on 3.5

Without a control group the story doesn’t work. If I measure 3.6 on the fresh image but 3.5 on the old one, the difference will be a mix of model + stack. So same phase7 config, same 3.5 model, just fresh driver + fresh vLLM.

First surprise in the boot log: the KV pool grew from 174k tokens to 368k tokens at the same gpu-memory-utilization=0.45. 2.12× larger. VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 gave more accurate accounting, and the CUDA graph capture mode reconfigured itself — a new hybrid FULL_AND_PIECEWISE mode appeared, 11 piecewise + 7 full graphs.

The measurement:

Test          | 2-week-old image | Fresh stack | Δ
A: tok/s      | 49.15            | 50.33       | +2.4%
B: Mean TTFT  | 5776 ms          | 4652 ms     | −19.5%
B: tok/s      | 69.96            | 74.73       | +6.8%
C: tok/s      | 207.92           | 216.30      | +4.0%
C: Mean TTFT  | 4146 ms          | 3989 ms     | −3.8%
D: tok/s      | 72.26            | 74.00       | +2.4%

The KIE workload prefill got 20% faster. In two weeks, with no config change. The driver+vLLM pair landed something whose biggest win falls on long-context prefill. A positive starting point for the 3.6 test.

One ugly detail: the A-test P99 TTFT jumped from 176 ms to 705 ms. P95 is 208 ms, so that's a single outlier out of 20 requests — a cold-path init at the start of the benchmark that the 3 warmup requests didn't cover. Not disruptive in production, but it's in the JSON. I also watched for it on the 3.6 test, and it's there too (687 ms). So this is benchmark methodology, not a model or stack effect. Noted, and moving on.
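To make the "one outlier owns the P99" point concrete: with 20 requests, P99 interpolates between the 19th and 20th sorted values, so a single cold request drags it up while P95 stays close to the warm cluster. A toy illustration with made-up latencies:

import numpy as np

# 19 warm requests around 170-188 ms plus one cold-path request at 705 ms.
# Illustrative values, not the benchmark's raw data.
ttft_ms = np.array([170 + i for i in range(19)] + [705])

print(f"P95: {np.percentile(ttft_ms, 95):.0f} ms")   # ~214 ms, barely nudged by the outlier
print(f"P99: {np.percentile(ttft_ms, 99):.0f} ms")   # ~607 ms, pulled most of the way to it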

Main event, part one: Qwen3.6 vanilla

I port the same phase7 config to 3.6, only the model path changes. The only material edit to docker-compose: Qwen/Qwen3.5-35B-A3B-FP8 → Qwen/Qwen3.6-35B-A3B-FP8. Boot, watch the logs.

A few interesting observations:

1. vLLM resolves 3.6 as Qwen3_5MoeForConditionalGeneration too. Not a bug — the HF model card shows this, the config.json has it as model_type. This means from vLLM’s perspective 3.6 is essentially the 3.5 architecture retrained. Same loader, same kernels (CUTLASS FP8, Triton MoE, GDN prefill), same Mamba cache align logic, same MoE config warning (the E=256, N=512, device_name=NVIDIA_GB10 config still isn’t there).

2. 42 safetensors shards vs. 14. The checkpoint is the same size (34.89 GiB), but 3.6 is sliced into many more files. Consequence: weights loading takes 289 seconds instead of 87. 3.5 minutes longer at cold boot.

3. KV pool shrank a bit: 368k → 360k tokens (−2.3%). The 3.6 model card mentions a 256 Expert + 1 Shared architecture. The number of routed experts (256) and their size (512) are the same, but the new shared expert adds +1 dense FFN activation per token during profiling. Weights size didn’t grow, but max-batch forward-pass activation memory did.

4. Everything else is identical. Same FLASHINFER attention backend, same TRITON MoE, same Mamba align warning, same default sampling params from generation_config.json.

My expectation: mild TPOT regression from the shared expert, everything else unchanged. Let’s look:

Test          | Qwen3.5 (phase7-fresh) | Qwen3.6 vanilla | Δ
A: tok/s      | 50.33                  | 50.51           | +0.4%
A: Mean TPOT  | 19.51 ms               | 19.45 ms        | −0.3%
B: tok/s      | 74.73                  | 73.98           | −1.0%
B: Mean TTFT  | 4652 ms                | 4742 ms         | +1.9%
C: Mean TPOT  | 66.30 ms               | 67.03 ms        | +1.1%
C: tok/s      | 216.30                 | 214.28          | −0.9%
D: tok/s      | 74.00                  | 73.88           | −0.2%
D: Mean TPOT  | 24.23 ms               | 24.36 ms        | +0.6%

Almost everything is noise. 22 of 24 metrics are within ±2%. Test C’s TPOT is consistently +1.1% worse — that’s the shared expert’s silent tax, exactly as large as expected. The shared expert adds +1 dense FFN of compute per decode step. That’s measurable at the token level, barely visible at the throughput level.

I’d have stopped the article here if I hadn’t had an idea. Just before shutting down the instance, I remembered that 3.6 natively supports qwen3_next_mtp speculative decoding. The model card mentions it as a recommended production config for single-user paths.

Figured I’d test that too.

Main event, part two: MTP

Multi-Token Prediction: the model “predicts” 2-4 tokens per decode step, and the main model verifies them in parallel. If the acceptance rate is high, you get several free tokens per step. The traditional implementation (EAGLE, Medusa) needs separate draft model weights — Qwen3.6, however, ships with built-in MTP heads that share embedding and lm_head weights with the main model.
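For intuition, one decode step looks roughly like the toy sketch below. It uses fake integer "models" and greedy exact-match acceptance; the real implementation verifies all draft positions in a single batched target forward pass and uses rejection sampling, so treat this as a cartoon of the idea, not vLLM's code.

import random

random.seed(0)

def target_next(prefix):
    # The main model: always "correct" (next integer in the sequence).
    return prefix[-1] + 1

def draft_propose(prefix, k):
    # The MTP heads: cheap guesses that are wrong ~30% of the time.
    out, cur = [], prefix[-1]
    for _ in range(k):
        cur = cur + 1 if random.random() < 0.7 else cur + 2
        out.append(cur)
    return out

def spec_decode_step(context, k=2):
    draft = draft_propose(context, k)
    accepted, cur = [], list(context)
    for tok in draft:
        if tok == target_next(cur):      # in practice all k checks come from one target pass
            accepted.append(tok)
            cur.append(tok)
        else:
            break                        # first rejection ends the accepted run
    bonus = target_next(cur)             # the target pass always yields one more correct token
    return accepted + [bonus]

seq = [0]
for _ in range(5):
    seq += spec_decode_step(seq)
print(seq)                               # grows by 1 to k+1 tokens per step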

The config is minimal:

--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

vLLM immediately corrected me: the qwen3_next_mtp method was renamed to mtp on some quiet night, the HF docs became outdated accordingly, but the aliased redirect means the old name still works. The Qwen folks will update the docs, I hope.

The boot log added three things worth logging:

Detected MTP model. Sharing target model embedding weights with the draft model.
Detected MTP model. Sharing target model lm_head weights with the draft model.

The draft model is not another 35 GB — only the MTP heads (~500 MB) load extra, the embedding and lm_head are shared. Model loading took 35.02 GiB memory — +0.79 GiB total for MTP. Clever implementation.

CUDAGraphMode.FULL_AND_PIECEWISE is not supported with spec-decode for attention backend FlashInferBackend
(support: AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE); setting cudagraph_mode=PIECEWISE

Hidden tax: enabling MTP automatically downgrades CUDA graph mode to the weaker PIECEWISE. Vanilla phase8 used FULL_AND_PIECEWISE; this is a handicap before we even start.

GPU KV cache size: 323,456 tokens
Maximum concurrency for 131,072 tokens per request: 8.24x

KV pool −10% vs. vanilla: 360k → 323k tokens. MTP heads memory, padding layers, and PIECEWISE activation overhead together eat about 37k tokens. Still enough at max-num-seqs=32, but another hidden cost.

So: +500 MB weights, PIECEWISE graph mode, smaller KV pool. The MTP gains have to work against all of this.

The results

Once the benchmark started, a SpecDecoding metrics line appeared in the log every 10 seconds during test A. At 5:00:39 AM I saw:

Mean acceptance length: 2.50, Avg Draft acceptance rate: 74.9%

2.50 accepted tokens per decode step, against a maximum of 2 drafted tokens per step: acceptance length counts the main model's own token too, so the ceiling is 3.00 and this works out to ~1.5 extra tokens per step. A stable 70%+ acceptance rate is excellent — from spec decoding theory this should produce substantial throughput gains.

Full output for test A:

Acceptance rate (%):                     71.63
Acceptance length:                       2.43
Drafts:                                  4211
Draft tokens:                            8422
Accepted tokens:                         6033
Per-position acceptance (%):
  Position 0:                            81.81
  Position 1:                            61.46

Position 0 (the first spec token) accepted at 81.8% shows the draft head is well-trained — the model almost always guesses the next token correctly. Position 1 is only 61.5% — the second spec token is much more uncertain. With num_speculative_tokens=1 you’d get higher per-token efficiency, but in absolute terms 2-token spec gains more tokens.
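Every derived number above follows mechanically from the three raw counters in that log block. A minimal reproduction in plain Python, no vLLM dependency:

# Raw counters from the test-A spec-decoding summary above.
drafts = 4211            # decode steps that proposed a draft
draft_tokens = 8422      # num_speculative_tokens=2, so 2 per draft
accepted_tokens = 6033   # draft tokens the main model verified and kept

acceptance_rate = accepted_tokens / draft_tokens    # 71.63%
extra_per_step = accepted_tokens / drafts           # ~1.43 free tokens per step
acceptance_length = 1 + extra_per_step              # 2.43, counts the main token too

print(f"acceptance rate:   {acceptance_rate:.2%}")
print(f"extra tokens/step: {extra_per_step:.2f}")
print(f"acceptance length: {acceptance_length:.2f}")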

After the run, the Prometheus metrics endpoint gave the global aggregate:

Total drafts:           41,568
Total draft tokens:     83,136
Total accepted tokens:  60,298
Position 0 accepted:    33,909  (81.57%)
Position 1 accepted:    26,389  (63.48%)
→ Overall acceptance: 72.53%

Consistent with the per-test numbers. 72.53% global acceptance rate is outstanding by spec-decoding standards. Below 50% it’s overhead, above 60% profitable, 70%+ is textbook territory. The Qwen team put serious work into training the MTP heads.
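Pulling those aggregates is one HTTP request against the server's /metrics endpoint. The endpoint itself is standard Prometheus text; the exact counter names vary between vLLM versions, so the spec_decode filter below is an assumption to check against your build.

import urllib.request

BASE_URL = "http://localhost:8000"   # placeholder: wherever the vLLM server listens

with urllib.request.urlopen(f"{BASE_URL}/metrics") as resp:
    text = resp.read().decode()

# Print every non-comment metric line that mentions spec decoding.
for line in text.splitlines():
    if "spec_decode" in line and not line.startswith("#"):
        print(line)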

Now the throughput numbers:

Test             | Qwen3.6 vanilla | Qwen3.6 + MTP | Δ
A: Output tok/s  | 50.51           | 54.92         | +8.7%
A: Mean TPOT     | 19.45 ms        | 17.82 ms      | −8.4%
A: P99 TPOT      | 19.52 ms        | 21.44 ms      | +9.8%
B: Output tok/s  | 73.98           | 68.68         | −7.2%
B: Mean TTFT     | 4742 ms         | 2892 ms       | −39.0%
B: Mean TPOT     | 35.66 ms        | 46.62 ms      | +30.7%
B: P99 ITL       | ~36 ms          | 1053 ms       | outlier
C: Output tok/s  | 214.28          | 266.25        | +24.2%
C: Mean TTFT     | 3979 ms         | 1721 ms       | −56.7%
C: Mean TPOT     | 67.03 ms        | 55.58 ms      | −17.1%
D: Output tok/s  | 73.88           | 77.60         | +5.0%
D: Mean TTFT     | 718 ms          | 502 ms        | −30.1%

OK, stop. Four tests, four different stories.

Test A: what we expected

Single decode, batch=1. The textbook case for spec decoding. +8.7% throughput, −8.4% TPOT, 71.6% acceptance rate. That’s it. P99 TPOT went up (+9.8%), but that was also expected: when a draft gets rejected the whole forward pass is “wasted,” so tail latency gets a little worse in exchange for better mean.

If you’re serving single-user chat, turn it on. 9% throughput almost for free, the P99 tax is negligible.

Test D: the chat agent

2 concurrent users, 2k context, 256 output. +5% throughput, −30% TTFT. Moderate but consistent win. This is the “everyday chat” profile, where nobody would complain if it’s 5% faster, and the user experience noticeably improves from the TTFT drop.

Note: D-test Mean ITL is 55 ms while Mean TPOT is only 23.5 ms. The difference: TPOT divides the decode-phase time by the number of generated tokens, while ITL measures the gap between streamed chunks. With spec decoding a single chunk can deliver 2-3 accepted tokens at once, so the per-chunk gap is roughly the acceptance length times the per-token time (2.4 × 23.5 ms ≈ 55 ms). TPOT is the smoother, more useful metric here.
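A minimal illustration of those mechanics with invented timestamps, not measured data:

# Hypothetical streaming trace for one request: each entry is one streamed
# chunk (one decode step) as (arrival_time_s, tokens_in_chunk). With MTP,
# a step that accepts both draft tokens delivers 3 tokens in one chunk.
chunks = [(0.000, 1), (0.055, 3), (0.110, 2), (0.165, 3), (0.220, 1)]

times = [t for t, _ in chunks]
total_tokens = sum(n for _, n in chunks)

itl = [b - a for a, b in zip(times, times[1:])]        # gap between streamed chunks
tpot = (times[-1] - times[0]) / (total_tokens - 1)     # decode time per generated token

print(f"Mean ITL: {1000 * sum(itl) / len(itl):.1f} ms")   # ~55 ms
print(f"TPOT:     {1000 * tpot:.1f} ms")                  # ~24 ms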

Test B: the catastrophe, or maybe not

Mean TTFT: −39%, Mean TPOT: +30.7%, Output tok/s: −7.2%.

This is contradictory. TTFT (time to first token) improved, TPOT (time per output token) worsened. Aggregate throughput is mildly negative.

What’s going on? Test B is 8k input, 256 output, 4 concurrent. Long-context prefill is compute-heavy. MTP however adds per-token overhead: every decode step gets +1 draft forward, +2 spec verifications. With 4 concurrent requests, spec verification and main decode together need more compute than vanilla decode, and Mamba/GDN state updates don’t scale well. Result: the decode phase gets slower, the prefill (where MTP isn’t active) takes advantage of reduced contention, TTFT improves.

One more ugly detail: P99 ITL = 1053 ms, a full one-second spike. Mean ITL is 120 ms. This isn’t noise — both seeds give 1000+ ms P99 (1063 and 1043 ms). Hypothesis: a preemption event during the 4-concurrent prefill. When all 4 requests go into prefill phase at once, decode steps queue up, and MTP draft verification makes this worse — spec decoding is more fragile to preemption than a plain generate step.

Practical consequence: if your workload is long context + concurrent prefill (e.g. a DocAI KIE pipeline processing 4 documents in parallel), turn MTP off. Vanilla will be faster and more stable at the tail.

Test C: what I didn’t expect

Output tok/s: +24.2%. Mean TTFT: −56.7%. Mean TPOT: −17.1%.

This is something else entirely.

The expectation was that the 16-concurrent stress test would be neutral or negative with MTP. In theory, concurrent batching already saturates the GPU, so spec verification is pure overhead with no spare compute to hide in.

The spec-decoding numbers on test C look much the same:

  • Acceptance rate 72.9% — decent, not exceptional
  • Mean acceptance length 2.44 — stable
  • 13,400 drafts per seed, 26,500 draft tokens per seed — plenty of sample

And the result: +24% throughput, −56% TTFT, −17% TPOT. Better in every direction.

I sat down and thought about it for a while. Where does the gain come from?

The answer, I think, is the GB10 unified memory architecture. DGX Spark doesn’t have dedicated VRAM; CPU and GPU share 128 GB of LPDDR5x. 16-concurrent batch decode is probably not compute-bound but memory-bandwidth-bound — the Triton MoE kernel fetches weights for 8 experts per token from LPDDR5x, and that bandwidth serialises. Compute-side capacity is left on the table.
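Napkin math for the bandwidth-bound claim. Both inputs are assumptions rather than measurements from this box: the public GB10 spec quotes roughly 273 GB/s of memory bandwidth, and A3B implies about 3B active parameters, which at FP8 is roughly 3 GB of weights touched per token.

# Rough decode-rate ceiling if generation were purely weight-bandwidth-bound.
mem_bandwidth_gb_s = 273.0   # assumed LPDDR5x bandwidth for GB10 (spec-sheet figure)
active_params_b = 3.0        # "A3B": ~3B active parameters per token
bytes_per_param = 1.0        # FP8 weights

weights_per_token_gb = active_params_b * bytes_per_param
ceiling_tok_s = mem_bandwidth_gb_s / weights_per_token_gb

print(f"Bandwidth-only ceiling: ~{ceiling_tok_s:.0f} tok/s per sequence")
# A 16-way batch on a 256-expert MoE reuses little of that weight traffic,
# because different tokens route to largely different experts, so decode can
# stay bandwidth-bound while compute sits idle for MTP verification to use.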

MTP verification fills exactly that unused compute. Spec tokens verify the next step’s draft hypotheses in parallel, and since the memory transfer is already in flight (same weights), the extra compute is essentially “free.” Token generation rate jumps, TTFT drops because parallel requests spend less time queued (shorter decode phase → they clear faster).

This is DGX Spark specific. On an H100 or A100 the 16-concurrent batch fills the compute and MTP verification becomes overhead. On GB10 the compute/memory ratio is different, and that architectural property makes spec decoding disproportionately useful on this workload.

If that’s right, an interesting consequence: the more memory-bandwidth-bound the workload, the more MTP wins. Test C’s 2k context × 16 concurrent (= 32k active KV cache tokens per step) is exactly that. Test B’s 8k × 4 (= 32k too, but longer context → slower prefill → more scheduler contention) already behaves differently.

The big picture

Let’s look at the four tests together and turn them into a decision table for the article’s practical use:

Workload                             | MTP?           | Reasoning
Single user (A)                      | ✅ Yes          | +9% throughput, minimal P99 tax
Chat, 2-4 users (D)                  | ✅ Yes          | +5% throughput, −30% TTFT
Concurrent stress (C)                | ✅✅ Absolutely  | +24% throughput, −56% TTFT, surprise win
Long-context concurrent prefill (B)  | ❌ No           | TPOT regression, P99 ITL spikes

Test C’s surprise is what makes the Qwen3.6 switch worth it. Not the raw model speed — vanilla 3.6 and 3.5 are practically the same, ±1% apart. It’s the built-in MTP heads, which 3.5 couldn’t have matched (it doesn’t have them), and which make 3.6 measurably faster on the production chat agent workload.

So what did I actually learn?

Six lessons, in order of importance:

1. Spec decoding behaves differently on GB10

Classical wisdom says speculative decoding is good for single-user paths and bad under concurrent load. On the DGX Spark unified memory architecture this is inverted in the mid-concurrent regime. On the 16-concurrent test it brought +24% where on an H100 it would be zero or negative. Hardware-specific tuning of serving configs is a real thing, and common GPU advice (“spec decoding only helps single user”) may not be right on your hardware.

2. The shared expert is a silent tax

3.6’s new shared expert means +1 dense FFN forward per decode step. It shows up as a +1.1% TPOT regression on test C — exactly what you’d architecturally expect. Nobody measures this; the Qwen model card doesn’t mention it either. If the model-card quality benchmarks (SWE-bench, AIME, etc.) don’t improve by at least a comparable margin, the shared expert is more tax than investment.

3. vLLM resolves 3.6 as 3.5

The Qwen3_5MoeForConditionalGeneration resolve on loading 3.6 surprised me. It’s declared that way in config.json; vLLM handles it accordingly. This means optimisations written for 3.5 (tuned MoE config, if there were one, custom scheduler settings) work on 3.6 without changes. It also means 3.6-specific architectural changes (shared expert) don’t run on a separate code path, they’re implicit through the weights. Good news for drop-in replacement.

4. MTP’s memory impact is a tax

+500 MB weights, PIECEWISE CUDA graph mode (regression from FULL_AND_PIECEWISE), −10% KV pool. Understand the hidden costs, and if your KV pool was tight, raise gpu-memory-utilization or lower max-num-seqs before turning on MTP.

5. TPOT and ITL are not the same, especially with spec decoding

Mean ITL is 55 ms, mean TPOT is 23.5 ms on test D. The difference: TPOT is the per-token decode rate, while ITL measures the gaps between streamed chunks, and under spec decoding one chunk can carry several accepted tokens. If you’re measuring client UX (the user’s perspective), ITL is interesting. If you’re measuring model capacity, TPOT. With spec decoding, don’t use them interchangeably.

6. Always run the control

If I’d measured only 3.6 on the fresh image and left the 3.5 phase7 numbers from the 2-week-old image, the “MTP concurrent win” wouldn’t have been clean — it would have mixed in the image change. The 25 minutes of phase7-fresh measurement paid off. Before launching into enthusiastic interpretation, measure what changed in the meantime.

Anatomy of the MTP acceptance rate

A few numbers worth knowing about Qwen3.6’s MTP in action:

Metric                                 | Value
Global acceptance rate                 | 72.53%
Position 0 (first spec token)          | 81.57%
Position 1 (second spec token)         | 63.48%
Accepted draft tokens per draft        | 1.45
Total draft tokens (4 tests, 8 seeds)  | 83,136
Of which accepted                      | 60,298

The 63% at position 1 is interesting. With num_speculative_tokens=1, the system would verify only the first spec token — 81.6% acceptance, but only 1 potential extra token per step. With num_speculative_tokens=2, 72.5% mean but up to 2 extra tokens per step. Question: does 81.6% × 1 or 72.5% × 2 yield better throughput? At first glance the 2× is better (0.816 vs 1.450 extra tokens per step), but that also depends on compute cost — 2 spec tokens double the verification work.
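The back-of-envelope version of that comparison, using the measured per-position rates and assuming a k=1 draft would keep accepting at today's position-0 rate (an assumption, since the heads were trained jointly):

# Expected extra tokens per decode step, from the global per-position
# (unconditional) acceptance rates reported above.
p0 = 0.8157   # position 0
p1 = 0.6348   # position 1

extra_k1 = p0        # only the first draft token exists with k=1
extra_k2 = p0 + p1   # 1.45, matching the accepted-per-draft figure

print(f"k=1: {extra_k1:.3f} extra tokens/step")
print(f"k=2: {extra_k2:.3f} extra tokens/step (+{extra_k2 / extra_k1 - 1:.0%} over k=1)")
# Whether that ~78% surplus of free tokens survives the doubled verification
# work is exactly the k=1 vs k=2 experiment left for a next iteration.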

I didn’t test num_speculative_tokens=1. Next iteration. By feel, 2 is close to the optimum on GB10; 3 would be too aggressive (position 2 would likely be around 40%, barely worth the compute).

What else could be done?

A few ideas that came up but I didn’t chase, because at some point the article has to close:

  • num_speculative_tokens=1 vs =2 vs =3 comparison: per the above.
  • MTP + max-num-batched-tokens combinations: in the previous article, 16k chunks made it into the production config. It would be interesting to see how 8k or 32k behave with MTP.
  • Intelligence / output quality measurement: this article is only about speed. The 3.6 vs 3.5 intelligence question (Hungarian KIE accuracy, tool calling reliability, reasoning) needs a dedicated eval harness, which I’m planning to build for DocAI. If 3.7 or Gemma5 lands, I’ll need objective numbers to know whether switching is worth it. That’s a separate article.
  • Spec decoding on 3.5 too?: 3.5 doesn’t ship built-in MTP, but EAGLE-2 or Medusa can be retrofitted. It would be interesting to see whether it delivers the same acceptance rate as a control. Not now.

Final production config

3.6 + MTP goes to production:

--max-model-len 131072
--max-num-batched-tokens 16384
--gpu-memory-utilization 0.45
--max-num-seqs 32
--kv-cache-dtype fp8_e4m3
--enable-chunked-prefill
--enable-prefix-caching
--no-async-scheduling
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

The chat agent, the single-document KIE pipeline path, and the D-like concurrent chat workload all perform better this way. The B-like long-context concurrent prefill (e.g. 4 PDFs processed in parallel) is rare in DocAI workloads for now, but when it becomes more common it will run on a separate instance, without MTP. Dual config, two ports, split workload — the end of production simplicity, but for understandable reasons.
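A sketch of what the split could look like from the client side once B-like traffic justifies its own instance. The ports and the routing rule are hypothetical placeholders, not anything running today.

from openai import OpenAI

# Hypothetical layout: one vLLM instance with MTP for chat and single-document
# KIE traffic, one without MTP for long-context parallel prefill.
MTP_CLIENT = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
PLAIN_CLIENT = OpenAI(base_url="http://localhost:8001/v1", api_key="unused")

def pick_client(prompt_tokens: int, concurrent_prefills: int) -> OpenAI:
    """Send B-like traffic (long context, several parallel prefills) to the non-MTP instance."""
    if prompt_tokens >= 8000 and concurrent_prefills >= 4:
        return PLAIN_CLIENT
    return MTP_CLIENT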

The lesson

At the end of my previous article, I wasted 6 hours on Triton MoE tuning and my production got 5-7% worse. I wrote the pessimism out of my system. Now Qwen3.6 landed, I ran the benchmark, got basically the same result, and then flipped a single flag — and on a workload where I expected regression (16-concurrent stress), +24% throughput came out.

The story isn’t “great new model, let’s switch.” Vanilla 3.6 and 3.5 are equally fast. The story is that the new model brought new capabilities, and among them MTP happens to work in symbiosis with the GB10 architecture — and nobody at the Qwen team, vLLM, or NVIDIA could have predicted the concrete numbers in advance. You had to measure.

Qwen3.6 isn’t good because it’s smarter (the Qwen numbers say it is, but that needs a separate eval harness). It’s good because flipping two flags makes DGX Spark concurrent throughput 24% higher. And if this reproduces on other Blackwell-generation unified-memory hardware (Jetson Thor, DGX B200), this isn’t an edge case.

Acknowledgements

Throughout the investigation, Claude was my partner in the terminal as log-parser and experiment-proposer. Benchmark script parameterisation, docker-compose variant creation, interpretation review — all on it. When I first looked at the phase8-mtp results, my reaction was “well, it ruined test B, and C, who knows why, is like this.” Claude was the one who held that the C +24% wasn’t a measurement error, and suggested the memory-bandwidth-bound hypothesis, which the numbers later confirmed.

If you need a partner for a two-hour benchmark marathon hunched over logs and JSON files, it’s a good choice.


System: NVIDIA DGX Spark, GB10 (SM 12.1), 128 GB LPDDR5x unified memory
Driver: NVIDIA 580.142, CUDA 13.0
Model: Qwen/Qwen3.6-35B-A3B-FP8 (on this day resolved by vLLM as Qwen3_5MoeForConditionalGeneration, 40 layers, E=256 MoE + 1 shared expert)
vLLM: 0.19.1rc1.dev328+g18013df6a (cu130-nightly image, pulled 2026-04-18, +pandas Dockerfile layer)
Benchmark tool: vllm bench serve (built-in)

All JSON results, docker-compose files, MTP acceptance metrics, and the phase7-fresh control measurement are available — if you’re interested in reproduction or the detailed percentile distributions, get in touch.