I went looking at Gemma4, and found MTP
Or: the DocAI JSON-KIE workload is MTP’s architectural soulmate — 99% draft acceptance, and why the 72.5% global figure hides it
12 days ago I wrote an article about how Qwen3.6 MTP delivered +24% throughput on the 16-concurrent stress test — exactly where it should theoretically have been neutral or negative. Back then I said the explanation was probably the GB10’s unified memory architecture: MTP draft verification fills unused compute in memory-bandwidth-bound batches.
This time the Gemma4 model family was released, and I figured I’d run a quick pass over it. Inside DocAI every new model release has to go through the candidate eval pipeline: it measures Hungarian KIE accuracy, measures speed, and decides — does it replace the prod model or not.
The short answer: no. Gemma4 scores F1=0.890 on the 34-document KIE corpus, Qwen3.6 scores F1=1.000 on the same one. Single-stream decode lags Qwen3.6 across the board (Qwen holds a 30-80% throughput advantage). Prod stays where it is.
But that wouldn’t make a whole article. The story is that the speed measurement, as a side effect, surfaced something that was missing from the previous post: MTP acceptance rate varies dramatically per workload, and on the JSON-KIE workload it’s 99%. The 72.53% global rate from the previous article completely hid this. The DocAI canonical workload — structured JSON extraction from Hungarian invoices — is exactly the generation pattern multi-token prediction architecturally fits best.
If you only want the punchline: this isn’t luck, it’s designable. Details below.
The setup
spark-prod has been running on Qwen3.6-35B-A3B-FP8 with MTP since I published the previous article. The production config has changed in one small detail: max-model-len was raised from 131072 to 262144, and the 20.79 GiB KV pool sustains this 256K context at 6.85× concurrency. Everything else is the same: chunked prefill, prefix caching, fp8_e4m3 KV cache, MTP num_speculative_tokens=2.
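For reference, a rough Python-API equivalent of that prod engine config. The real box launches vLLM inside Docker via `vllm serve`; parameter names follow recent vLLM, and the speculative `method` string is my assumption based on the qwen3_next_mtp-style draft head, so treat this as an illustrative sketch, not the prod launch line.

```python
# Rough Python-API equivalent of the spark-prod engine knobs; the real deployment
# is `vllm serve` inside Docker. Parameter names follow recent vLLM, and the
# speculative "method" string is an assumption, not copied from the prod config.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3.6-35B-A3B-FP8",
    max_model_len=262144,                  # raised from 131072
    kv_cache_dtype="fp8_e4m3",             # FP8 KV cache, ~20.79 GiB pool on the GB10
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    speculative_config={
        "method": "qwen3_next_mtp",        # assumed MTP draft-head method name
        "num_speculative_tokens": 2,
    },
)
```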
spark-dev — which is otherwise the home of the Qwen3.5-122B-A10B-int4-AutoRound NVFP4 experiments — now hosts the Gemma4 candidate eval. Same hardware (DGX Spark, GB10 SM 12.1, 128 GB LPDDR5x), same vLLM version, same Docker stack. I started Gemma4 out-of-the-box on the upstream vLLM build with default settings — no MTP, no speculative decoding (the model card doesn’t mention any, and there is no qwen3_next_mtp-like draft head architecture in Gemma4 either).
Two things to look at:
- KIE accuracy — can it correctly extract data from Hungarian invoices? F1 score on 34 documents.
- Speed — single-stream decode tok/s on 6 different workloads (Q&A, Code, JSON-KIE, Hungarian prose, long RAG with 8K and 32K context).
I’m building the KIE eval harness right now. The lessons from this article have to carry over to every future candidate eval — Qwen3.7, Gemma5, or whatever lands next. Ground rule: every model is measured on the same corpus, with the same sampling config, with the same prompt template. The differences have to come from the model, not from the environment.
The KIE eval harness
The corpus consists of 40 Hungarian invoice documents, of which 34 are usable (6 stripped multimodal images — those would need a separate VL pipeline, which is out of scope for now). Some of the docs are simple header extraction (date, seller, buyer, tax IDs, totals, payment method), some are line-item extraction (__items subdir).
What the harness measures:
- JSON validity — every model output is parsed; if it isn’t valid JSON, that’s already an error
- Field-level F1 — field by field comparison against GT, counts TP/FN/FP and mismatches
- Latency — subdir-level processing time
- Token counts — input and output token averages
The string_strict type compares strings strictly, date_iso normalizes to ISO format, number_strict allows numerical tolerance. What’s missing for now: string_loose (case + whitespace tolerance), string_normalized (Unicode + diacritic normalization). Their absence will matter for some of the Gemma4 errors specifically, as we’ll see in a moment.
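A minimal sketch of how those three implemented types behave. This is my reconstruction, not the harness source; the accepted date formats in particular are illustrative.

```python
# Minimal sketch of the three field types the harness implements today; the
# date formats listed here are illustrative, the real comparator may accept more.
from datetime import datetime

def field_match(field_type: str, gt, pred, tol: float = 0.01) -> bool:
    if gt is None or pred is None:
        return gt == pred                                  # both missing counts as a match
    if field_type == "string_strict":
        return str(gt) == str(pred)                        # exact match, no normalization
    if field_type == "date_iso":
        def iso(v):                                        # normalize to yyyy-mm-dd
            for fmt in ("%Y-%m-%d", "%Y.%m.%d.", "%Y.%m.%d", "%d/%m/%Y"):
                try:
                    return datetime.strptime(str(v).strip(), fmt).date().isoformat()
                except ValueError:
                    continue
            return str(v).strip()
        return iso(gt) == iso(pred)
    if field_type == "number_strict":
        return abs(float(gt) - float(pred)) <= tol         # small numeric tolerance
    raise ValueError(f"unknown field type: {field_type}")
```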
Two models, same corpus:
qwen36-baseline → http://10.10.0.4:8355 (prod model, with MTP)
gemma4 → http://10.10.0.4:8355 (model_override flag)
Sampling: temperature=0.0, max_tokens=8192, enable_thinking=false. The prod config serves in non-thinking mode, and the benchmark runs the same way.
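A sketch of what each harness request roughly looks like. This is a reconstruction, not the harness source; passing `enable_thinking` via `chat_template_kwargs` is how recent vLLM OpenAI-compatible endpoints take it, adjust to your build.

```python
# Sketch of the request the harness sends; both models get the identical payload,
# only the model name differs (the Gemma4 run goes through the model_override flag).
import requests

SYSTEM_PROMPT = "..."        # the shared KIE prompt template (placeholder here)
document_text = "..."        # one invoice document from the corpus (placeholder)

payload = {
    "model": "Qwen/Qwen3.6-35B-A3B-FP8",   # the gemma4 run swaps this via model_override
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": document_text},
    ],
    "temperature": 0.0,
    "max_tokens": 8192,
    "chat_template_kwargs": {"enable_thinking": False},   # non-thinking, like prod
}
resp = requests.post("http://10.10.0.4:8355/v1/chat/completions", json=payload, timeout=600)
answer = resp.json()["choices"][0]["message"]["content"]
```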
KIE results
The full table:
| Metric | Qwen3.6 | Gemma4 |
|---|---|---|
| Subdirs successful | 34 / 34 | 34 / 34 |
| JSON validity | 100.0% | 100.0% |
| Overall F1 | 1.000 | 0.890 |
| Precision | 1.000 | 0.888 |
| Recall | 1.000 | 0.892 |
| TP / FN / FP / Mismatch | 214 / 0 / 0 / 0 | 191 / 3 / 4 / 20 |
| Latency (avg) | 6687 ms | 10396 ms |
| Input tokens (avg) | 6734 | 5293 |
| Output tokens (avg) | 343 | 342 |
Something interesting jumps out immediately: Gemma4 eats fewer input tokens (5293 vs 6734) on the same corpus. And it’s still slower. This means the Gemma tokenizer encodes Hungarian text more efficiently — splits it into fewer tokens — but the decode is slower. The tokenizer-efficiency vs decode-speed split is a valuable metric: if you’re context-size-bound, tokenizer sensitivity can be a competitive advantage, and that one goes to Gemma4. But if throughput is the metric, Qwen3.6 wins.
Latency is 1.55× — that’s a lot, but not the whole picture. Latency is just the single-document processing time. In production a 16-concurrent batch would completely rewrite this — and Qwen3.6 (as I measured in the previous article) excels there. I didn’t run this concurrently this time, but single-stream is plenty for a candidate eval: if it’s this much worse single-stream, concurrent isn’t going to flip it.
0.890 F1 isn’t a bad number in absolute terms — on most public KIE benchmarks that’s respectable. But Qwen3.6 sits at 1.000 on the same corpus. The delta hurts.
Anatomy of Gemma4’s 27 mistakes
Gemma4 produced 27 non-matching fields (20 mismatches plus 3 FN and 4 FP). Going through them, they break into three categories.
Actual model errors (~12)
These are the painful ones. The model misread or hallucinated the field:
- doc77 `partner_name`: “Baranyai Kéményseprő-ipari Szolgáltató Kft.” → “Baranyai Kéményseprô-ipari Szolgáltató Kft.” The difference is “ő” → “ô”. This is a Hungarian character encoding stumble, which is embarrassing in a Hungarian SME accounting system. NAV invoice validation rejects on a difference like that.
- doc84 `partner_name`: “MVM Next Energiakereskedelmi Zrt.” → “MVM Next Energetikai Zrt.” Two different companies. This is hallucination.
- doc84 `invoice_number`: 846802113701 → 84680211. Truncation — the model didn’t read the invoice number all the way through.
- doc84 `total_vat_amount`: 1990 → 1990.5 and `total_gross_amount`: 9594 → 9594.5. Decimal artefact — the model invented half forints.
- doc109 `payment_method`: TRANSFER → CARD. Wrong category.
- doc77 `payment_date`: 2025-07-10 → 2025-06-30. Different date.
- doc94 `payment_date`: 2025-03-12 → null. Field lost.
- doc94 `payment_method`: TRANSFER → OTHER.
- doc71 `partner_person_name` and `partner_bank_account` both not extracted (FN).
These 12 errors are the part of the 27 that really matters. The other 15 are either systematic formatting differences or comparator weakness.
Systematic items pattern (~11)
This one is interesting. On every __items subdir Gemma4 scored F1=0.000 where Qwen scored 1.000. The cause: Gemma concatenates the period/timeframe into the line-item name field, which the prompt didn’t ask for:
| Field | Qwen / GT | Gemma |
|---|---|---|
| name | “GitHub Copilot Usage” | “GitHub Copilot Usage Feb 01, 2026 - Feb 28, 2026” |
| name | “ChatGPT Plus Subscription” | “ChatGPT Plus Subscription Aug 22 - Sep 22, 2025” |
| name | “Premium plan” | “Premium plan Core K3 Ipari Szoftverek 2 Year May...” |
The string_strict comparator drags this to 0. Whether this is “wrong” is debatable — from an information standpoint Gemma is giving you more, not less. But in the DocAI pipeline the canonical line-item name is what we need, because downstream item-pairing is built on it, and in the NAV invoice standard the line-item name is at most the description, not the full text. So from a production standpoint this is an error — it just isn’t the same weight as the “different company name” class. With a string_loose field-type the model would get partial credit, but for production data-standard purposes Qwen’s canonical form is the more valuable one.
GT and comparator weaknesses (~4)
Here Gemma’s choice is debatable, but the comparator rules strictly against it:
- doc109 `partner_taxnumber`: GT “13826701-2-41 / HU13826701” — two values in one cell! Gemma gave the first variant. The comparator doesn’t know the “any of” relation.
- doc94 `partner_taxnumber`: GT “10433748-2-44”, pred “HU10433748” — same tax number, two formats (HU domestic vs EU VAT).
- doc70 `invoice_number`: GT null, pred `ch_3Rnky7JFr6CCHwIi1zOC13KU` — this is a Stripe charge ID, which is a valid invoice reference, just not the one we expected.
- doc62/63/68 `payment_method`: GT null, pred “OTHER” — if GT says unknown and the model says “unknown” (= OTHER), it’s debatable whether that’s an FP.
If we handled these 4 in the comparator (alternative format support, multi-value field, null-vs-OTHER tolerance), Gemma F1 would rise to ~0.92. Still meaningfully worse than Qwen’s 1.000. The real 12 model errors stay, and they’re systematic in nature (Hungarian characters, hallucinated company name, truncated number).
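If the comparator ever grows these relaxations, a minimal sketch of what they could look like. The helper names are mine, not the harness’s; none of this exists in the harness yet.

```python
# Sketch of the three comparator relaxations discussed above: multi-value GT cells,
# HU-vs-EU tax number formats, and null-vs-OTHER tolerance. Hypothetical helpers.
import re

def tax_number_core(value: str) -> str:
    """'13826701-2-41' and 'HU13826701' both reduce to '13826701'."""
    return re.sub(r"\D", "", value.removeprefix("HU"))[:8]

def matches_any(gt_cell: str, pred: str) -> bool:
    """GT cells like '13826701-2-41 / HU13826701' mean 'any of these'."""
    return any(pred.strip() == alt.strip() for alt in gt_cell.split("/"))

def payment_method_match(gt, pred) -> bool:
    """Treat a missing GT and a predicted 'OTHER' as agreement on 'unknown'."""
    return gt == pred or (gt is None and pred == "OTHER")
```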
Qwen3.6 carried this 34-doc corpus through flawlessly. Not a single hallucinated field, not a single ő/ô slip, not a single truncated number. That’s strong Hungarian language pretraining + strong instruction following.
Bottom line on the KIE part: Gemma4 takes the baseline from 1.000 down to 0.890, and a large portion of that is real model accuracy loss, not just comparator noise. We’re not swapping for this in production.
The speed results
Single-stream, decode tok/s, median of 3 runs:
| Workload | Qwen3.6 + MTP | Gemma4 | Qwen advantage |
|---|---|---|---|
| Q&A | 54.1 | 40.5 | +34% |
| Code | 69.7 | 40.1 | +74% |
| JSON-KIE | 69.4 | 40.0 | +74% |
| Hungarian | 52.0 | 40.2 | +30% |
| Long-RAG-8K | 67.0 | 38.3 | +75% |
| Long-RAG-32K | 64.4 | 35.7 | +80% |
An interesting pattern: Gemma4 is strikingly stable at around 40 tok/s on short and medium workloads, dropping only slightly on the 32K context (35.7 tok/s). Qwen3.6, meanwhile, shows two speeds: 52-54 tok/s on “free prose” workloads (Q&A, Hungarian) and 67-70 tok/s on structured workloads (Code, JSON-KIE, Long-RAG).
What causes this duality? The MTP acceptance rate gives the answer.
The MTP surprise
The script logs the accept= field for every measured run — it pulls the vllm:spec_decode_num_accepted_tokens_total and _draft_tokens_total counters from the vLLM /metrics endpoint and computes the acceptance rate from the difference between the two snapshots. Per-workload acceptance for this run:
| Workload | MTP Acceptance | Decode tok/s | Note |
|---|---|---|---|
| Q&A | 60.81% | 54.1 | Open-ended Hungarian question |
| Code | 97.66% | 69.7 | Writing a Python function |
| JSON-KIE | 99.01% | 69.4 | Structured extraction |
| Hungarian | 56.06% | 52.0 | Accounting definition prose |
| Long-RAG-8K | 90.71% | 67.0 | Context-bound answer |
| Long-RAG-32K | 89.33% | 64.4 | Context-bound, longer |
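The accept= figures above come straight from two Prometheus snapshots. A minimal sketch of the measurement, with the counter names as they appear on this build’s /metrics (the real logic lives in benchmark.py, and the parsing here is deliberately naive):

```python
# Snapshot the two spec-decode counters before and after a run;
# acceptance = delta(accepted) / delta(draft).
import re
import requests

BASE_URL = "http://10.10.0.4:8355"
COUNTERS = (
    "vllm:spec_decode_num_accepted_tokens_total",
    "vllm:spec_decode_num_draft_tokens_total",
)

def spec_decode_counters(base_url: str) -> dict[str, float]:
    text = requests.get(f"{base_url}/metrics", timeout=10).text
    totals = {}
    for name in COUNTERS:
        # sum every labeled series of the counter
        pattern = re.escape(name) + r"(?:\{[^}]*\})?\s+([0-9.eE+-]+)"
        totals[name] = sum(float(v) for v in re.findall(pattern, text))
    return totals

before = spec_decode_counters(BASE_URL)
# ... run one measured workload here (e.g. the JSON-KIE prompts) ...
after = spec_decode_counters(BASE_URL)

accepted = after[COUNTERS[0]] - before[COUNTERS[0]]
drafted = after[COUNTERS[1]] - before[COUNTERS[1]]
if drafted > 0:
    print(f"accept={accepted / drafted:.2%}")
```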
And here’s the story. The 99% acceptance on JSON-KIE is not a measurement error — Code at 97.7% and Long-RAG at ~90% reinforce the same pattern. On structured, deterministic-output workloads the draft head almost always nails the next token.
The 72.53% global acceptance rate from the previous article is an average that hides how dramatically the workload pattern matters:
- Free Hungarian/English prose (Q&A, Hungarian): 56-61%. High entropy, many valid continuations. The draft head misses often.
- Structured program code (Code): 97.7%. The syntax constrains the next-token space so much (`def`, `:`, indent, `return`...) that the draft head almost always hits.
- JSON extraction (JSON-KIE): 99.01%. Even more constrained. After `{"name":`, a string. After `,`, a key. The grammar forces nearly every token.
- Long-context RAG (Long-RAG): 89-91%. The answer derives from the context, which gives high predictability — even in free text the source material constrains generation.
This means the DocAI primary workload — structured JSON extraction from Hungarian invoices — isn’t accidentally happy with MTP. It’s the theoretical best case.
What this means
In the previous article the +24% throughput from the 16-concurrent stress test was a surprise. I wrote then: “in memory-bandwidth-bound batches MTP draft verification fills unused compute”. That’s true — but incomplete. The full picture:
vllm bench serve measures with random prompts, which is a mixed workload. The 72.5% acceptance came out on that. On a real production DocAI workload, where 90% of requests are JSON extraction from a structured schema, the actual acceptance will sit in the 95%+ range. Which means the 16-concurrent +24% is an underestimate of the real-world DocAI gain.
How much better? Hard to calculate precisely from bench numbers, but a back-of-envelope: if 72.5% acceptance produced +24% on the bench, and the real workload would be 95%+ acceptance, the actual gain creeps toward 30-35%. This needs to be measured once with a dedicated production-trace replay — I don’t have the setup for it now; probably next month, in its own article.
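To sanity-check that guess, here is a toy model only: assume the gain over vanilla scales with the extra tokens a verification step yields, i.e. p + p² for num_speculative_tokens=2 with independent per-position acceptance p, and ignore verification overhead and batching effects entirely. Treat the output as a rough projection, not a measurement.

```python
# Toy scaling model for the back-of-envelope above; not a measurement.
def extra_tokens_per_step(p: float, k: int = 2) -> float:
    return sum(p ** i for i in range(1, k + 1))   # p + p^2 for k=2

bench_gain = 0.24                                   # measured at ~72.5% acceptance
projected = bench_gain * extra_tokens_per_step(0.95) / extra_tokens_per_step(0.725)
print(f"projected gain at 95% acceptance: +{projected:.0%}")   # lands in the mid-30s
```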
The other conclusion: MTP is not a universal win. If your workload is the “Hungarian prose” type here (chatbot for free Hungarian responses), the gain at 56% acceptance is much smaller. Between vanilla Qwen3.6 and MTP in the 52-54 tok/s prose range there isn’t a big gap (maybe a 5-7% edge). Other optimizations bring more there.
A decision heuristic worth keeping:
| Workload type | Expected MTP acceptance | MTP worth it |
|---|---|---|
| Structured output (JSON, XML, code) | 95-99% | Yes, big gain |
| Long-context RAG (answer from context) | 85-92% | Yes, good gain |
| Free Q&A, prose generation | 55-65% | Marginal, worth measuring |
| Multi-turn dialog (high variance) | ? | Measurement-dependent |
The DocAI workload is in the first category. The prod config is optimal, this candidate eval reinforces it. Gemma4 isn’t replacing it, because there is no MTP equivalent and the baseline decode is also slower.
Other observations
Gemma4 is much more verbose than Qwen3.6 on the same prompt:
| Workload | Qwen3.6 out_tok | Gemma4 out_tok | Note |
|---|---|---|---|
| Q&A | 164 | 256 | Gemma runs to max-token |
| Code | 190 | 512 | Gemma uses the full max-token |
| JSON-KIE | 301 | 303 | here they match |
| Hungarian | 490 | 381 | Qwen longer |
| Long-RAG-8K | 196 | 177 | similar |
| Long-RAG-32K | 208 | 185 | similar |
The “Code” task (“Write a Python function that performs binary search”) produces a 190-token response on Qwen. Gemma4 fills the entire 512 tokens. It pads with duplicate examples and longer explanations, or trails off into repetition. This is instruction-following quality — Qwen knows when a task is done. Gemma shows an “as long as max-token is left, I keep going” pattern.
Throughput numbers are therefore not strictly 1:1 comparable — a longer output takes more time, and Gemma4 keeps a steady speed while Qwen3.6 stops sooner. From a production standpoint shorter-but-correct is preferred, because the downstream pipeline then works from fewer tokens.
Gemma4’s input tokenizer efficiency, on the other hand, is better: average input on the 34-doc corpus is 5293 vs 6734 tokens. That’s ~21% fewer tokens for the same content. If you’re context-size-bound (long documents, KV cache pressure), the Gemma tokenizer is favorable. Worth slotting into the “tokenizer efficiency” tab for the next eval round.
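A sketch for that “tokenizer efficiency” tab: tokenize the same Hungarian snippet with both tokenizers and compare counts. The model IDs below are placeholders; point them at whatever the two served models actually resolve to on the box.

```python
# Compare token counts of the two tokenizers on identical Hungarian text.
from transformers import AutoTokenizer

SAMPLE = (
    "Számla sorszáma: 2025/000123, fizetési mód: átutalás, "
    "fizetési határidő: 2025-07-10, bruttó összeg: 12 700 Ft"
)

for model_id in ("Qwen/Qwen3.6-35B-A3B-FP8", "google/gemma-4"):   # placeholder IDs
    tok = AutoTokenizer.from_pretrained(model_id)
    print(f"{model_id}: {len(tok.encode(SAMPLE))} tokens")
```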
TTFT, where it’s fair
Qwen3.6 only, fresh cold prefill (with nonce):
| Workload | TTFT | Note |
|---|---|---|
| Q&A (~50 in) | 0.13s | short prompt, MTP overhead |
| Code (~50 in) | 0.13s | same |
| JSON-KIE (~150 in) | 0.20s | structured prompt |
| Hungarian (~30 in) | 0.14s | short Hungarian prompt |
| Long-RAG-8K | 0.55s | 8K context cold prefill |
| Long-RAG-32K | 1.18s | 32K context cold prefill |
The 32K cold prefill at 1.18s on the GB10 is consistent with the earlier production benchmark — about 27K tok/s prefill throughput. The effective ratio at 8K is similar. These numbers will be the baseline for future candidate evals.
Lessons
Seven lessons, in order of importance:
1. Think about MTP acceptance rate at the workload level
The 72.5% global average hides that on structured-output workloads it’s 99%, on Hungarian prose it’s 56%. Measure MTP on the most important pattern of your workload, not on a benchmark mix. If you generate JSON, MTP is nearly free extra throughput. If you’re a Hungarian chatbot, marginal.
2. The DocAI workload is the MTP best case
This isn’t accidental, and it isn’t luck. Structured JSON extraction is the most predictable pattern for the draft head — every token constrained by the grammar. Qwen3.6 + MTP + DocAI workload is an architectural symbiosis.
3. Prefix caching kills your benchmark — turn it off or bust it
vLLM’s --enable-prefix-caching is a good thing in production, but in a benchmark every repeated prompt becomes a cache hit and the TTFT measurement loses meaning. Either stop the server and restart with --no-enable-prefix-caching, or prepend an 8-hex-char nonce to the system message. My 32K Long-RAG TTFT jumped from 0.19s to 1.18s after the fix. The 0.19s wasn’t a real speed.
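The nonce trick, roughly as benchmark.py does it: prepend 8 random hex characters to the system message so no two runs share a prefix, which forces a cold prefill even with prefix caching left on. A minimal sketch, assuming messages[0] is the system message:

```python
# Prefix-cache busting: unique 8-hex-char nonce per run on the system message.
import secrets

def bust_prefix_cache(messages: list[dict]) -> list[dict]:
    nonce = secrets.token_hex(4)                      # 8 hex characters
    busted = [dict(m) for m in messages]
    busted[0]["content"] = f"[{nonce}] {busted[0]['content']}"
    return busted
```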
4. Hungarian character encoding has to be tested by default on a new model
Gemma4’s “ő” → “ô” swap on a single field value decides whether a Hungarian invoice processing pipeline is acceptable. NAV invoice validation matches field values strictly, and a single diacritic change leads to rejection. For Hungarian use cases every new model must be tested with Hungarian GT, not just “the M-Bench number is good”.
5. KIE F1 + decode tok/s are two orthogonal metrics
Accuracy and speed lose or win independently. Gemma4 is worse on both than Qwen3.6 here — which simplifies the decision. But in other cases (e.g. a slower-but-more-accurate model) the trade-off is explicit. The eval harness measures both, and the report shows both.
6. The string_strict comparator is brutal on free text
The 11 items mismatches on Gemma4 show that exact string match isn’t always useful. A field-type system is worth building: string_strict (IDs, numbers), string_loose (case+whitespace tolerance), string_normalized (Unicode normalization), enum_set (categorical fields). This is the next harness iteration, and without it the “Gemma 0.890” figure is partly a comparator artefact. But the 12 real model errors stay, and they aren’t going away.
7. The model candidate procedure works, and it has to stay
A year from now this article, numbers included, will be the reference showing that the next candidate gets measured the same way — Qwen3.7, Gemma5, Llama5, or whatever lands next. Same corpus, same benchmark, same methodology. Delta numbers stay meaningful and don’t get tangled up with methodology changes. This discipline cannot be retrofitted — it has to be built in now.
What else can be done
A handful of ideas that came up but didn’t fit this round:
- Concurrent benchmark on Gemma4 too: `benchmark.py` needs a `--concurrency N` flag, and it should be run at 1, 4, 8, 16 on both models. It would show how Gemma4 scales in concurrent batch — my guess is worse than Qwen + MTP, but the numbers are needed.
- MTP `num_speculative_tokens=1` vs `=2` vs `=3`: this was left open from the previous article. With 99% position-0 acceptance I’m curious how a `=3` config performs on JSON-KIE.
- Production trace replay: 24 hours of real DocAI traffic replayed with MTP and vanilla. This would give the real MTP gain on the production workload, not on the synthetic benchmark mix. Planned for August.
- String-type system in the comparator: `string_loose`, `string_normalized`, `enum_set`. This would push Gemma4 F1 to ~0.92, the fairer number. Qwen still 1.000 though.
- Multimodal corpus extension: the 6 stripped multimodal images are skipped for now. Running the Qwen3-VL-Embedding-2B classifier + Qwen3.6 chain would bring those into the corpus. A next harness iteration.
Production decision
spark-prod stays on the Qwen3.6-35B-A3B-FP8 + MTP config. Gemma4 isn’t replacing it:
- KIE F1 0.890 vs 1.000 — loss
- Decode tok/s: Qwen3.6 30-80% faster on every workload — loss
- No MTP-equivalent spec decoding — structural disadvantage on the DocAI workload
- Hungarian character encoding errors — risk
Gemma4 might be interesting for a different use case (shorter input tokens, better tokenizer efficiency), but it doesn’t perform in the DocAI candidate role.
Looking forward: the eval pipeline that just ran is reproducible, and every new release will go through it. The next candidate will be Qwen3.7 if it lands by year end, or Gemma5, or something new. The pipeline stays the same and the numbers will be year-over-year comparable.
System: NVIDIA DGX Spark, GB10 (SM 12.1), 128 GB LPDDR5x unified memory
Driver: NVIDIA 580.142, CUDA 13.0
Models: Qwen/Qwen3.6-35B-A3B-FP8 (prod, MTP num_speculative_tokens=2) vs Gemma4 (out-of-the-box, no spec decoding)
vLLM: 0.19.1rc1.dev328+ (cu130-nightly image)
Benchmark tool: custom benchmark.py (chat completions streaming, nonce prefix-cache busting)
KIE corpus: 34 Hungarian invoice documents (out of 40, 6 stripped multimodal skipped)
All JSON results, eval harness output, benchmark logs, and the patched version of benchmark.py are available. If you’re interested in reproduction, drop a line.