Honest, detailed technical articles from the development of DocAI: LLM inference tuning, GPU optimisation, document processing pipelines, enterprise AI architecture. Negative results and lessons learned included. New post every week.
How did Gree Hungary in Szarvas turn a multi-day approval process into a same-day flow? 30–70 documents a day, automated partner verification against four authoritative public sources, more than an hour saved daily on payments alone, and a 3–4 month payback, in the honest words of our first customer.
Gemma4 candidate evaluation on the DocAI Hungarian invoice KIE corpus: F1 0.890 vs Qwen3.6's 1.000, and single-stream decode 30–80% slower. But a side effect of the speed measurement revealed that the MTP acceptance rate on the JSON-KIE workload is 99%; the previous article's 72.5% global figure hid this completely. The DocAI workload is MTP's architectural best case.
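To see why the 72.5% vs 99% acceptance gap matters, here is a minimal back-of-the-envelope model. It uses the standard i.i.d. acceptance formula from the speculative decoding literature; the single draft head (gamma = 1) is an assumption for illustration, not a measured DocAI setting:

```python
# Expected tokens emitted per decode step under speculative decoding,
# using the i.i.d. per-token acceptance model from the speculative
# decoding literature. Real MTP acceptance is correlated across
# positions, so treat this as a first-order estimate.

def expected_tokens_per_step(alpha: float, gamma: int = 1) -> float:
    """alpha: per-token acceptance rate; gamma: draft tokens per step."""
    if alpha >= 1.0:
        return float(gamma + 1)            # every draft token accepted
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

# The two acceptance rates from the articles, with a single MTP draft head:
print(expected_tokens_per_step(0.725))     # ~1.73 tokens/step
print(expected_tokens_per_step(0.99))      # ~1.99 tokens/step
```

At 99% acceptance a single-head MTP step emits almost exactly two tokens, i.e. it runs at its theoretical ceiling, while the 72.5% global average would leave a quarter of that gain on the table.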
Qwen3.5-122B-A10B NVFP4 on a single Spark, vLLM 0.19.2, with MTP: 30 tok/s JSON-KIE at 100% MTP acceptance, 64 tok/s aggregate across 4 concurrent users, and the stress test that breaks the Spark when concurrent 100K-token contexts hit. Production-relevant memory budget, prefix caching, and an honest closing call: is it worth putting into DocAI?
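The long-context stress test is ultimately KV cache arithmetic. A minimal sketch of the memory budget follows; the layer and head dimensions are illustrative placeholders, not the real Qwen3.5-122B-A10B config:

```python
# Back-of-the-envelope KV cache budget for long contexts.
# Placeholder model dimensions; substitute the real config.

def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # 2x for K and V; bytes_per_elem=2 assumes an FP16/BF16 cache
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

GIB = 1024 ** 3
cfg = dict(layers=60, kv_heads=8, head_dim=128)   # assumed dimensions

one_ctx = kv_cache_bytes(100_000, **cfg)
print(f"one 100K-token context: {one_ctx / GIB:.1f} GiB")      # ~22.9 GiB
print(f"four concurrent:        {4 * one_ctx / GIB:.1f} GiB")  # ~91.6 GiB
```

With dimensions in that ballpark, four concurrent 100K-token contexts already claim roughly 90 GiB of KV cache on top of the weights, so it is no surprise the stress test finds the wall; prefix caching relieves this only when the long prefixes actually repeat.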
Qwen3.6-35B-A3B-FP8 + MTP (multi-token prediction) benchmark on DGX Spark's GB10 chip. The vanilla model is just as fast as 3.5, but on the 16-concurrent stress test MTP delivered +24% throughput and −56% TTFT, exactly where speculative decoding should, in theory, have gone negative. The unexpected symbiosis of a unified memory architecture and speculative decoding.
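The theoretical objection is a roofline argument: verification multiplies per-step FLOPs, so once batching makes decode compute-bound, speculation should slow things down. Here is a sketch of that crossover; every hardware number below is an illustrative assumption, not a measured GB10 spec:

```python
# Roofline-style estimate of speculative decoding under batching.
# All constants are illustrative assumptions, not measured GB10 specs.

def step_time_s(batch: int, draft: int, params: float,
                bw: float, flops: float) -> float:
    """Lower-bound decode step time: weight reads vs compute."""
    t_mem = params / bw                                # ~1 byte/param (FP8), read once per step
    t_comp = 2 * params * batch * (1 + draft) / flops  # ~2 FLOPs per param per token
    return max(t_mem, t_comp)

P, FLOPS = 3e9, 100e12   # assumed active params (A3B-style MoE), sustained FLOP/s
ACCEPTED = 1.7           # ~tokens emitted per MTP step at ~70% acceptance

for name, bw in (("bandwidth-lean unified", 270e9), ("HBM-class", 3e12)):
    for batch in (1, 16):
        base = step_time_s(batch, 0, P, bw, FLOPS)
        mtp = step_time_s(batch, 1, P, bw, FLOPS)
        print(f"{name:>22}, batch {batch:>2}: MTP speedup ~ {ACCEPTED * base / mtp:.2f}x")
```

Under these assumptions the HBM-class profile does go negative at batch 16, which is the textbook expectation; the bandwidth-lean unified-memory profile stays weight-read-bound, so the extra verification FLOPs ride along essentially for free, consistent in direction with the throughput gain the post measured.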
A two-day vLLM + Triton MoE tuning marathon on DGX Spark with Qwen3.5-35B-A3B-FP8. In the end the production config came out 5–7% worse. What I learned about the difference between pure-kernel and serving benchmarks, plus six concrete takeaways you can apply.
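The core lesson generalizes: a kernel tuner benchmarks one shape in a tight loop, while a serving engine sees a shifting mix of batch sizes every step. Below is a minimal sketch of the two measurement views; the plain matmul and the shape mix are hypothetical stand-ins for the real Triton MoE kernels:

```python
# Pure-kernel vs serving-style measurement of the same op.
# The matmul and shapes are stand-ins for the tuned Triton MoE kernels.
import torch

def bench_ms(fn, iters: int = 50) -> float:
    """Median wall time in milliseconds via CUDA events."""
    fn(); torch.cuda.synchronize()  # warmup
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(iters):
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    return sorted(times)[len(times) // 2]

d_model, d_ff = 4096, 14336
w = torch.randn(d_model, d_ff, device="cuda", dtype=torch.bfloat16)

# Tuner view: one fixed, tuner-friendly shape, hot caches.
x = torch.randn(256, d_model, device="cuda", dtype=torch.bfloat16)
print(f"fixed shape: {bench_ms(lambda: x @ w):.3f} ms")

# Serving view: per-step batch size fluctuates; score the whole mix.
mix = [1, 2, 4, 8, 16, 64, 256]   # hypothetical per-step batch sizes
xs = [torch.randn(b, d_model, device="cuda", dtype=torch.bfloat16) for b in mix]
avg = sum(bench_ms(lambda x=x: x @ w) for x in xs) / len(xs)
print(f"shape mix:   {avg:.3f} avg ms")
```

A config that wins the fixed-shape race can still lose on the mix, which is one way a two-day tuning win turns into a 5–7% production loss.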
The next article is already in the works. If you don’t want to miss it, subscribe through the contact form or check back next week.