Engineering blog

Deep-dives from the engine room

Honest, detailed technical articles from the development of DocAI: LLM inference tuning, GPU optimisation, document processing pipelines, enterprise AI architecture. Negative results and lessons learned included. New post every week.

Case study · ~6 min

“A new colleague who never asks for lunch” — The Gree Hungary DocAI story

How did Gree Hungary in Szarvas turn a multi-day approval process into same-day flow? 30–70 documents per day, automated partner verification against four authoritative public sources, 1+ hour daily savings on payments alone, 3–4 month payback — in the honest words of our first customer.

Engineering · ~16 min

I went looking at Gemma4, and found MTP

Gemma4 candidate eval on the DocAI Hungarian invoice KIE corpus: F1 0.890 vs Qwen3.6's 1.000, single-stream decode 30–80% slower. But a side effect of the speed measurement revealed that the MTP acceptance rate on the JSON-KIE workload is 99% — a detail the previous article's 72.5% global figure hid completely. The DocAI workload is MTP's architectural best case.

Engineering · ~18 min

A 122B model on a single DGX Spark: measured for real

Qwen3.5-122B-A10B NVFP4 on a single Spark, vLLM 0.19.2, with MTP: 30 tok/s on JSON-KIE at 100% MTP acceptance, 64 tok/s aggregate across 4 concurrent users — and the stress test that breaks the Spark when 100K concurrent contexts hit. Production-relevant memory budgets, prefix caching, and an honest closing verdict: is it worth putting into DocAI?

Engineering · ~15 min

Qwen3.6 delivered where it shouldn’t have

Qwen3.6-35B-A3B-FP8 + MTP (multi-token prediction) benchmarked on DGX Spark's GB10 chip. The vanilla model is just as fast as 3.5, but in the 16-concurrent stress test MTP delivered +24% throughput and −56% TTFT — exactly where speculative decoding should, in theory, have been a net negative. The unexpected symbiosis of a unified memory architecture and speculative decoding.

Engineering · ~12 min

Two days, six hours of Triton tuning, one GB10, and a whole lot of nothing

A two-day vLLM + Triton MoE tuning marathon on DGX Spark with Qwen3.5-35B-A3B-FP8. In the end, the production config came out 5–7% worse. What I learned about the difference between pure-kernel and serving benchmarks — plus six concrete takeaways you can apply.

Coming soon

New posts land weekly

The next article is already in the works. If you don't want to miss it, subscribe via the contact form — or check back next week.