
A 122B model on a single DGX Spark: measured for real

Qwen3.5-122B-A10B NVFP4 on a single Spark — how far it goes, where it breaks, and is it worth putting into DocAI

Ever since the February release of Qwen3.5-122B-A10B, the same question kept popping up in community threads: is it worth running the 122B-parameter MoE on a single Spark, or should we stick with the 35B-A3B? For three months the answer stayed fuzzy — or rather, nobody had connected the pieces: what the vLLM build, the checkpoint, MTP, NVFP4 quantization and the gpu_memory_utilization flag each contribute.

Now I’ve connected them.

This article documents what loading a 76 GB model into 121 GiB of unified memory on a GB10 chip looks like, how much MTP gives across different workloads (spoiler: not the same everywhere), why vLLM will autoselect the FLASHINFER_CUTLASS MoE backend — despite older threads claiming it would fail on SM121 — and what exactly “the Spark handles 4 concurrent 100K requests” means in practice.

Unlike previous articles, this is a single-Spark measurement. No cluster, no second machine, no tricks — just one DGX Spark, one NVFP4 checkpoint, and one production-relevant question: could DocAI run on this setup?

All numbers were measured on my own Spark on 25 April 2026, with the Sehyo/Qwen3.5-122B-A10B-NVFP4 checkpoint and the vLLM 0.19.2rc1.dev154+g1c2c1eb8b prebuilt wheel from eugr/spark-vllm-docker.

The landscape, briefly

There are currently three live NVFP4-quantized 122B-A10B checkpoints:

  • Sehyo/Qwen3.5-122B-A10B-NVFP4 — community-tested, quantized with vLLM’s llm-compressor, ships the 5 GB BF16 MTP weights inside extra_weights.safetensors. 81.5 GB.
  • txn545/Qwen3.5-122B-A10B-NVFP4 — quantized with NVIDIA Model Optimizer, MTP weights NOT included; manual merge required from the original 234 GB BF16 model. 75.6 GB.
  • RedHatAI/Qwen3.5-122B-A10B-NVFP4 — the official NVIDIA version, llm-compressor + save_mtp_tensors_to_checkpoint(). Similar size.

I picked Sehyo for two reasons: the MTP weights are already inside (saving an hour of extract-and-merge work), and most cross-validated community measurements are based on this checkpoint. txn545 might give slightly better quantization quality (NVIDIA’s official toolkit), but I’ll defer that for a later round when I have a Hungarian quality eval harness — for now the question is speed, and speed is bottlenecked by the architecture, not the quantizer.

The build I think everyone should be using

The community has been evangelizing eugr/spark-vllm-docker for half a year, and there’s a reason. The stock vLLM cu130-nightly image does not build the CUTLASS NVFP4 MoE kernels for SM121 — there is no precompiled FP4 GEMM for the GB10 architecture. The eugr build fixes exactly this: TORCH_CUDA_ARCH_LIST=12.1a, prebuilt FlashInfer and vLLM wheels from an automated pipeline, plus a mods/ directory of model-specific patches.

The build itself takes 2-3 minutes if the base image (vllm/vllm-openai:cu130-nightly) is already on the host. A source build (40 min FlashInfer + 20 min vLLM) is only needed if you want to test a custom commit or PR.

git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh

That’s it. The image is saved as vllm-node.

In the current state of the repo, recipes/qwen3.5-122b-fp8.yaml has cluster_only: true and tensor_parallel: 2 — meaning the maintainer sized the 122B FP8 for two Sparks. The Sehyo NVFP4 at ~78 GB, however, fits comfortably on a single Spark, so I ignored the recipe and started the vllm-node image directly with a custom vllm serve command.

A surprise: FLASHINFER_CUTLASS on SM121

Here came the first interesting bit. Community threads (kanthai/openclaw-spark, JungkwanBan/SPARK_Qwen3.5-122B-A10B-NVFP4) consistently say that the FlashInfer CUTLASS MoE FP4 backend fails on SM121, because the cvt.e2m1x2 PTX instruction isn’t supported on GB10. The workaround: VLLM_USE_FLASHINFER_MOE_FP4=0, fall back to the native cutlass_moe_fp4 path.

vLLM 0.19.2 + the fresh eugr build tells a different story. From the startup log:

INFO [nvfp4.py:283] Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend out of potential backends:
['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTEDSL_BATCHED',
 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN', 'EMULATION'].

vLLM autoselected FLASHINFER_CUTLASS, no environment variable, no forcing, nothing.

The repo’s git log shows the reason: d49fac1 Re-enable flashinfer_cutlass. The eugr build now pulls a FlashInfer/CUTLASS version that compiles for SM121. The community lore is therefore stale, and old forum threads should not be taken at face value. What was true six months ago isn’t necessarily true today.
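If you want to check which backend your own build selected without scrolling the startup log, here is a quick sketch (assuming the container is named vllm-122b-nvfp4, as in the run command further down):

import re
import subprocess

# Read the container log and pull out the NvFp4 MoE backend selection line
# quoted above.
res = subprocess.run(["docker", "logs", "vllm-122b-nvfp4"],
                     capture_output=True, text=True)
match = re.search(r"Using '(\w+)' NvFp4 MoE backend", res.stdout + res.stderr)
print(match.group(1) if match else "backend selection line not found")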

Another gotcha: huggingface-cli deprecated

Small but non-obvious: huggingface_hub 1.x, released in early 2026, removed the huggingface-cli command and replaced it with hf. huggingface-cli download ... prints an empty log and exits — you have to use the new CLI:

hf auth login   # mandatory due to the xet rate limit, even for public repos
hf download Sehyo/Qwen3.5-122B-A10B-NVFP4 \
   --local-dir /opt/vllm/qwen35-122B/models/Qwen3.5-122B-A10B-NVFP4-Sehyo

The xet backend (HF’s new CAS system) rate-limits anonymous clients — it throws 416 Range Not Satisfiable errors even for public repos if you’re not logged in. A quick hf auth login with a token solves it.

Downloading the 76 GB took ~15 minutes on a gigabit connection, authenticated.
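A post-download sanity check is worth the ten seconds: total safetensors size in the 76 GB ballpark and the MTP weights file present. A minimal sketch, using the local-dir path from the command above:

from pathlib import Path

ckpt = Path("/opt/vllm/qwen35-122B/models/Qwen3.5-122B-A10B-NVFP4-Sehyo")

# Sum all weight shards and check for the BF16 MTP drafter weights.
total = sum(f.stat().st_size for f in ckpt.rglob("*.safetensors"))
print(f"safetensors total: {total / 2**30:.2f} GiB")
print("MTP weights present:", (ckpt / "extra_weights.safetensors").exists())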

Anatomy of a cold start

docker run -d --name vllm-122b-nvfp4 \
  --runtime=nvidia --gpus all --privileged --network host --ipc=host \
  -v /opt/vllm/qwen35-122B/models/Qwen3.5-122B-A10B-NVFP4-Sehyo:/models/qwen35-122b-nvfp4:ro \
  -v ~/.cache/vllm:/root/.cache/vllm \
  -v ~/.cache/flashinfer:/root/.cache/flashinfer \
  -v ~/.triton:/root/.triton \
  -e NCCL_IGNORE_CPU_AFFINITY=1 \
  vllm-node \
  vllm serve /models/qwen35-122b-nvfp4 \
    --served-model-name qwen35-122b-nvfp4 \
    --port 8000 --host 0.0.0.0 \
    --max-model-len 131072 \
    --max-num-seqs 4 \
    --max-num-batched-tokens 8192 \
    --gpu-memory-utilization 0.90 \
    --kv-cache-dtype fp8 \
    --enable-prefix-caching \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
    --default-chat-template-kwargs '{"enable_thinking": false}'
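With a cold start this long, it is worth gating traffic on a readiness poll plus a one-shot smoke test against the OpenAI-compatible endpoint (the blue-green flow mentioned later). A minimal sketch, using the port and served model name from the command above:

import time
import requests

BASE = "http://localhost:8000"

# 1) Wait for the server to come up (a cold start is on the order of 12 minutes).
deadline = time.time() + 20 * 60
while time.time() < deadline:
    try:
        if requests.get(f"{BASE}/v1/models", timeout=5).status_code == 200:
            break
    except requests.ConnectionError:
        pass
    time.sleep(15)
else:
    raise SystemExit("vLLM did not become ready in time")

# 2) One short completion as a smoke test before routing any traffic.
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "qwen35-122b-nvfp4",
        "messages": [{"role": "user", "content": "Say 'ok'."}],
        "max_tokens": 8,
        "temperature": 0,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])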

Timing breakdown of the startup phases:

Phase                          Time (cold)   Time (warm cache)   Note
CUDA + tokenizer init          ~20s          ~20s                Constant
Loading weights (76 GB)        501s          557s                Disk-bound, EXT4
Drafter (MTP) loading          73s           84s                 5 GB BF16, same source
torch.compile (backbone)       36.5s         10.1s               Cache hit substantial
torch.compile (eagle_head)     6.6s          0.25s               Cache hit huge
FlashInfer autotune            ~10s          ~8s                 Per-config tuning
CUDA graph capture             12-17s        11-15s              PIECEWISE mode
Total                          ~12 min       ~12 min

Two things stand out.

First: the 76 GB weight load takes 8 min 21 sec — and this doesn’t improve from cache. vLLM tells you so explicitly in the startup log:

Filesystem type for checkpoints: EXT4. Checkpoint size: 75.89 GiB.
Available RAM: 41.40 GiB.
Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized
network FS (NFS/Lustre) and the checkpoint size (75.89 GiB) exceeds 90% of
available RAM (41.40 GiB).

Of the 121 GiB unified memory, only 41 GiB is available for file cache at startup (the vLLM container, other processes, and runtime structures take the rest). The 76 GB checkpoint doesn’t fit into the 41 GB cache, so every restart re-reads the entire disk. On NFS or Lustre, vLLM’s auto-prefetch could speed this up; on EXT4 it can’t.

Second: the torch.compile cache makes a real difference. Backbone compile drops from 36.5s → 10.1s, and the draft head from 6.6s → 0.25s. These are absolutely worth keeping in a mounted volume (~/.cache/vllm), otherwise every restart wastes 40+ seconds.

From a production perspective, a 12-minute startup is no worse than a normal microservice deploy window, but it does need a deploy strategy: a blue-green-style flow where the new container boots, smoke-tests, and only then takes traffic.

Memory budget in the 121 GiB unified pool

Model weights (NVFP4):           76.02 GiB
KV cache (fp8, 128K, 4 seq):     30.33 GiB    → 595,584 tokens
CUDA graph memory:                0.27 GiB
torch.compile / FlashInfer:      ~5 GiB
Container + Python + scheduler:  ~5 GiB
─────────────────────────────────────────
Total used:                     ~117 GiB

With gpu-memory-utilization 0.90, vLLM reserves 121.7 × 0.9 ≈ 110 GiB for itself, and from this it gets the 30 GiB KV cache pool. Bumping it to 0.92 would buy you +1 GiB of KV, but at OOM risk, because CUDA graph capture allocates at runtime.

The 595K-token KV pool is enough for 4 concurrent 128K contexts — 4 × 131072 = 524K, and vLLM’s block allocator distributes dynamically. From the startup log:

Maximum concurrency for 131,072 tokens per request: 12.89x

Which says: if requests run with ~50% average context utilization (which is realistic in the real world), it can handle ~13 concurrent requests. Worst case (everyone at max) ~4-5×.

Single-stream tok/s — 5 workloads, 4 setups

Here’s the headline summary from single-stream (concurrency=1) measurements. My benchmark script does 1 warmup + 3 measured runs per workload and reports the median. MTP-2 enabled everywhere, thinking off (--default-chat-template-kwargs '{"enable_thinking": false}'), temperature=0.

Workload       32K default   32K tuned   128K tuned   Description
Q&A            24.7          25.2        25.6         20-token Hungarian prompt, 256 tokens output
Code           30.1          29.7        30.1         Python binary search request, 300 tokens
JSON-KIE       31.0          30.7        30.9         Hungarian invoice extraction into JSON
Hungarian      23.7          23.9        24.6         200-word Hungarian summary
Long-RAG-8K    26.9          25.7        26.1         8K context + question, 256 tokens
Long-RAG-32K   n/a           n/a         26.4         30K context + question, 256 tokens

The differences between the three setups:

  • 32K default: --max-model-len 32768, --max-num-batched-tokens 4096, no prefix caching
  • 32K tuned: --max-num-batched-tokens 8192, --enable-prefix-caching added
  • 128K tuned: same, with --max-model-len 131072
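For reproduction, here is a minimal sketch (not the actual script) of how one single-stream data point is taken: a streaming chat completion, TTFT measured to the first content chunk, decode tok/s from the chunk rate afterwards (vLLM streams roughly one token per chunk). The real runs add the warmup and the median-of-3 on top of this.

import json
import time
import requests

BASE = "http://localhost:8000"

def measure(prompt: str, max_tokens: int = 256):
    t0 = time.time()
    ttft = None
    chunks = 0
    with requests.post(
        f"{BASE}/v1/chat/completions",
        json={
            "model": "qwen35-122b-nvfp4",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "temperature": 0,
            "stream": True,
        },
        stream=True,
        timeout=600,
    ) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            delta = json.loads(line[6:])["choices"][0]["delta"]
            if delta.get("content"):
                if ttft is None:
                    ttft = time.time() - t0   # time to first content chunk
                chunks += 1
    # Chunks after the first one approximate decode throughput.
    decode_tps = (chunks - 1) / (time.time() - t0 - ttft) if chunks > 1 else 0.0
    return ttft, decode_tps

print(measure("Write a Python binary search function.", 300))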

Two surprising observations.

First: the 128K setup is just as fast (slightly faster, even) as the 32K setup on small-context requests. I didn’t expect this — raising max_model_len should in theory allocate more KV blocks and complicate scheduling. In practice you don’t feel it. Production recommendation: --max-model-len 131072 once, don’t run a separate 32K instance. The unified memory budget barely changes either (30.33 GiB KV vs ~31 GiB at 32K).

Second: tuning gives almost nothing on single-stream. The --max-num-batched-tokens 4096 → 8192 change moves within ±1%. The vLLM warning at the default 4096 said it was suboptimal with MTP — that matters under concurrent load, not single-stream.

--enable-prefix-caching, however, hit hard in one place: it dramatically cut Long-RAG TTFT.

TTFT — where prefix caching earns its keep

Workload                  32K default TTFT   32K tuned TTFT   128K tuned TTFT
Q&A (~20 tok)             0.27s              0.27s            0.27s
Code (~30 tok)            0.28s              0.28s            0.28s
JSON-KIE (~200 tok)       0.44s              0.44s            0.44s
Hungarian (~30 tok)       0.30s              0.29s            0.29s
Long-RAG-8K (~7500 tok)   4.41s              2.59s            2.55s
Long-RAG-32K (~30K tok)   n/a                n/a              3.87s (warm) / 18.11s (cold)

On Long-RAG-8K, TTFT goes 4.41s → 2.59s (-41%) just from the --enable-prefix-caching flag. The benchmark script sends the same long prompt 4 times in a row, so runs 2-3-4 are prefix cache hits: the prefill compute for the fixed prefix is skipped, because its KV blocks are reused from the cache instead of being recomputed.

Long-RAG-32K’s first cold-run TTFT is 18.11 seconds, prefix-cache-hit 3.87s — vLLM caches almost the entire 30K-context prefix and only re-prefills the last few (non-fixed) tokens. That’s a 5× gain, and in real-world RAG (where the system prompt + the leading parts of retrieved chunks often contain fixed elements) it gives a tangible improvement in user-perceived latency.

vLLM emits a warning for the Mamba-cache + prefix-caching combo:

Mamba cache mode is set to 'align' for Qwen3_5MoeForConditionalGeneration
... Its support for Mamba layers is experimental.

In the Qwen3.5 hybrid GatedDeltaNet architecture, prefix-caching the Mamba layers is experimental. I didn’t see quality regressions in the measurements (Long-RAG-8K outputs are sensible), but in production it’s worth keeping the system prompt and chat_template_kwargs fixed and documented: if they change request-to-request, the prefix cache can’t hit.
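To make that concrete, a minimal sketch of structuring requests so the shared prefix stays byte-identical (names and prompts are illustrative, not from the production DocAI code): everything fixed goes first, the varying question goes last. This pays off whenever several questions hit the same document, which is a common DocAI pattern.

SYSTEM_PROMPT = "You are DocAI, an assistant answering from the provided documents."

def build_messages(doc_chunks: list[str], question: str) -> list[dict]:
    # Fixed prefix: system prompt + document context, always in the same order
    # and formatting. Any change here (even whitespace) invalidates the cache.
    context = "\n\n".join(doc_chunks)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
    ]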

MTP acceptance — workload-dependent, severely

This may be the most interesting data in the article. The MTP acceptance rate (how many draft tokens are accepted) depends dramatically on the workload type:

Workload       MTP-2 acceptance (single-stream)
JSON-KIE       100.00% (3/3 runs identical)
Code           94-99% (small run-by-run variance)
Long-RAG-32K   74-81%
Long-RAG-8K    71-78%
Q&A            67-74%
Hungarian      63-67%

The 100% acceptance rate on JSON-KIE is striking. In a 301-token output, every speculative token is hit by the draft. The explanation: JSON structure is so predictable (braces, repeating key names, commas, quotes) that the small MTP draft head trivially predicts the next tokens. The model only really “thinks” at field values — it just runs through the structure.

Natural-language workloads (Hungarian, Q&A) sit at ~63-70%, which matches community baselines.

Production consequence: for DocAI’s structured-output use cases (invoice KIE, proposal canvas JSON, company lookup result), MTP essentially doubles decode tok/s. On the natural-language chat agent, the gain is only ~1.4×.

The 100% acceptance suggests an experiment: it’s worth trying num_speculative_tokens=3. If the draft trivially hits 2 tokens, it likely hits 3 with 80%+ accuracy — and 80% × 3 tokens = 2.4 accepted tokens per step, more than the current 100% × 2. That’s for a future round.
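For reference, the experiment only changes the speculative-config flag; the expected-value arithmetic behind it, using the simple acceptance-rate × draft-length model from the paragraph above and an assumed 80% acceptance at length 3:

# The only change for the experiment (rest of the serve command unchanged):
#   --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
#
# Simple expected-value model: extra tokens per decode step
# = acceptance rate x draft length.
mtp2_json = 1.00 * 2   # measured JSON-KIE: 100% acceptance, 2-token draft -> 2.0
mtp3_json = 0.80 * 3   # assumption: 80% acceptance with a 3-token draft   -> 2.4
print(mtp2_json, mtp3_json)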

Concurrent stress — where the Spark shows itself

So far we’ve looked at single-stream numbers, which is the one-user scenario. In production this is rarely the case — multiple concurrent requests arrive, and aggregate throughput (not per-request) is what counts.

I wrote a benchmark_concurrent.py script that calls the API with N worker threads for a given duration, with mixed workload types. I tested three mix profiles:

  • uniform_short: 4 concurrent threads, all short prompts (Q&A + Code + JSON + Hungarian round-robin), 90 seconds
  • mixed_typical: 4 concurrent threads, 3 short + 1 Long-RAG-8K (the “DocAI realistic mix”), 90 seconds
  • long_only_128k: 4 or 2 concurrent threads, each with ~108K-token context, 90/180 seconds — the stress test
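The actual benchmark_concurrent.py isn’t reproduced here, but the core idea fits in a few lines: N worker threads hammer the endpoint for a fixed duration with a round-robin prompt mix, and aggregate tok/s is total completion tokens over wall time. A minimal sketch (the prompts are placeholders; the real script also records TTFT percentiles and reads MTP acceptance from the vLLM log):

import itertools
import threading
import time
import requests

BASE = "http://localhost:8000"
MODEL = "qwen35-122b-nvfp4"

def run_mix(prompts: list[str], concurrency: int, duration_s: int) -> float:
    done_tokens = [0] * concurrency
    requests_done = [0] * concurrency
    start = time.time()

    def worker(idx: int):
        cycle = itertools.cycle(prompts)
        # Keep issuing requests until the duration budget runs out; the last
        # in-flight request is allowed to finish, so wall time can exceed it.
        while time.time() - start < duration_s:
            resp = requests.post(
                f"{BASE}/v1/chat/completions",
                json={"model": MODEL,
                      "messages": [{"role": "user", "content": next(cycle)}],
                      "max_tokens": 256, "temperature": 0},
                timeout=600,
            )
            resp.raise_for_status()
            done_tokens[idx] += resp.json()["usage"]["completion_tokens"]
            requests_done[idx] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.time() - start
    print(f"{sum(requests_done)} requests, {sum(done_tokens)/elapsed:.2f} tok/s aggregate")
    return sum(done_tokens) / elapsed

# Example: the uniform_short profile, 4 threads for 90 seconds.
run_mix(["Short Q&A prompt", "Write a Python binary search",
         "Extract this invoice to JSON: ...", "Summarize in Hungarian: ..."],
        concurrency=4, duration_s=90)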

The numbers:

Mix              Concurrency   Duration   Reqs done   Aggregate tok/s   MTP accept
uniform_short    4             90s        20          63.86             76.13%
mixed_typical    4             90s        20          57.65             73.89%
long_only_128k   4             90s        8           14.65             75.89%
long_only_128k   2             180s       6           6.54              75.84%

The uniform_short aggregate hits 63.86 tok/s, 2.1× the single-stream JSON-KIE peak (30.9 tok/s). Exactly what you’d expect from an MoE: 4 concurrent requests activate different experts, so the 273 GB/s memory bandwidth is better utilized (only ~10B active parameters are read per token, and the 4 requests use different subsets).

mixed_typical at 57.65 tok/s is slightly lower because Long-RAG-8K ties up one of the 4 threads during prefill. The other 3 threads keep generating in the meantime, so the aggregate stays high.

The long_only_128k data, however, is the main lesson of the story:

The Spark wall: 4× 100K concurrently — sadly, no

The first measurement was 4 threads, 90 seconds — 0 requests completed. Looking at the vLLM log: prompt overflow. The workload script’s make_long_rag_messages(115000) sizing assumed 4 characters per token; with Hungarian text the actual tokenization came out ~1.14× higher → 130817 input + 256 output = 131073 tokens, one token over the 131072 limit. vLLM rejected every request with 400 Bad Request, and the script’s errors: 0 field was misleading: each request got an immediate response, so no explicit timeout ever fired.
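The obvious fix is to size prompts with the model’s own tokenizer rather than a character heuristic. A minimal sketch (the helper name is mine, not from the benchmark script; the path is the local checkpoint from the download step):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "/opt/vllm/qwen35-122B/models/Qwen3.5-122B-A10B-NVFP4-Sehyo",
    trust_remote_code=True,
)

MAX_MODEL_LEN = 131072
MAX_OUTPUT = 256

def fits(prompt: str) -> bool:
    # Count real tokens and keep headroom for the output budget.
    n = len(tok.encode(prompt))
    return n + MAX_OUTPUT <= MAX_MODEL_LEN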

Lowering the target to 95000 tokens (~108K actual Hungarian tokens), the real picture emerged:

c=2, 180 seconds:

  • 6 requests completed in 182.3 seconds → ~60s/request average
  • TTFT p50: 20 seconds (two parallel chunked prefills)
  • TTFT p95: 112 seconds (worst-case for a late arrival)
  • Decode p50: 20.6 tok/s per request (close to single-stream)
  • Aggregate: 6.54 tok/s

c=4, 90 seconds:

  • 8 requests completed in 106.2 seconds → ~52s/request
  • TTFT p50: 20.66s, p95: 40.33s
  • Decode p50: 6.2 tok/s per request
  • Aggregate: 14.65 tok/s

c=4 vs c=2 is 2.2× in aggregate — so more concurrency is slightly better for total throughput, but per-request decode tok/s collapses from 20.6 to 6.2 (a 3.3× drop). From an individual user’s perspective this is catastrophic: 6.2 tok/s means 1 token per 160ms — a 200-word Hungarian answer (~250 tokens) becomes ~40 seconds wall-clock, plus the 20s prefill, plus queueing.

The Spark is not suitable for multi-user, context-heavy interactive load with this 122B-NVFP4 model. What it’s good for:

  1. Single-user interactive DocAI chat with short prompts (24-31 tok/s, 0.3-0.5s TTFT) ✅
  2. DocAI batch KIE for structured JSON output (31 tok/s, 100% MTP acceptance) ✅
  3. Single RAG query up to 30K context (3.87s warm TTFT) ✅
  4. Multi-user chat with short prompts (4 users, ~16 tok/s/user, good UX) ✅
  5. Multi-user RAG with medium-length context (~57 tok/s aggregate) ⚠️ degraded UX
  6. Multi-user document analysis at 100K+ context ❌ not suitable

Item 6 could be relieved with a second Spark (TP=2 cluster, ~70 tok/s aggregate per the community), or by switching to a different quantization (Albond v2 hybrid INT4+FP8 + INT8 LM head, ~51 tok/s single-stream per the community, but with quality regression risk).

Behind the numbers: what did we actually measure?

One closing note on what these tables mean and don’t mean. The benchmark uses fixed prompts and fixed max_tokens, and the script cycles through the same 4-6 prompts. This setup specifically favors prefix caching — the 2.59s TTFT on Long-RAG-8K is measured with the same prompt sent four times. In a real-world DocAI use case where every RAG retrieval brings different chunks, TTFT will be closer to the cold first-run number (4.4-18 seconds).

Decode tok/s, on the other hand, is mostly prompt-independent (memory-bandwidth-bound throughput barely depends on what you generate, only how much). The ~25-31 tok/s single-stream range will look similar under real DocAI prod load.
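A rough sanity check on the bandwidth-bound claim (a back-of-envelope sketch, ignoring KV-cache reads, activations and MTP):

active_params = 10e9      # A10B: ~10 billion active parameters per decoded token
bytes_per_param = 0.5     # NVFP4 is ~4 bits per weight (scale factors ignored)
bandwidth = 273e9         # GB10 LPDDR5x, bytes per second

ceiling = bandwidth / (active_params * bytes_per_param)
print(f"weight-read ceiling: ~{ceiling:.0f} tok/s")   # ~55 tok/s; measured 25-31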

Another caveat: we measured with enable_thinking: false. Officially recommended for KIE tasks (Qwen3.5’s research card and community threads agree consistently). A “thinking on” measurement would tell a different story — a simple Q&A would balloon to 1500+ tokens (the model thinks to itself), and real wall-time would scale proportionally. DocAI doesn’t need this.

Conclusion: should we put it in DocAI?

Not yet. Two factors are missing for a final verdict.

First: a quality benchmark. The current production model is Qwen3.6-35B-A3B-FP8. The 122B-NVFP4 should in theory be smarter (more parameters), but NVFP4 quantization gives some of that back. How many percentage points more accurate is it on Hungarian invoice KIE? How much better is it on Hungarian legal Q&A? I’ll do this measurement in a future round, as part of the “DocAI Hungarian eval harness” already on my list. From the throughput angle:

Model                                 Single-stream JSON-KIE tok/s   Aggregate tok/s (4 users)
Qwen3.6-35B-A3B-FP8 (current prod)    ~50-60                         ~80-100
Qwen3.5-122B-A10B-NVFP4 (measured)    30.9                           63.86

The 122B is slower than the 35B-FP8 on this workload. Obviously: roughly twice as much memory has to be read per token. MTP and concurrency close part of the gap (100% acceptance on JSON, 2.1× aggregate scaling at 4 users), but it still falls short.

Second: the actual production usage pattern. Today’s DocAI is single-tenant — one Hungarian SME customer, max 3-5 concurrent users, mostly short chat + KIE batch + occasional RAG. For this, both 122B-NVFP4 and 35B-FP8 are sufficient. If the quality advantage is noticeable — e.g. Hungarian legal Q&A is markedly better on the 122B — then the speed loss is acceptable, since user latency is still tolerable (32K-context JSON-KIE under 10s). If the quality advantage isn’t noticeable, we’ll stay on 35B-FP8 because single-stream is faster and memory pressure is lower.

The Spark hardware side is confirmed by the measurements: single-stream 122B-NVFP4 is production-ready on a single Spark, multi-stream is fine for medium-length context, multi-stream at 100K+ context does not scale.

What else can we try

A few experiments left for a future round:

  • num_speculative_tokens=3 — the JSON-KIE 100% acceptance suggests the 2-token draft is trivially hit; worth trying 3. Estimated gain: +20-30% on structured workloads.
  • txn545/Qwen3.5-122B-A10B-NVFP4 control measurement — slightly different quantizer (NVIDIA Model Optimizer); the question is whether Hungarian quality changes.
  • Albond v2 hybrid INT4+FP8 — community report: 51 tok/s single-stream on a single Spark, but “Qwen3.5-35B Native FP8 + MTP is much better than Qwen3.5-122B int4-AutoRound” — quality regression risk.
  • Second Spark cluster (TP=2) — needed anyway for the 397B-MoE, and on the 122B the community sees ~70 tok/s aggregate. If the multi-user pattern picks up, this could be the answer. A budget question.
  • SGLang — the txn545 checkpoint includes SGLang support. On the same hardware no radical gain is expected (memory-bandwidth bound), but constrained-JSON output might be faster. Build cost isn’t small, so only if JSON-KIE speed becomes a hard bottleneck.

A few last numbers for completeness

After the 14-minute cold start, the Spark ran stable for 90+ minutes through the benchmark series with no OOMs and no restarts. nvidia-smi showed 11 GB GPU memory in use (the rest, ~76 GB, lives on the CPU side of unified memory). The container used 3.22% CPU at idle and the vLLM internal scheduler loop ran around 10-20% CPU under active inference.

Power draw on a single Spark, measured during the benchmark: 3-10W idle, ~150-180W full inference. The LPDDR5x memory is what heats up, not GPU compute (273 GB/s sustained random access is a lot at this chip size).

The full setup (build, model download, 4 measurement rounds, 3 restarts) took ~3 hours of decisions on my side, which means any DGX Spark owner could reproduce these numbers within 24 hours. The eugr/spark-vllm-docker repo, the Sehyo checkpoint, and the 8 vLLM flags above — that’s it.

Based on the numbers, 122B-NVFP4 on a single Spark technically works and is suitable for certain use cases. The product decision — whether it’s right for DocAI — depends on a quality eval, and I don’t have those numbers yet. But the speed is now measured, and we know exactly what we would be saying yes to.


System: NVIDIA DGX Spark, GB10 (SM 12.1), 128 GB LPDDR5x unified memory
Model: Sehyo/Qwen3.5-122B-A10B-NVFP4 (76 GB checkpoint, MTP weights included)
vLLM: 0.19.2rc1.dev154+g1c2c1eb8b (eugr/spark-vllm-docker prebuilt wheel, cu130-nightly base, 2026-04-25)
MoE backend: FLASHINFER_CUTLASS (autoselected on SM121)
Benchmark tooling: custom benchmark_concurrent.py + vllm bench serve

The full JSON result set, the docker run parameters and the concurrent benchmark script are available — if you’re interested in reproduction or detailed percentile distributions, get in touch.