
Our Local AI Coding Stack: Qwen3.6-35B on vLLM — the Real Numbers

21 June 2026 · jproxx

Our development team works with a language model that runs entirely on our own hardware — on our own network, with no external interface. We use it as a programming assistant: for writing, explaining and revising source code. This post describes the setup and shows real measurements from live operation.

Why local

There are three reasons for running it ourselves. First, the source code stays on our own network: nothing is sent to an external service. Second, we rely on a model with open weights that we can run, inspect and version ourselves, rather than on an interface that bills per request and can change at any time. Third, the costs are predictable: our own hardware instead of usage-based billing.

The stack

We run Qwen3.6-35B-A3B in its memory-efficient FP8 variant. It is a mixture-of-experts model in which only about 3 billion of the 35 billion parameters are active for each generated token. That delivers the quality of a large model at the speed and energy footprint of a much smaller one, and makes running it on a single GPU server economical.

The model is served through vLLM, an open-source inference server. Three building blocks keep the work fluid:

  • A 256,000-token context window — large enough to process whole source files or specifications in one go.
  • Speculative decoding with a lean draft model (the “DFlash” method): a small model proposes several tokens at once, and the large model confirms them in a single step — which saves compute steps.
  • Prefix caching: recurring parts of a request are served from the cache instead of being recomputed.
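The draft-and-verify loop from the second bullet can be sketched in a few lines. This is a minimal greedy illustration under stated assumptions, not the DFlash method and not vLLM's implementation: `speculative_step`, `toy_target` and `toy_draft` are hypothetical names, and the two toy "models" are simple functions standing in for real networks.

```python
from typing import Callable, List, Sequence

Token = int
NextToken = Callable[[Sequence[Token]], Token]


def speculative_step(
    context: List[Token],
    draft_next: NextToken,
    target_next: NextToken,
    k: int = 3,
) -> List[Token]:
    """Return the tokens emitted by one draft-and-verify step (greedy variant)."""
    # 1. The small draft model proposes k tokens, one after another.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # 2. The large model checks each proposed position. A real server scores
    #    all positions in one batched forward pass; the loop is for clarity.
    emitted: List[Token] = []
    for tok in proposal:
        expected = target_next(context + emitted)
        if tok != expected:
            # First mismatch: keep the large model's token, drop the rest.
            emitted.append(expected)
            return emitted
        emitted.append(tok)

    # 3. Every draft token was accepted; the same pass yields one extra token.
    emitted.append(target_next(context + emitted))
    return emitted


def toy_target(ctx: Sequence[Token]) -> Token:
    """Hypothetical 'large model': always counts upwards modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[Token]) -> Token:
    """Hypothetical 'draft model': agrees with the target except after a 5."""
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10
```

In this toy setup a step always emits at least one token (the large model's own) and at most `k + 1`, which is why accepted draft tokens translate directly into fewer large-model steps.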

The numbers from live operation

[Figure: two charts. Left: generation throughput over one real coding session (≈ 3 minutes), axis 0 to 40 tokens/s, averaging ≈ 30 tokens/s. Right: speculative-decoding acceptance rate per draft position: 64% at position 1, 47% at position 2, 37% at position 3. Summary figures: 93.5% prefix-cache hits, ≈ 2.5 tokens per model step.]

Real measurements from the vLLM server across one continuous session. Acceptance falls from one draft position to the next, as expected. The speed-up comes from the accepted draft tokens: each one is a token the large model did not need a separate step for.

Three observations from one continuous working session:

  • Throughput of around 30 tokens per second for an active request — faster than a person can read along, and so comfortable for interactive work.
  • Over 93 percent prefix-cache hit rate: recurring input tokens come from the cache rather than being recomputed.
  • Around 2.5 confirmed tokens per model step through speculative decoding. The acceptance rate falls across the draft positions (0.64 / 0.47 / 0.37), and the accepted draft tokens are what produce the speed-up: the large model's own token plus the accepted drafts add up to roughly 2.5 tokens per step.
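The figures above fit together arithmetically, and a short calculation makes that visible. The sketch below assumes the per-position acceptance rates are measured per draft (the share of drafts in which that position was accepted) and that all averages cover the same interval; the function names and the 10,000-token prompt are illustrative, not part of the measurements.

```python
from typing import Sequence


def tokens_per_step(acceptance_by_position: Sequence[float]) -> float:
    """One token from the large model itself plus the accepted draft tokens."""
    return 1.0 + sum(acceptance_by_position)


def steps_per_second(tokens_per_second: float, tokens_each_step: float) -> float:
    """Large-model steps per second implied by the measured throughput."""
    return tokens_per_second / tokens_each_step


def tokens_to_prefill(prompt_tokens: int, hit_rate: float) -> int:
    """Input tokens that still have to be computed after prefix-cache hits."""
    return round(prompt_tokens * (1.0 - hit_rate))


rates = [0.64, 0.47, 0.37]            # measured acceptance per draft position
mean_tokens = tokens_per_step(rates)  # 1 + 0.64 + 0.47 + 0.37 = 2.48

print(f"tokens per step:  {mean_tokens:.2f}")
print(f"steps per second: {steps_per_second(30.0, mean_tokens):.1f}")
print(f"tokens to prefill for a 10,000-token prompt: {tokens_to_prefill(10_000, 0.935)}")
```

Under these assumptions, roughly 12 large-model steps per second are enough for about 30 tokens per second, and a prompt matching the session-wide hit rate would need only about 650 of 10,000 input tokens computed fresh.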

Throughout, GPU KV-cache usage stayed at just 2 to 7 percent, so there is plenty of cache headroom to serve several requests in parallel. On a single GPU server, that is enough for a fully interactive programming assistant, without any data leaving our own network.
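As a rough back-of-the-envelope check on that headroom, the sketch below assumes the observed cache usage came from a single session and that further sessions would occupy a similar share. It looks at cache capacity only: compute is shared too, so per-request throughput would drop well before the cache fills. The function name is hypothetical.

```python
def max_similar_sessions(cache_usage_percent: int) -> int:
    """How many sessions of this size fit into the cache, ignoring compute."""
    if cache_usage_percent <= 0:
        raise ValueError("cache usage must be a positive percentage")
    return 100 // cache_usage_percent


# Observed range in the session: 2 to 7 percent.
print(max_similar_sessions(7))  # pessimistic case
print(max_similar_sessions(2))  # optimistic case
```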