Our Local AI Coding Stack: Qwen3.6-35B on vLLM — the Real Numbers
21 June 2026 · jproxx
Our development team works with a language model that runs entirely on our own hardware and our own network, without calling any external API. We use it as a programming assistant: for writing, explaining and revising source code. This post describes the setup and shows real measurements from live operation.
Why local
There are three reasons for running it ourselves. First, the source code stays on our own network: nothing is sent to an external service. Second, we rely on a model with open weights that we can run, inspect and version ourselves, rather than on a hosted API that bills per request and can change at any time. Third, the costs are predictable: our own hardware instead of usage-based billing.
The stack
We run Qwen3.6-35B-A3B in its memory-efficient FP8 variant. It is a mixture-of-experts model: of its 35 billion parameters, only about 3 billion are active for each generated token. That brings quality close to that of a large model at roughly the speed and energy footprint of a much smaller one, and makes running it on a single GPU server economical.
The model is served through vLLM, an open-source inference server. Three building blocks keep the work fluid:
- A 256,000-token context window — large enough to process whole source files or specifications in one go.
- Speculative decoding with a lean draft model (the “DFlash” method): a small model proposes several tokens at once, and the large model verifies them in a single forward pass. Every accepted draft token saves one sequential step of the large model.
- Prefix caching: recurring parts of a request are served from the cache instead of being recomputed.
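To make the draft-and-verify idea concrete, here is a minimal toy sketch of one speculative decoding step with greedy acceptance. It is not vLLM's or DFlash's implementation: the names `speculative_step`, `draft` and `target` are hypothetical, each "model" is just a function that returns the next token, and the draft loop runs sequentially for simplicity, whereas the draft model described above proposes its tokens at once.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in: a "model" maps a token sequence to the next token
# (greedy decoding). Real inference servers work on logits, not on functions.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(context: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens it emits."""
    # 1) The draft model proposes k tokens.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2) The target model checks every position. A real server does this in
    #    one batched forward pass; this loop only mimics the result.
    emitted: List[int] = []
    for i in range(k):
        expected = target(context + proposed[:i])
        if proposed[i] != expected:
            # First mismatch: keep the target's own token and stop here.
            emitted.append(expected)
            return emitted
        emitted.append(proposed[i])

    # 3) All k drafts were accepted: the same pass yields one bonus token.
    emitted.append(target(context + proposed))
    return emitted


if __name__ == "__main__":
    # Toy models: the target counts upwards modulo 10, the draft agrees
    # except right after a 3, where it guesses wrong.
    target = lambda seq: (seq[-1] + 1) % 10
    draft = lambda seq: 0 if seq[-1] == 3 else (seq[-1] + 1) % 10

    print(speculative_step([1], draft, target, k=3))  # prints [2, 3, 4]
    print(speculative_step([5], draft, target, k=3))  # prints [6, 7, 8, 9]
```

In both toy runs a single verification step emits more than one token, and the output is identical to what the target model would have produced on its own.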
The numbers from live operation
The measurements come from the vLLM server and cover one continuous working session. Three observations:
- Throughput of around 30 tokens per second for an active request — faster than a person can read along, and so comfortable for interactive work.
- A prefix-cache hit rate of over 93 percent: recurring input tokens are read from the cache instead of being recomputed.
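- Around 2.5 confirmed tokens per model step through speculative decoding. The acceptance rate falls across the three draft positions (0.64 / 0.47 / 0.37), as expected, because a later draft token only counts if the ones before it were accepted. The speed-up comes from the accepted tokens: each one saves a separate step of the large model.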
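The three figures above fit together, which a few lines of arithmetic can show. This is a sketch under two assumptions that the measurements do not state explicitly: each per-position acceptance rate is measured against all verification steps, and the throughput and acceptance figures describe the same request. The variable names are ours.

```python
# Figures taken from the observations above.
acceptance_by_position = [0.64, 0.47, 0.37]  # draft positions 1 to 3
tokens_per_second = 30.0                     # approximate throughput
prefix_cache_hit_rate = 0.93                 # lower bound ("over 93 percent")

# Every verification step emits one token from the large model itself; draft
# position i adds one more token in the share of steps where it is accepted.
# Assumption: each rate is relative to all steps, not only to steps in which
# the previous position was accepted.
tokens_per_step = 1.0 + sum(acceptance_by_position)

# Assumption: throughput and acceptance were measured on the same request.
steps_per_second = tokens_per_second / tokens_per_step

# Share of input tokens that still has to be computed despite cache hits.
recomputed_share = 1.0 - prefix_cache_hit_rate

print(f"tokens per step: {tokens_per_step:.2f}")                # 2.48
print(f"large-model steps per second: {steps_per_second:.1f}")  # 12.1
print(f"input tokens recomputed: under {recomputed_share:.0%}") # under 7%
```

Under these assumptions, 1 + 0.64 + 0.47 + 0.37 = 2.48 matches the reported figure of around 2.5 tokens per step, and the large model needs only about 12 sequential steps per second to deliver roughly 30 tokens per second.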
Throughout, GPU KV-cache usage stayed at just 2 to 7 percent, so there is plenty of memory headroom to serve several requests in parallel. On a single GPU server, that is enough for a fully interactive programming assistant, without any data leaving our own network.
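A rough way to put a number on that headroom is shown below. It assumes that the observed 2 to 7 percent range is the cache footprint of a single active request, which the measurements do not confirm, and it looks at cache memory only, not at compute. The helper `max_parallel_requests` is hypothetical and not part of vLLM.

```python
import math


def max_parallel_requests(cache_share_per_request: float,
                          reserve: float = 0.0) -> int:
    """Estimate how many equally sized requests fit into the KV cache.

    cache_share_per_request: fraction of the cache one request occupies.
    reserve: fraction of the cache deliberately kept free.
    """
    if not 0.0 < cache_share_per_request <= 1.0:
        raise ValueError("cache_share_per_request must be in (0, 1]")
    if not 0.0 <= reserve < 1.0:
        raise ValueError("reserve must be in [0, 1)")
    usable = 1.0 - reserve
    # The small epsilon guards against floating-point results such as 49.999...
    return math.floor(usable / cache_share_per_request + 1e-9)


if __name__ == "__main__":
    # Assumption: one request occupies between 2 and 7 percent of the cache.
    print(max_parallel_requests(0.07))  # prints 14
    print(max_parallel_requests(0.02))  # prints 50
```

On these assumptions the cache alone would hold somewhere between 14 and 50 requests of similar size. How fast each of them would run in parallel is a separate question that this estimate does not answer.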