Speculative Decoding and the Model Choice: Lessons

In my last post, I described how I got two quantized LLMs (draft + target) running side-by-side on a single AWS g5.4xlarge (A10G, 24 GB). But there was a problem: the models themselves were a bad pairing. This follow-up is about that journey: what I picked at first, why it didn’t make sense, and what I learned along the way.
Where I Started
For speculative decoding, you need:
- A draft model: cheap and fast, proposes bursts of tokens.
- A target model: heavier, higher-fidelity; it verifies and corrects the draft's proposals (a toy sketch of the loop follows this list).
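Before getting into the model choice, here is the loop in miniature. This is a toy sketch, not tied to any real model: the vocabulary size, the distributions, and K=4 are all made up, and real drafts/targets condition on context while these don't. It just shows the propose/verify/accept-or-resample mechanic.

```python
# Toy speculative decoding round: draft proposes K tokens, target verifies.
# The "models" here are fixed next-token distributions over a tiny vocabulary.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16   # toy vocabulary size
K = 4        # tokens the draft proposes per round

# Stand-in distributions: the draft (q) is a cheap, imperfect copy of the target (p).
p_dist = rng.dirichlet(np.ones(VOCAB))
q_dist = 0.7 * p_dist + 0.3 * rng.dirichlet(np.ones(VOCAB))

def speculative_round():
    """Draft proposes K tokens; target accepts a prefix and patches the first miss."""
    proposed = [int(rng.choice(VOCAB, p=q_dist)) for _ in range(K)]
    accepted = []
    for tok in proposed:
        # Accept a draft token with probability min(1, p/q); this keeps the
        # output distributed exactly as if the target had sampled it itself.
        if rng.random() < min(1.0, p_dist[tok] / q_dist[tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the normalized residual max(p - q, 0)
            # and discard the rest of the draft's proposal.
            residual = np.maximum(p_dist - q_dist, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            break
    # (The full algorithm also samples one bonus token from the target when
    # every draft token is accepted; omitted to keep the sketch short.)
    return accepted

tokens = []
for _ in range(5):
    tokens += speculative_round()
print("generated token ids:", tokens)
```

The win comes from the target verifying K proposals in one forward pass instead of generating K tokens one at a time; the closer q is to p, the longer the accepted prefixes.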
I thought, 'Why not just grab two quantized models from Hugging Face?'
So, after some quick Googling and ChatGPT'ing, I stood up:
- Draft: RedHatAI/Mistral-7B-Instruct-v0.3-GPTQ-4bit
- Target: RedHatAI/granite-3.1-8b-instruct-quantized.w8a8
On paper, it looked fine: both were quantized, both fit under 24 GB VRAM, and both came up cleanly under vLLM.
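For reference, that pairing looks roughly like this through the vLLM Python API. This is a sketch, not the actual deployment from the last post (which ran the two engines as separate services); the memory split and context length below are illustrative guesses, not tuned values.

```python
# Sketch of the original (mismatched) pairing on one 24 GB A10G.
# vLLM picks up each checkpoint's quantization scheme from its config;
# the gpu_memory_utilization split is a guess, not a tuned value.
from vllm import LLM, SamplingParams

draft = LLM(
    model="RedHatAI/Mistral-7B-Instruct-v0.3-GPTQ-4bit",
    gpu_memory_utilization=0.35,   # leave headroom for the target + KV cache
    max_model_len=4096,
)
target = LLM(
    model="RedHatAI/granite-3.1-8b-instruct-quantized.w8a8",
    gpu_memory_utilization=0.55,
    max_model_len=4096,
)

params = SamplingParams(temperature=0.0, max_tokens=64)
print(draft.generate(["Explain KV caching in one sentence."], params)[0].outputs[0].text)
```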
The Realization: Wrong Families, Wrong Gap
It only clicked after I stepped through my first naive spec-decode implementation and watched it accept zero draft tokens during verification (this is probably ML 101, but I'm getting there):
- Different families. Mistral and Granite don’t share tokenizers, training data, or instruction-tuning lineage. Their probability distributions diverge, which wrecks speculative decoding acceptance (a quick tokenizer check is sketched after this list).
- Too close in size. A 7B draft vs. an 8B target leaves little "quality gap." The draft costs almost as much to run as the target, and the target isn’t meaningfully better at prediction, so you don’t save much latency and you don’t get a real uplift in output quality.
- Memory pressure. Both are mid-sized models. Add KV-cache growth on a 24 GB GPU, and you’re in a constant trade-off between context length and concurrency.
In other words, the deployment was technically stable but architecturally wrong.
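One concrete way to see the first problem: tokenize the same string with both tokenizers and compare. A minimal check, assuming you can pull both repos from Hugging Face:

```python
# Do the draft and target even share a token space? (Repo IDs are the ones
# from the original pairing above; downloading may require HF authentication.)
from transformers import AutoTokenizer

draft_tok = AutoTokenizer.from_pretrained("RedHatAI/Mistral-7B-Instruct-v0.3-GPTQ-4bit")
target_tok = AutoTokenizer.from_pretrained("RedHatAI/granite-3.1-8b-instruct-quantized.w8a8")

text = "def fibonacci(n):"
print("vocab sizes:", len(draft_tok), "vs", len(target_tok))
print("draft ids: ", draft_tok.encode(text))
print("target ids:", target_tok.encode(text))
# If the vocabularies differ, the draft's proposed token IDs aren't even in the
# same space as the target's, so verification has nothing to line up against.
```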
The Fix: Stay in the Family
The lesson was non-obvious (to me) but straightforward: draft and target must be from the same model family.
Why?
- Same tokenizer → higher acceptance rates.
- Aligned instruction-tuning → fewer rejections.
- A real quantization or parameter gap → the speed/quality trade-off actually shows up.
For code-aware use cases (the one I'm working on... more on that later), that probably means:
- Draft = CodeLlama-7B-Python-GPTQ (4-bit, fast).
- Target = CodeLlama-13B-Python-GPTQ (4-bit, higher-fidelity), or the fp16 7B if VRAM is tight.
This way, the models "speak the same dialect" but at different speeds (a forward pass through 13B parameters simply takes longer than one through 7B).
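If you hand the speculation to vLLM itself instead of a hand-rolled loop, the same-family pairing would look something like the sketch below. The repo IDs (TheBloke's GPTQ builds) and the speculative_config keys are assumptions on my part; the speculative-decoding options have been renamed across vLLM releases, so check the docs for the version you're running.

```python
# Same-family draft/target pairing with vLLM driving the speculation.
# NOTE: the speculative_config keys and the TheBloke/* repo IDs are assumptions;
# older vLLM releases used flat arguments (e.g. speculative_model) instead.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/CodeLlama-13B-Python-GPTQ",        # target: verifies
    speculative_config={
        "model": "TheBloke/CodeLlama-7B-Python-GPTQ",  # draft: proposes
        "num_speculative_tokens": 5,
    },
    max_model_len=4096,
    gpu_memory_utilization=0.90,
)

out = llm.generate(
    ["# Python: parse a line of CSV into a list of fields\n"],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(out[0].outputs[0].text)
```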
What I Learned
- Speculative decoding. It only works when the draft and target distributions are closely aligned. (It also only really pays off inside the inference server itself; my hand-rolled loop was just to make sure I understand the algorithm.)
- Parameter gap matters. 3B → 8B, or 7B → 13B, gives you both speedup and quality uplift. 7B → 8B is just noise.
- Quantization is your friend. 4-bit GPTQ for the draft and 8-bit for the target is a practical way to pack two models into a 24 GB GPU (rough numbers sketched after this list).
- Pedagogy vs. production. It’s fine to start with “wrong” models to prove the deployment path, but the learning curve is steep when you actually care about latency, acceptance rate, and output quality.
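Back-of-the-envelope for the quantization point above. This counts weight memory only; KV cache, activations, and quantization metadata come on top, and the parameter counts are rounded.

```python
# Rough weight-only footprint: params * bits_per_weight / 8 bytes, shown in GiB.
# KV cache, activations, and quantization metadata are not included.
def weight_gib(params_billions: float, bits: float) -> float:
    return params_billions * 1e9 * bits / 8 / 2**30

configs = [
    ("7B draft   @ 4-bit", 7, 4),
    ("13B target @ 4-bit", 13, 4),
    ("13B target @ 8-bit", 13, 8),
]
for name, params_b, bits in configs:
    print(f"{name}: ~{weight_gib(params_b, bits):.1f} GiB")
# A 4-bit 7B (~3.3 GiB) next to a 4-bit 13B (~6.1 GiB) uses less than half
# of a 24 GB A10G, leaving the rest for KV cache and runtime overhead.
```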
Takeaway
Model choice is as integral to systems engineering as Terraform scripts and systemd services. With speculative decoding, the family match and the parameter gap drive whether you get a real speedup or just a science project.
The next iteration for me is straightforward: CodeLlama 7B draft and CodeLlama 13B target, both quantized, same family, real quality gap, and a much better fit for this journey.