Standing Up vLLM on a Single A10G: From First Boot to Dual-Model Deployment

I set out to stand up vLLM on a single AWS g5.4xlarge (A10G, 24 GB) using the Deep Learning OSS Nvidia Driver AMI (Ubuntu, PyTorch 2.7). The goal wasn’t just "it runs" but something I’d be comfortable calling production-ready (albeit small; GPU instances run around $1/hr): reproducible, observable, hardened, and easy to roll forward and back.
What began as a single-model deployment turned into a debugging marathon, culminating in two quantized models (draft and target) running stably on a single GPU. Below is the condensed story.
Architecture Choices That Mattered
- Containerized vLLM (`vllm/vllm-openai:v0.10.1.1`) instead of bare metal: containers pin CUDA/PyTorch and give clean rollbacks.
- NVIDIA Container Toolkit (1.17.8-1): lets Docker hand the GPU to containers (`--gpus all`). The DLAMI ships the driver, but containers still need the toolkit runtime.
- systemd-supervised Docker services: auto-restart, a single log stream (`journalctl -u vllm-*`), and health-gating before marking a service “ready” (a minimal unit sketch follows this list).
- Persistent Hugging Face cache (`/opt/hf-cache`): avoids multi-GB re-downloads on reboot.
- Security: security group locked to my IP, IMDSv2 required, no public access to inference ports.
- EBS gp3 (200 GB): sized for cache + logs, with Docker log rotation to prevent disk-fill.
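
For the systemd piece, here is a minimal sketch of one such unit, installed from user-data via a heredoc. The unit name, restart policy, and inline health gate are illustrative (not my exact files), and the model and flags are the ones settled on later in this post:

```bash
# Sketch: a systemd unit supervising the dockerized draft server (illustrative names/values).
sudo tee /etc/systemd/system/vllm-draft.service >/dev/null <<'EOF'
[Unit]
Description=vLLM draft model (OpenAI-compatible server)
After=docker.service
Requires=docker.service

[Service]
Restart=always
RestartSec=10
TimeoutStartSec=900
ExecStartPre=-/usr/bin/docker rm -f vllm-draft
ExecStart=/usr/bin/docker run --name vllm-draft --gpus all --ipc=host --shm-size=8g \
  -p 8000:8000 -v /opt/hf-cache:/root/.cache/huggingface \
  vllm/vllm-openai:v0.10.1.1 \
  --model RedHatAI/Mistral-7B-Instruct-v0.3-GPTQ-4bit --port 8000 \
  --gpu-memory-utilization 0.42 --max-model-len 1792
# Health gate: the start job only completes once /health answers.
ExecStartPost=/bin/bash -c 'until curl -fsS http://127.0.0.1:8000/health; do sleep 5; done'
ExecStop=/usr/bin/docker stop vllm-draft

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now vllm-draft
```

The generous `TimeoutStartSec` matters because the first start may be downloading weights; the health gate keeps dependent units (and my own sanity) from treating a still-loading server as ready.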
Terraform + Cloud-Init Gotchas
- AMI lookup: I pulled the latest DLAMI via Terraform:
```terraform
data "aws_ami" "deep_learning" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.7 *Ubuntu*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
```

- Interpolation collisions: `${…}` clashed between Terraform and Bash. Fix: escape with `$$`, or just drop the `{}` when I wanted Bash/systemd expansion.
- Signed-By conflicts in APT: solved by following NVIDIA’s install docs verbatim. My RTFM moment.
- User-data only runs once: after edits, I had to `cloud-init clean --reboot` or recreate the instance.
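
For that last gotcha, the standard cloud-init tooling is enough to force a re-run and confirm it happened:

```bash
# Wipe cloud-init state so user-data runs again on the next boot, then reboot.
sudo cloud-init clean --logs --reboot
# After the reboot: did it actually run, and what did it print?
cloud-init status --long
sudo tail -n 100 /var/log/cloud-init-output.log
```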
Docker Runtime Pitfalls
- Overwriting `daemon.json`: fixed with an atomic merge via `jq` instead of a naive `tee` (sketch just below).
- CUDA mismatch (exit 125): sanity-check with `nvidia-smi` on the host and inside the container. The DLAMI driver must support the CUDA runtime expected by `vllm/vllm-openai:v0.10.1.1`.
- Shared-memory crash: solved by `--ipc=host --shm-size=8g` plus raised ulimits.
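
A sketch of that merge, assuming `daemon.json` already exists; the keys shown are illustrative, the point is the merge-then-atomic-swap pattern:

```bash
# Merge new keys into /etc/docker/daemon.json instead of clobbering it.
cat > /tmp/daemon-patch.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "50m", "max-file": "3" }
}
EOF
sudo jq -s '.[0] * .[1]' /etc/docker/daemon.json /tmp/daemon-patch.json \
  | sudo tee /etc/docker/daemon.json.new >/dev/null
sudo mv /etc/docker/daemon.json.new /etc/docker/daemon.json   # atomic rename in the same directory
sudo systemctl restart docker
```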
Verification commands that saved me:
```bash
docker info | grep 'Default Runtime'
docker run --rm --gpus all vllm/vllm-openai:v0.10.1.1 nvidia-smi
```
vLLM Runtime Tuning for A10G (24 GB)
At first I focused on a single model (Mistral-7B GPTQ), with conservative settings:
```bash
--gpu-memory-utilization 0.90
--max-model-len 3072
--max-num-seqs 12
--enable-chunked-prefill --swap-space 4
--enforce-eager
```

Rule of thumb: context ↑ ⇒ concurrency ↓. The KV-cache drives VRAM (on top of parameter count and quantization), so I trade between `max-model-len`, `max-num-seqs`, and the memory cap.
... tuning is hard.
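
Wired into an actual launch, those flags go after the image name. This is a reconstruction following vLLM's documented docker-run pattern (with the HF cache mount and shared-memory settings from above), not a copy of my service file:

```bash
# Single-model launch with the conservative flags above (illustrative reconstruction).
docker run --rm --gpus all --ipc=host --shm-size=8g \
  -p 8000:8000 \
  -v /opt/hf-cache:/root/.cache/huggingface \
  vllm/vllm-openai:v0.10.1.1 \
  --model RedHatAI/Mistral-7B-Instruct-v0.3-GPTQ-4bit \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 3072 \
  --max-num-seqs 12 \
  --enable-chunked-prefill --swap-space 4 \
  --enforce-eager
```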
Scaling Up: Draft + Target on One GPU
I'm implementing speculative decoding on the client side, for pedagogical purposes, which means I need a fast draft model and a heavier target model running together.
- Draft: `RedHatAI/Mistral-7B-Instruct-v0.3-GPTQ-4bit`
- Target: `RedHatAI/granite-3.1-8b-instruct-quantized.w8a8`

**This works... but it is flat-out wrong: the model families are different, and the parameter sizes are too close.**
The Memory Problem
The A10G only has 24 GB VRAM:
- Mistral-GPTQ ≈ 4–6 GB
- Granite-W8A8 ≈ 10–12 GB
- KV-cache ≈ unpredictable without caps
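
To make that last bullet concrete, here is the back-of-envelope I use for the draft server, assuming Mistral-7B-style attention geometry (32 layers, 8 KV heads under GQA, head_dim 128, fp16 KV). These are my estimates, not numbers reported by vLLM:

```bash
# Worst-case KV-cache for the draft server under the caps chosen below.
LAYERS=32; KV_HEADS=8; HEAD_DIM=128; BYTES_PER_VAL=2             # assumed Mistral-7B-like geometry, fp16 KV
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VAL))  # K and V: 131072 B ~= 128 KiB per token
TOTAL=$((PER_TOKEN * 1792 * 12))                                 # max-model-len * max-num-seqs
echo "~$((TOTAL / 1024 / 1024)) MiB of KV cache at full occupancy"   # ~= 2688 MiB
```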
The Fix
Cap each server’s VRAM and bound KV-cache growth. Final envs looked like this:
```bash
# Draft @ :8000
GPU_MEM_UTIL=0.42
MAX_MODEL_LEN=1792
VLLM_FLAGS="--enable-chunked-prefill --swap-space 8 --max-num-seqs 12 --enforce-eager"

# Target @ :8001
GPU_MEM_UTIL=0.42
MAX_MODEL_LEN=2048
VLLM_FLAGS="--enable-chunked-prefill --swap-space 8 --max-num-seqs 8 --enforce-eager"
```

Key points:
- Both draft and target are capped at ~42% of VRAM, leaving headroom.
- I did not need `--trust-remote-code` or `--max-num-batched-tokens`; both models loaded fine with just the base flags.
- `--swap-space 8` lets vLLM spill buffers into CPU RAM (64 GB available).
Networking & Ops
- Health endpoints wired into systemd wait loops (first boot can take several minutes because of weight downloads).
- Observability: `journalctl -u vllm-*`, `docker logs`, and a `/root/test_vllm.sh` smoke test (sketch below).
- Debugging flow: cloud-init logs → docker GPU test → systemd status → container logs → foreground run.
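
The smoke test itself is nothing fancy; a sketch of what `/root/test_vllm.sh` amounts to (the prompt and the jq post-processing are illustrative):

```bash
#!/usr/bin/env bash
# Smoke test: both servers answer /health and can actually generate a few tokens.
set -euo pipefail
for port in 8000 8001; do
  curl -fsS "http://127.0.0.1:${port}/health" >/dev/null
  model=$(curl -fsS "http://127.0.0.1:${port}/v1/models" | jq -r '.data[0].id')
  curl -fsS "http://127.0.0.1:${port}/v1/completions" \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"${model}\", \"prompt\": \"ping\", \"max_tokens\": 8}" \
    | jq -r '.choices[0].text'
  echo "vLLM on :${port} OK (${model})"
done
```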
The Final Steady State
- Two vLLM services (draft + target) running concurrently on one A10G.
- Stable memory use (< 22 GB under load, no OOMs).
- Production-toy hardening: pinned image tag (`vllm/vllm-openai:v0.10.1.1`), atomic config merges, log rotation, systemd auto-restart, locked-down ports.
- Outcome: both `/health` endpoints return `{"status":"ok"}`, and speculative decoding works.
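
The "< 22 GB under load" figure comes from watching the card while the smoke test loops, for example:

```bash
# Total and per-process VRAM while both servers take traffic.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
```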
Takeaway
You can run two quantized LLMs on a single mid-tier GPU if you:
- Containerize and pin dependencies,
- Supervise with systemd,
- Persist Hugging Face cache,
- Bound KV-cache growth with `max-model-len` + `max-num-seqs`, and
- Leave headroom in VRAM.
It’s a production-ready toy: stable, reproducible, observable, and hardened. Exactly what I wanted...
but LOL, the model choice is so wrong. The next article is about model choice.