Standing Up vLLM on a Single A10G: From First Boot to Dual-Model Deployment

I set out to stand up vLLM on a single AWS g5.4xlarge (A10G, 24 GB) using the Deep Learning OSS Nvidia Driver AMI (Ubuntu, PyTorch 2.7). The goal wasn’t just “it runs” but something I’d be comfortable calling production-ready (albeit small: GPU instances can be $1/hr), reproducible, observable, hardened, and easy to roll forward and back.

What began as a single-model deployment evolved into a debugging marathon, culminating in two quantized models (draft and target) running stably on a single GPU. Below is the condensed story.

Architecture Choices That Mattered

  • Containerized vLLM (vllm/vllm-openai:v0.10.1.1) instead of bare metal: the image pins CUDA/PyTorch and gives clean rollbacks.
  • NVIDIA Container Toolkit (1.17.8-1): enables Docker to utilize the GPU (--gpus all). The DLAMI has drivers, but containers still need the toolkit runtime.
  • systemd-supervised Docker services: auto-restart, single log stream (journalctl -u vllm-*), and health-gating before marking “ready” (see the unit sketch after this list).
  • Persistent Hugging Face cache (/opt/hf-cache): avoids multi-GB re-downloads on reboot.
  • Security: SG locked to my IP, IMDSv2 required, no public access to inference ports.
  • EBS gp3 (200 GB): sized for cache + logs, with Docker log rotation to prevent disk-fill.
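
A minimal sketch of that supervision pattern, written as a cloud-init-style heredoc. The unit name (vllm-draft.service), env-file path (/etc/vllm/draft.env), and variable names are placeholders of mine rather than the exact units from this build; there is one such unit per model server.

cat >/etc/systemd/system/vllm-draft.service <<'EOF'
[Unit]
Description=vLLM model server (OpenAI-compatible, Docker-supervised)
After=docker.service
Requires=docker.service

[Service]
EnvironmentFile=/etc/vllm/draft.env
ExecStartPre=-/usr/bin/docker rm -f vllm-draft
ExecStart=/usr/bin/docker run --name vllm-draft --gpus all --ipc=host --shm-size=8g \
  -p 8000:8000 -v /opt/hf-cache:/root/.cache/huggingface \
  vllm/vllm-openai:v0.10.1.1 \
  --model ${MODEL} --gpu-memory-utilization ${GPU_MEM_UTIL} \
  --max-model-len ${MAX_MODEL_LEN} $VLLM_FLAGS
# Health gate: startup only counts as complete once /health answers
ExecStartPost=/bin/sh -c 'until curl -sf http://127.0.0.1:8000/health; do sleep 5; done'
ExecStop=/usr/bin/docker stop vllm-draft
TimeoutStartSec=900
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now vllm-draft

systemd expands ${VAR} in ExecStart from the EnvironmentFile, and the unquoted $VLLM_FLAGS is word-split into individual flags, so the container gets its full argument list without a wrapper script.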

Terraform + Cloud-Init Gotchas

  • AMI lookup: I pulled the latest DLAMI via Terraform:
data "aws_ami" "deep_learning" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.7 *Ubuntu*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
  • Interpolation collisions: ${…} clashed between Terraform and Bash. Fix: escape as $$ so Terraform passes a literal ${…} through, or just drop the {} when I wanted Bash/systemd expansion (template snippet after this list).
  • Signed-By conflicts in APT: solved by following NVIDIA’s install docs verbatim. My RTFM moment.
  • User-data only runs once: after edits, I had to cloud-init clean --reboot or recreate the instance.
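
To make the escaping rule concrete, here is a hedged sketch of a user-data template rendered with Terraform’s templatefile(); the variable name hf_cache_dir is illustrative:

#!/bin/bash
# user-data.sh.tpl, rendered by templatefile() before it ever reaches the instance
mkdir -p "${hf_cache_dir}"                              # ${...}  -> Terraform fills this in at plan time
echo "HF_HOME=${hf_cache_dir}" >> /etc/environment
echo "provisioned as $${USER:-root} in $HOME" >> /var/log/provision.log
# $${...} -> literal ${...} for Bash at boot; bare $HOME (no braces) passes through Terraform untouched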

Docker Runtime Pitfalls

  • Overwriting daemon.json: fixed with an atomic merge via jq instead of a naive tee (sketch after this list).
  • CUDA mismatch (exit 125): sanity-check with nvidia-smi on the host and inside the container. The DLAMI driver must support the CUDA runtime expected by vllm/vllm-openai:v0.10.1.1.
  • Shared memory crash: solved with --ipc=host --shm-size=8g plus raised ulimits.
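
The jq merge is worth showing; a minimal sketch, assuming the goal is to layer log rotation on top of whatever the NVIDIA toolkit already put in /etc/docker/daemon.json (the exact keys in my config differ):

CFG=/etc/docker/daemon.json
[ -s "$CFG" ] || echo '{}' > "$CFG"              # start from an empty object if the file is missing
TMP=$(mktemp "$CFG.XXXXXX")                      # same filesystem, so the rename below is atomic
jq --argjson extra '{"log-driver":"json-file","log-opts":{"max-size":"100m","max-file":"3"}}' \
   '. * $extra' "$CFG" > "$TMP" && mv "$TMP" "$CFG"
systemctl restart docker

Writing to a temp file and renaming means Docker never sees a half-written config, and the recursive merge (*) keeps whatever runtime entries the toolkit installer wrote.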

Verification commands that saved me:

docker info | grep 'Default Runtime'
docker run --rm --gpus all vllm/vllm-openai:v0.10.1.1 nvidia-smi

vLLM Runtime Tuning for A10G (24 GB)

At first I focused on a single model (Mistral-7B GPTQ), with conservative settings:

--gpu-memory-utilization 0.90
--max-model-len 3072
--max-num-seqs 12
--enable-chunked-prefill --swap-space 4
--enforce-eager

Rule of thumb: context↑ ⇒ concurrency↓. The KV-cache drives VRAM use, as do parameter count and quantization, so I trade off between --max-model-len, --max-num-seqs, and the memory cap.
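
A back-of-the-envelope check of that trade-off, using Mistral-7B-v0.3’s published shape (32 layers, 8 KV heads, head dim 128) and assuming the KV-cache stays in fp16 even though the weights are GPTQ-quantized:

PER_TOKEN=$((2 * 32 * 8 * 128 * 2))   # K and V x layers x kv_heads x head_dim x fp16 bytes = 131072 B/token
TOKENS=$((3072 * 12))                 # --max-model-len x --max-num-seqs
echo "$((PER_TOKEN * TOKENS / 1024 / 1024)) MiB of KV-cache at the caps"   # ~4608 MiB

vLLM pre-allocates the KV-cache out of the --gpu-memory-utilization budget rather than growing it lazily, but the arithmetic shows how quickly context × concurrency eats VRAM, and why the same caps stop being comfortable once a second model has to share the card.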

... tuning is hard.

Scaling Up: Draft + Target on One GPU

I'm implementing speculative decoding on the client side, for pedagogical purposes. That needs a fast draft model and a heavier target model running together.

  • Draft: RedHatAI/Mistral-7B-Instruct-v0.3-GPTQ-4bit
  • Target: RedHatAI/granite-3.1-8b-instruct-quantized.w8a8

** This is working... but flat out wrong for speculative decoding: the model families are different (so the tokenizers don't line up) and the parameter sizes are likely too close (so there's little speed to gain). **

The Memory Problem

The A10G has only 24 GB of VRAM, and the rough budget after this list shows how little slack that leaves:

  • Mistral-GPTQ ≈ 4–6 GB
  • Granite-W8A8 ≈ 10–12 GB
  • KV-cache ≈ unpredictable without caps
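
A quick sum of those estimates (midpoints, my own rounding):

WEIGHTS=$(( (4+6)/2 + (10+12)/2 ))   # ~16 GB of weights before any cache
echo "$((24 - WEIGHTS)) GB left for two KV-caches, CUDA contexts, and fragmentation"   # ~8 GB

Left uncapped, either server's KV-cache would happily claim that remainder and starve the other, which is what the fix below is about.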

The Fix

Cap each server’s VRAM and bound KV-cache growth. Final envs looked like this:

Draft @ :8000

GPU_MEM_UTIL=0.42
MAX_MODEL_LEN=1792
VLLM_FLAGS="--enable-chunked-prefill --swap-space 8 --max-num-seqs 12 --enforce-eager"

Target @ :8001

GPU_MEM_UTIL=0.42
MAX_MODEL_LEN=2048
VLLM_FLAGS="--enable-chunked-prefill --swap-space 8 --max-num-seqs 8 --enforce-eager"

Key points:

  • Both draft and target capped at ~42% of VRAM, leaving headroom (see the check after this list).
  • I did not need --trust-remote-code or --max-num-batched-tokens — both models loaded fine with just the base flags.
  • --swap-space 8 lets vLLM spill buffers into CPU RAM (64 GB available).
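
A quick way to confirm the split actually holds once both services are up (assuming nothing else is on the GPU):

# Per-process VRAM as the driver reports it
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
# Expect two vLLM server processes at roughly 0.42 x 24 GB ~= 10 GB each, totalling under 22 GB under load

As I understand it, --gpu-memory-utilization is a fraction of the card's total memory for each vLLM process, not of whatever happens to be free, so 0.42 + 0.42 is a deliberate ~0.84 with the remainder left for CUDA contexts and fragmentation.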

Networking & Ops

  • Health endpoints wired into systemd wait loops (first boot can take several minutes while weights download).
  • Observability: journalctl -u vllm-*, docker logs, and a /root/test_vllm.sh smoke test (sketched after this list).
  • Debugging flow: cloud-init logs → docker GPU test → systemd status → container logs → foreground run.
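
What that smoke test can look like, as a hedged sketch (the real /root/test_vllm.sh may differ) against the standard endpoints the vllm-openai image exposes:

#!/bin/bash
# Health check plus a one-token completion against both servers.
set -euo pipefail
for PORT in 8000 8001; do
  curl -sf "http://127.0.0.1:${PORT}/health"
  MODEL=$(curl -sf "http://127.0.0.1:${PORT}/v1/models" | jq -r '.data[0].id')
  curl -sf "http://127.0.0.1:${PORT}/v1/completions" \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"${MODEL}\",\"prompt\":\"ping\",\"max_tokens\":1}" \
    | jq -c '.choices[0].text'
  echo ":${PORT} OK"
done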

The Final Steady State

  • Two vLLM services (draft + target) running concurrently on one A10G.
  • Stable memory use (< 22 GB under load, no OOMs).
  • Production-toy hardening: pinned image tag (vllm/vllm-openai:v0.10.1.1), atomic config merges, log rotation, systemd auto-restart, locked-down ports.
  • Outcome: both /health endpoints return {"status":"ok"}, and speculative decoding works.

Takeaway

You can run two quantized LLMs on a single mid-tier GPU if you:

  1. Containerize and pin dependencies,
  2. Supervise with systemd,
  3. Persist Hugging Face cache,
  4. Bound KV-cache growth with --max-model-len + --max-num-seqs, and
  5. Leave headroom in VRAM.

It’s a production-ready toy: stable, reproducible, observable, and hardened — exactly what I wanted...

but LOL.. the model choice is so wrong... next article is about model choice.