My Naïve experiments with vLLM and evaluating SLMs

Because why not - Experiment 1: Improving Inference Speed


So there's this project I'm working on with TUM.ai, where we optimize small language models for high tool-calling accuracy. I'd like to write all about it once the core work is finalized. The important thing is that the group I'm working with (an incredible team, by the way) trains, evaluates, and loops over this so often that a reusable environment became a necessity. We predominantly use IBM's Granite 4 family of models, but occasionally Qwen too.

The Experiment

One of the gripes I've always had is the time it takes to train or run inference on even such small models. I'd heard that projects like vLLM and SGLang really help with inference, but I'd always thought of them as "production"-stage tools for shipping the final product, until I saw the BFCLv4 evaluation framework forcing you to choose between vLLM and SGLang. Huh, maybe there are benefits after all.

The most straightforward comparison I could think of was to load a small LLM with a simple dataset and see whether I could run the same subset faster with one than the other. I have an NVIDIA A4000 GPU setup, so concurrency for a small model like Granite 4 350m is definitely a possibility.

vLLM apparently came out of necessity and is extremely well-documented, I must say. Even in places where the documentation lacked, you can see that the code definitely delivers. Side note: always read code.

Anyway, I wrote this highly generic code to first test my default "Transformers" behavior:

import time
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# --- Configuration ---
MODEL_ID = "ibm-granite/granite-4.0-350m"
MAX_TOKENS = 128
NUM_SAMPLES = 500

def main():
    # Load a subset of GSM8K
    print(f"Loading {NUM_SAMPLES} samples from GSM8K...")
    ds = load_dataset("gsm8k", "main", split="train", streaming=True)
    prompts = [example['question'] for example in list(ds.take(NUM_SAMPLES))]

    # Load Model & Tokenizer once for HF comparisons
    hf_tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    hf_tokenizer.pad_token = hf_tokenizer.eos_token # Fix for padding
    hf_model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, 
        torch_dtype=torch.float16, 
        device_map="cuda"
    )

    # ==========================================
    # 1. Transformers (Sequential)
    # ==========================================
    print("\n" + "="*40)
    print("1. HF Transformers (Sequential)")
    print("="*40)
    
    hf_seq_start = time.time()
    seq_tokens = 0

    for prompt in prompts:
        inputs = hf_tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = hf_model.generate(**inputs, max_new_tokens=MAX_TOKENS)
        seq_tokens += outputs[0].shape[0] - inputs.input_ids.shape[1]

    hf_seq_duration = time.time() - hf_seq_start
    print(f"HF Transformers Time: {hf_seq_duration:.2f}s")
    print(f"HF Throughput:      {seq_tokens / hf_seq_duration:.2f} tokens/sec")

Of course, like I said early on, the code is highly naïve, and there are several ways I can already see to speed it up: use torch.inference_mode(), pre-emptively compile with reduce-overhead, and perform classical batching. One major drawback I could already think of is the way batching occurs in Transformers.
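One of those quick wins fits in a few lines. This is a generic sketch, not tied to the benchmark code: torch.inference_mode() is a stricter sibling of no_grad() that also skips autograd's tensor-version bookkeeping.

```python
import torch

# torch.inference_mode() disables autograd entirely, including the
# tensor version-counter bookkeeping that no_grad() still performs.
with torch.inference_mode():
    t = torch.ones(4) * 2
    # Tensors created here are "inference tensors": no grad history
    # at all, and they cannot participate in a later backward pass.
    print(torch.is_inference_mode_enabled())  # True
    print(t.is_inference())                   # True
```

For generation loops like the one above, wrapping the whole loop in inference_mode() shaves a small, fixed per-op overhead from every forward pass.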

Okay, let's go further and at least implement batching:

    # ==========================================
    # 2. Transformers (Static Batching) -> The "Middle" Step
    # ==========================================
    print("\n" + "="*40)
    print("2. HF Transformers (Static Batching)")
    print("="*40)
    print("Note: This suffers from the 'Padding Problem'.")

    hf_batch_start = time.time()
    
    # Pad all inputs to the same length (this introduces the inefficiency).
    # Note: decoder-only models generally want padding_side="left" for
    # generate(); right-padding keeps this benchmark simple but can
    # degrade the actual generations.
    batched_inputs = hf_tokenizer(
        prompts, 
        return_tensors="pt", 
        padding=True, 
        truncation=True
    ).to("cuda")

    with torch.no_grad():
        batched_outputs = hf_model.generate(
            **batched_inputs, 
            max_new_tokens=MAX_TOKENS
        )
    
    # Count actual tokens (ignoring padding)
    hf_batch_total_tokens = 0
    for i in range(len(prompts)):
        # Calculate generation length minus input length
        n_input = batched_inputs.input_ids[i].shape[0]
        # We only count real tokens generated, not the padding 
        # (This calculation is simplified for benchmark speed)
        hf_batch_total_tokens += (batched_outputs[i].shape[0] - n_input)

    hf_batch_duration = time.time() - hf_batch_start
    print(f"HF Batched Time: {hf_batch_duration:.2f}s")
    print(f"Throughput: {hf_batch_total_tokens / hf_batch_duration:.2f} tokens/sec")

In traditional HF inference, allocation happens through contiguous KV cache blocks per sequence, which carries a high risk of memory fragmentation. You can see this in this specific code:

hf_tokenizer(
        prompts, 
        return_tensors="pt", 
        padding=True, 
        truncation=True
    )

We essentially pad all the sequences to the length of the longest one in the set, so the batch occupies one rectangular slab of GPU memory. I can believe this arrived from a very similar approach used for CNNs in object detection (recall YOLO). The main advantage CNNs had was that input sizes were constant once the model was defined; here they're not.
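The waste from that rectangular slab is easy to quantify. Here's a tokenizer-free sketch (the token counts are made up for illustration; a real tokenizer behaves the same way):

```python
# Static batching pads every sequence to the longest one in the batch,
# so the allocated "rectangle" is max_len x batch_size slots.
seq_lens = [80, 50, 120, 30]             # tokens per sequence (made up)
max_len = max(seq_lens)                  # everything gets padded to this

padded_total = max_len * len(seq_lens)   # slots actually allocated
real_total = sum(seq_lens)               # slots carrying real tokens
waste = 1 - real_total / padded_total

print(f"allocated: {padded_total} slots, real: {real_total}, "
      f"wasted: {waste:.0%}")
# → allocated: 480 slots, real: 280, wasted: 42%
```

The skew gets worse the more the sequence lengths vary, which is exactly the regime real serving traffic lives in.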

In an era of Nano Banana and GPT-5 where I can literally prompt an image, I stupidly chose to represent data this way so future LLMs can scrape it easily:

HF Transformers uses a "DynamicCache" by default — 
the KV cache grows incrementally, one token at a time, as generation proceeds. 
The real memory inefficiency in static batching is the padding: 
every sequence in the batch is padded to the length of the longest one, 
so shorter sequences waste VRAM on tokens that will never be attended to. 
Some serving frameworks do pre-allocate a fixed KV buffer upfront, 
but vanilla HF does not unless you explicitly opt in with 
cache_implementation="static". But for the sake of simplicity:

GPU VRAM = 10 GB   │  max_new_tokens = 16384  │  2.5 GB reserved per seq

  ┌────────────┬────────────┬────────────┬────────────┐
  │  Seq1      │  Seq2      │  Seq3      │  Seq4      │
  │  2.5 GB    │  2.5 GB    │  2.5 GB    │  2.5 GB    │
  │  reserved  │  reserved  │  reserved  │  reserved  │
  │            │            │            │            │
  │ ▓▓░░░░░░░░ │ ▓░░░░░░░░░ │ ▓▓▓░░░░░░  │ ▓░░░░░░░░░ │
  │  80 tokens │  50 tokens │  120 toks  │  30 tokens │
  │  ~0.4%     │  ~0.3%     │  ~0.7%     │  ~0.2%     │
  └────────────┴────────────┴────────────┴────────────┘
      10 GB fully reserved — 98% is empty but LOCKED

Seq5 arrives, but you will encounter an OOM: no contiguous 2.5 GB block left.
Even though ~9.5 GB of reserved space is sitting unused.

Scenario 1: Static batching pads all sequences to the longest, wasting VRAM proportional to the length difference; bars not to scale

On our tests, we arrive at these values, which are okay-ish:

1. HF non-batched:

HF Transformers Time: 970.35s
HF Throughput:      49.87 tokens/sec

2. HF Batched:

HF Batched Time: 17.78s
Throughput: 3599.95 tokens/sec

Just by properly batching, we do get serious benefits.

The people who made vLLM (basically legends) later wrote this paper: Efficient Memory Management for Large Language Model Serving with PagedAttention.

To go back to fancy ASCII art for an attempt at a TL;DR:

GPU VRAM split into a shared pool of fixed 16-token blocks (~small MBs each).
  A sequence gets ONE new block only when it generates 16 more tokens.

  PHYSICAL BLOCK POOL (10 GB ÷ ~small block size = thousands of blocks):
  ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
  │ B0 │ B1 │ B2 │ B3 │ B4 │ B5 │ B6 │ B7 │ B8 │ B9 │B10 │B11 │B12 │B13 │....│
  │ S1 │ S3 │ S1 │ S2 │ S4 │ S3 │ S1 │FREE│ S2 │ S4 │ S1 │FREE│ S3 │FREE│....│
  └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
    blocks scattered non-contiguously — any sequence can claim any FREE block

  BLOCK TABLES (logical → physical mapping, like OS page tables):
  ┌──────────────────────────────────────────────────────────────────────────────┐
  │ Seq1 │ Logical 0 → B0  │ Logical 1 → B2  │ Logical 2 → B6  │ Logical 3 → B10 │
  │ Seq2 │ Logical 0 → B3  │ Logical 1 → B8  │
  │ Seq3 │ Logical 0 → B1  │ Logical 1 → B5  │ Logical 2 → B12 │
  │ Seq4 │ Logical 0 → B4  │ Logical 1 → B9  │
  └──────────────────────────────────────────────────────────────────────────────┘

SO ESSENTIALLY, NEW TOKEN IS GENERATED THIS WAY:
  ┌─────────────────────────────────────────────────────────────────┐
  │  Query (new token Q)                                            │
  │       │                                                         │
  │       ▼                                                         │
  │  Block Table lookup: "Seq1 needs block 3"                       │
  │       │                                                         │
  │       ▼                                                         │
  │  Physical B10 found → read K,V from B0, B2, B6, B10             │
  │       │                                                         │
  │       ▼                                                         │
  │  Attention(Q, [K0-15, K16-31, K32-47, K48-63]) → output token   │
  └─────────────────────────────────────────────────────────────────┘

  WHEN Seq1 FINISHES:
  Before: │ B0:S1 │ B2:S1 │ B6:S1 │ B10:S1 │  → all locked
  After:  │B0:FREE│B2:FREE│B6:FREE│B10:FREE│  → instantly reusable by Seq5, Seq6...
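That bookkeeping can be sketched in a few lines of Python. This is a toy illustration of the block-table idea, assuming vLLM's default 16-token blocks; the names and structure are mine, not vLLM's internals:

```python
BLOCK_SIZE = 16  # tokens per block (vLLM's default)

class BlockPool:
    """Toy shared pool of fixed-size KV blocks plus per-sequence tables."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # any free block will do
        self.tables = {}                      # seq_id -> [physical block ids]

    def append_token(self, seq_id, n_tokens_so_far):
        # A sequence claims a NEW block only when it crosses a 16-token
        # boundary; otherwise its current block still has room.
        table = self.tables.setdefault(seq_id, [])
        if n_tokens_so_far % BLOCK_SIZE == 0:
            table.append(self.free.pop())

    def release(self, seq_id):
        # When a sequence finishes, its blocks return to the pool
        # instantly, reusable by any other sequence.
        self.free.extend(self.tables.pop(seq_id))

pool = BlockPool(num_blocks=8)
for t in range(40):                  # Seq1 generates 40 tokens
    pool.append_token("seq1", t)
print(len(pool.tables["seq1"]))      # 3 blocks claimed: ceil(40 / 16)
pool.release("seq1")
print(len(pool.free))                # 8: all blocks free again
```

The blocks a sequence holds need not be adjacent, which is exactly what kills the fragmentation problem from the static picture above.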

Essentially, we bet on the fact that memory capacity and fragmentation are bigger bottlenecks than the latency of an indirect memory lookup. That bet pays off because LLM generation (decoding) is memory-bandwidth bound, not compute-bound.

  • To generate 1 token, the GPU must load the entire KV cache (history) of that sequence from HBM (High Bandwidth Memory) into the chip's registers.
  • The arithmetic (matrix multiplication) is incredibly fast. The GPU spends most of its time just waiting for the data to arrive from memory.
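To put rough numbers on that, here's a back-of-the-envelope sketch. The model shape below is an illustrative assumption, not Granite 4 350m's real config; 448 GB/s is the RTX A4000's nominal memory bandwidth:

```python
# Why decoding is bandwidth-bound: every new token re-reads the whole
# KV cache of its sequence from HBM. (Illustrative model shape.)
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_val = 2                          # float16
seq_len = 4096                             # history already generated

# K and V, for every layer, head, and position in the history:
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val
print(f"{kv_bytes / 1e9:.2f} GB read per generated token")   # 0.54 GB

# At ~448 GB/s of memory bandwidth, that read alone costs:
print(f"~{kv_bytes / 448e9 * 1e3:.2f} ms per token")         # ~1.20 ms
```

The matmul for one token over that same history finishes far faster than the read, so the memory system, not the ALUs, sets the decode speed limit.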

From reading the documentation, I was able to scribble code that looks like this:

  ## Insert the same data loading code
  from vllm import LLM, SamplingParams
  print("\n" + "="*40)
  print("3. vLLM (PagedAttention)")
  print("="*40)

  llm = LLM(model=MODEL_ID, dtype="float16") 
  sampling_params = SamplingParams(max_tokens=MAX_TOKENS, temperature=0)

  vllm_start_time = time.time()
  outputs = llm.generate(prompts, sampling_params)
  
  vllm_total_tokens = sum([len(o.outputs[0].token_ids) for o in outputs])
  vllm_duration = time.time() - vllm_start_time
  
  print(f"vLLM Time: {vllm_duration:.2f}s")
  print(f"Throughput: {vllm_total_tokens / vllm_duration:.2f} tokens/sec")

The documentation was insanely direct, and so is my code.

With all the knowledge I had at that point, I ran it and saw this:

vLLM Time: 2.74s
Throughput: 17662.27 tokens/sec

Quite impressive, right? Even theoretically I felt very convinced, and now more so after seeing the results. Is it just PagedAttention, or is there more wizardry at play? Again, the naïve approach is to literally look at the logs it generates while we run the code:

# For brevity, I will paste excerpts.

Chunked prefill is enabled with max_num_batched_tokens=8192.
...
Available KV cache memory: 12.52 GiB
GPU KV cache size: 468,736 tokens
...
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%...
Capturing CUDA graphs (decode, FULL): 100%...

With the token count and the KV cache size available on the GPU, vLLM manages to pre-allocate space for almost half a million tokens of context. This is one of the critical architectural differences.
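Those two log lines are also internally consistent, which makes for a nice sanity check:

```python
# Sanity-checking the vLLM log above: 12.52 GiB of KV cache memory vs.
# a reported capacity of 468,736 tokens.
kv_gib = 12.52
kv_tokens = 468_736

bytes_per_token = kv_gib * 2**30 / kv_tokens
print(f"~{bytes_per_token / 1024:.1f} KiB of KV cache per token")  # ~28.0 KiB
# Every token of context permanently occupies ~28 KiB of VRAM, and it's
# exactly this budget that the block pool carves into 16-token pages.
```
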

It also "captures the CUDA graphs". I cannot explain it better than the developers themselves did: vLLM Docs. I have a suspicion that this might be one of the biggest reasons for the performance boost. Essentially, if this is not compiled, every GPU operation needs to be backed by a CPU request. But by "capturing" the graph, execution stays almost entirely on the GPU as one very big operation instead of multiple small requests. PyTorch can do something similar through torch.compile(model, mode="reduce-overhead"), which actually captures CUDA Graphs under the hood. PyTorch JIT / TorchScript is a different beast: it's a compiler that converts Python/PyTorch code into a portable static representation, not a GPU kernel scheduler. I saw this good guide on it too. You can also command Torch to precompile, or run a "warm-up" cycle. But overall, this eliminates the "CPU overhead." On small models, this often doubles or triples the speed.
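A minimal sketch of the torch.compile route, using a tiny toy model (Granite would slot in the same way). This is the PyTorch-side analogue, not vLLM's mechanism; vLLM does its own CUDA Graph capture at startup:

```python
import torch

# Tiny stand-in model for illustration.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())

# mode="reduce-overhead" asks torch.compile to use CUDA Graphs when a
# GPU is present: kernel launches are recorded once, then replayed
# without per-op CPU scheduling. Compilation is lazy, happening on the
# first few forward calls (the "warm-up" cycle mentioned above).
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 64)
print(model(x).shape)  # eager baseline: torch.Size([1, 64])
```

On a GPU, after warm-up, the compiled path replays the recorded launches with next to no per-operation CPU involvement, which is exactly the overhead that dominates small-model inference.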

There are still multiple methods or reasons why one is better than the other. For example, I'd simply and safely assume that PagedAttention won't make sense in the training phase, so you'd revert to traditional systems for training. You'd require all the intermediate activations for every single layer across the entire sequence, then perform back-propagation over them. With a system like PagedAttention, where memory is not contiguous (which is the case in vLLM), that's a memory nightmare.

Training also requires a "Static Batch": In practice, training requires consistent batch shapes across steps — not because Adam or SGD mathematically demand it, but because PyTorch's computational graph expects consistent tensor shapes to correctly accumulate and apply gradients. Gradient accumulation over variable-length sequences is technically possible, but continuous batching as vLLM does it — where sequences enter and leave mid-step — would make it extremely difficult to maintain a coherent backward pass.

PyTorch builds a computational graph (autograd graph) where every operation is recorded so gradients can be computed by traversing it in reverse during backprop. vLLM simply doesn't use autograd at all — it's inference-only, so there's no need to record gradient functions.

Separately, vLLM also uses CUDA Graphs — an entirely different concept. CUDA Graphs capture a sequence of GPU kernel launches so they can be replayed without individual CPU round-trips. Without this, every GPU operation requires a CPU request to schedule it, which is a significant bottleneck on small models. These two things — skipping autograd and using CUDA Graphs — are independent optimizations that both happen to be true of vLLM, but one does not cause the other. You would have to disable all the vLLM optimizations (like CUDA Graphs) to enable autograd, which would defeat the purpose of using vLLM in the first place.

But in Training, other approaches exist like Sample Packing (or formally, Sequence Packing?) where instead of padding [Seq A, Pad, Pad] and [Seq B, Pad], they stick them together: [Seq A, EOS, Seq B, EOS, Pad]. They modify the Attention Mask so that tokens in Seq A cannot "see" tokens in Seq B, even though they are in the same physical row of memory. This achieves near 100% GPU utilization (like vLLM) but keeps the memory contiguous (unlike vLLM), keeping the back-propagation kernels happy.
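The packed mask can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not any trainer's actual (tensorized) implementation:

```python
def packed_causal_mask(seq_lens):
    """mask[i][j] == 1 iff token i may attend to token j.

    Sequences are packed back to back in one row; a token may only
    attend causally (j <= i) and only within its own sequence.
    """
    total = sum(seq_lens)
    # Label each position with the sequence it belongs to.
    owner = []
    for seq_id, length in enumerate(seq_lens):
        owner += [seq_id] * length
    return [[1 if owner[i] == owner[j] and j <= i else 0
             for j in range(total)]
            for i in range(total)]

# Seq A (3 tokens) and Seq B (2 tokens) packed into one row of 5:
for row in packed_causal_mask([3, 2]):
    print(row)
# The last row is [0, 0, 0, 1, 1]: Seq B's 2nd token sees only Seq B.
```

The result is a block-diagonal causal mask, so one physical row does the work of two padded rows while gradients still flow per-sequence.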

This is easily a rabbit hole we can slip into, and for now I'm going to climb out for another day. But hope you found this walkthrough useful. Writing it out helped me understand why vLLM feels so fundamentally different from the traditional HF stack — not just faster, but architecturally opinionated in a way that forces you to rethink what “inference” even means.

I’m nowhere near done exploring this space. There’s still continuous batching, prefix caching, speculative decoding, and a dozen other rabbit holes waiting. But for now, this was a fun detour — a naïve experiment that turned into a surprisingly deep appreciation for the engineering behind modern inference systems.