I recently started looking into LoRA + RL, and as a side project I dug into how SGLang implements LoRA serving. This post is a concise walkthrough.
TL;DR
At a high level, SGLang’s LoRA support looks like this (source: https://arxiv.org/abs/2311.03285): it separates base-model computation from LoRA-adapter computation, computes them independently, and adds them together. This enables one base model to serve multiple adapters.
To batch requests that use different adapters, SGLang uses SGMV (Segmented Gather Matrix-Vector Multiplication) (source: https://arxiv.org/pdf/2310.18547), so a mixed-adapter batch can still be handled efficiently with shared kernels.
SGMV overview
For multi-LoRA serving, SGLang keeps a LoRA pool in main memory (see figure below). It only loads adapters needed by the current workload into VRAM; the rest stay in CPU memory. To reduce loading overhead, you can enable overlap loading and prefetch adapters while the previous batch is still running.
LoRA Memory Pool
That is the short version. Now let’s break it down in more detail.
LoRA Repository
If you want background on base-model weight loading, see my other blog.
Let’s first look at what a LoRA repo contains. Using Qwen3-4B LoRA as an example, you typically get:
adapter_config.json
adapter_model.safetensors
README.md
In adapter_config.json, below are the fields most relevant to SGLang:
{
  "base_model_name_or_path": "Qwen/Qwen3-4B",  # base model name
  "lora_alpha": 32,                            # effective scaling is alpha / r
  "peft_type": "LORA",                         # must be "LORA"
  "r": 8,                                      # LoRA rank
  "target_modules": [
    "q_proj",
    "v_proj"
  ]                                            # which modules to apply LoRA to
}
Now let's check what is inside the safetensors file. LoRA follows:
W' = W + scale * (B @ A), where scale = lora_alpha / r.
So LoRA params always come in A/B pairs, e.g.:
base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight shape=(8, 2560) # 8 is lora rank (r)
base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight shape=(4096, 8)
And the corresponding base-model weight in Qwen3-4B is:
model.layers.0.self_attn.q_proj.weight shape (4096, 2560)
The shapes are compatible.
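As a quick sanity check, the shapes line up in code too. A minimal torch sketch using the numbers above (dummy weights, just to verify the shape algebra):

```python
import torch

# Shapes from the adapter inspected above (Qwen3-4B q_proj).
r, in_dim, out_dim = 8, 2560, 4096
lora_alpha = 32

W = torch.zeros(out_dim, in_dim)   # base q_proj weight, (4096, 2560)
A = torch.randn(r, in_dim)         # lora_A, (8, 2560)
B = torch.randn(out_dim, r)        # lora_B, (4096, 8)

scale = lora_alpha / r             # effective scaling: 32 / 8 = 4.0
delta = scale * (B @ A)            # (4096, 8) @ (8, 2560) -> (4096, 2560)

assert delta.shape == W.shape      # the LoRA update matches the base weight
W_merged = W + delta               # W' = W + scale * (B @ A)
```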
LoRA Module Discovery and Replacement
To load LoRA weights, SGLang mainly does two things:
- Load adapter weights to CPU and organize them (via the `LoRAAdapter` class).
- Replace matching base-model modules with LoRA-aware wrappers (core logic in `init_lora_modules` in `lora_manager.py`).
A key detail: SGLang normalizes module names first. For fused implementations (like Qwen3 attention), q_proj/k_proj/v_proj are normalized to qkv_proj, and gate_proj/up_proj to gate_up_proj. So replacement usually happens on fused module names.
Using the same example (q_proj and v_proj in LoRA config), SGLang does:
- Load LoRA adapter weights to CPU.
- Normalize target module names to fused granularity (e.g., `qkv_proj`).
- Normalize LoRA weights (`_normalize_weights`) to match the fused base-model structure:
  - Find all `q_proj` LoRA items, e.g. `base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight`.
  - Resolve `k_proj`/`v_proj` by name; if `k_proj` is missing, fill it with zeros.
  - Concatenate `q`/`k`/`v` along the output dimension into `qkv_proj`, e.g. `base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.weight`.
  - Drop the old `q_proj`/`k_proj`/`v_proj` entries and keep the fused outputs.
After these two normalization steps (module names + weights), LoRA weight granularity matches base-model module granularity, enabling stable module replacement and loading.
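The fusing step for `lora_A` can be sketched as follows. This is a hypothetical helper (`fuse_qkv_lora_a` and its arguments are my names, not SGLang's); it only illustrates the zero-fill and concatenation described above:

```python
import torch

def fuse_qkv_lora_a(weights: dict, layer: int, r: int, in_dim: int) -> torch.Tensor:
    """Concatenate q/k/v lora_A weights for one layer into a qkv_proj entry.

    Missing projections (k_proj here, since the adapter only targets
    q_proj/v_proj) are zero-filled so shapes stay consistent.
    """
    prefix = f"base_model.model.model.layers.{layer}.self_attn"
    parts = []
    for proj in ("q_proj", "k_proj", "v_proj"):
        key = f"{prefix}.{proj}.lora_A.weight"
        parts.append(weights.get(key, torch.zeros(r, in_dim)))
    # Each part is (r, in_dim); stack along the output (rank) dimension.
    return torch.cat(parts, dim=0)          # (3 * r, in_dim)

# Adapter from the example: only q_proj and v_proj are present.
r, in_dim = 8, 2560
weights = {
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight": torch.randn(r, in_dim),
    "base_model.model.model.layers.0.self_attn.v_proj.lora_A.weight": torch.randn(r, in_dim),
}
fused = fuse_qkv_lora_a(weights, layer=0, r=r, in_dim=in_dim)
assert fused.shape == (3 * r, in_dim)       # the k_proj rows are all zeros
```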
For example, in Qwen3, base weight model.layers.0.self_attn.qkv_proj.weight corresponds to QKVParallelLinear, and its corresponding LoRA wrapper is QKVParallelLinearWithLoRA (see get_lora_layer mapping in lora/layers.py).
LoRA Layers
So what exactly is different between a replaced LoRA layer and a normal layer?
At a high level, a LoRA layer is a wrapper that adds LoRA capabilities to a base layer.
Take QKVParallelLinearWithLoRA as an example (code: https://github.com/ChangyiYang/sglang-changyi/blob/main/python/sglang/srt/lora/layers.py#L421):
def __init__(
    self,
    base_layer: QKVParallelLinear,    # wrapped base target
    lora_backend: BaseLoRABackend,    # backend that runs LoRA compute
) -> None:
It is connected to LoRA memory pool buffers via set_lora_info:
def set_lora_info(
    self,
    A_buffer: torch.Tensor,
    B_buffer: torch.Tensor,
):
    self.set_lora = True
    self.A_buffer = A_buffer
    self.B_buffer = B_buffer
The key runtime path is forward (inherited from ColumnParallelLinearWithLoRA):
- Run base-layer computation.
- Run LoRA computation through backend and add it to base output.
Exactly like the first diagram.
SGMV overview
Important detail: a layer instance is not bound to one specific adapter. Multiple adapters (say 1/2/3 in the diagram) all reside in the layer's A/B buffer pool, and per-batch routing is handled by the compute backend.
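The wrapper pattern can be sketched in a few lines of torch. This is a simplified stand-in: the real layer delegates the LoRA half to a backend kernel and reads multi-adapter buffers, while this toy version binds a single A/B pair and uses plain matmuls:

```python
import torch
import torch.nn as nn

class LinearWithLoRA(nn.Module):
    """Minimal sketch of the wrapper pattern: base output + LoRA delta."""

    def __init__(self, base_layer: nn.Linear):
        super().__init__()
        self.base_layer = base_layer
        self.set_lora = False
        self.A_buffer = None   # bound later via set_lora_info
        self.B_buffer = None
        self.scaling = 1.0

    def set_lora_info(self, A_buffer, B_buffer, scaling):
        # Simplified: the real set_lora_info binds pool-managed buffers.
        self.set_lora = True
        self.A_buffer = A_buffer
        self.B_buffer = B_buffer
        self.scaling = scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base_layer(x)                 # 1. base-layer computation
        if self.set_lora:
            lora_a = x @ self.A_buffer.T         # shrink: (s, r)
            out = out + self.scaling * (lora_a @ self.B_buffer.T)  # expand + add
        return out

torch.manual_seed(0)
base = nn.Linear(16, 32, bias=False)
layer = LinearWithLoRA(base)
x = torch.randn(4, 16)

out_no_lora = layer(x)                           # no adapter bound yet
layer.set_lora_info(A_buffer=torch.randn(8, 16),
                    B_buffer=torch.zeros(32, 8),
                    scaling=4.0)
out_zero_b = layer(x)                            # B == 0, so delta is zero
```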
LoRA layers also provide TP-specific support (e.g., rank-local slicing rules). Skipped here; see https://github.com/ChangyiYang/sglang-changyi/blob/main/python/sglang/srt/lora/layers.py#L467-L495
LoRA Backend
LoRA backend is where actual LoRA compute happens. Start from the base class (BaseLoRABackend) and a minimal method:
def run_lora_a_sgemm(
    self,
    x: torch.Tensor,
    weights: torch.Tensor,
    *args,
    **kwargs,
) -> torch.Tensor:
This computes the LoRA-A part (x @ A^T).
`x` has shape `(s, input_dim)`, where `s` is the total flattened token count in the current batch. `weights` has shape `(num_lora, c * r, input_dim)`.
`num_lora` corresponds to `max_loras_per_batch` (the number of pool slots per batch).
For `r`: the tensor capacity is allocated with `max_lora_rank`, but each adapter has its own real rank (the `r` in its config), and at runtime `lora_ranks` (see the function below) applies the effective rank (`r <= max_lora_rank`).
`c` is a stack multiplier (e.g., 2 for `gate_up`, 3 for `qkv`). There are also specialized fused paths like `run_qkv_lora`.
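A naive reference for this shrink step might look like this (my own sketch with `c = 1`, in the spirit of the torch reference backend; the function name and signature are illustrative, not SGLang's API):

```python
import torch

def lora_a_ref(x, weights, seg_indptr, seg_weight_indices, lora_ranks):
    """Naive per-segment reference for the LoRA-A "shrink" step.

    x:       (s, input_dim)  flattened tokens
    weights: (num_lora, max_r, input_dim)  stacked A matrices (c = 1)
    Each segment [seg_indptr[i], seg_indptr[i+1]) uses one LoRA slot.
    """
    s, input_dim = x.shape
    out = torch.zeros(s, weights.shape[1], dtype=x.dtype)
    for i in range(len(seg_indptr) - 1):
        lo, hi = seg_indptr[i], seg_indptr[i + 1]
        slot = seg_weight_indices[i]
        # Only the adapter's real rank rows are meaningful; the rest of
        # the max_lora_rank capacity stays zero.
        r = lora_ranks[slot]
        out[lo:hi, :r] = x[lo:hi] @ weights[slot, :r, :].T
    return out

# Tiny example: 4 tokens, 2 segments; slot 0 has rank 1, slot 1 has rank 2.
x = torch.ones(4, 3)
weights = torch.ones(2, 2, 3)
out = lora_a_ref(x, weights,
                 seg_indptr=[0, 2, 4],
                 seg_weight_indices=[0, 1],
                 lora_ranks=[1, 2])
```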
Besides compute methods (e.g., run_lora_b_sgemm), backend also needs routing metadata via prepare_lora_batch:
def prepare_lora_batch(
    self,
    forward_batch: ForwardBatch,   # sequence lengths, mode, etc.
    weight_indices: list[int],     # which LoRA slot each sequence uses
    lora_ranks: list[int],         # rank per LoRA slot (real rank)
    scalings: list[float],         # scaling per LoRA slot
    use_cuda_graph: bool,
):
So per forward pass, the backend first prepares batch metadata, then runs compute; one set of batch metadata can thus be reused across different compute calls.
SGLang provides multiple backend implementations:
- `torch_backend`: naive/reference torch implementation
- `chunked_backend`: default backend (explained below)
- `triton_backend.py`: high-performance Triton implementation
- `ascend_backend.py`: Ascend NPU implementation
This backend abstraction cleanly decouples compute implementation from scheduling/memory logic.
Now let’s zoom into default ChunkedSgmvLoRABackend, based on SGMV (https://arxiv.org/abs/2310.18547). Starting from prepare_lora_batch, it does:
- Determine `chunk_size`. The goal is not simply "one kernel compute per adapter"; it is to avoid very long segments (e.g., 90% of tokens using one adapter) causing imbalance and long-tail latency. Chunking long segments improves parallelism and stabilizes throughput/latency.
- Logical reordering (group by adapter). At the token level, it computes a `weight_index` per token, then `argsort`s to get a `permutation`, so tokens for the same adapter become logically contiguous. Example: adapter IDs `[1,1,0,1,0]` become `[0,0,1,1,1]`, with `permutation=[2,4,0,1,3]` (`permutation[logical_idx] = physical_idx`). This is not a physical tensor reorder; it is an index mapping used by the kernels for indirect reads/writes.
- Build segment metadata (`seg_weight_indices`, `seg_indptr`) and pack `batch_info`. A segment is a group of tokens using one LoRA slot. Chunking happens here: long groups are split by `chunk_size`. `seg_indptr` stores CSR-style segment boundaries; `seg_weight_indices` stores the LoRA slot per segment.
- Copy rank/scaling and metadata to GPU. CPU tensors use `pin_memory=True` and GPU copies use `non_blocking=True` for efficient asynchronous H2D transfer.
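The reorder + chunking steps can be sketched as follows (a hypothetical `build_segments` helper, not SGLang's exact code):

```python
import torch

def build_segments(weight_index: torch.Tensor, chunk_size: int):
    """Group tokens by adapter slot, then split long groups by chunk_size.

    Returns (permutation, seg_weight_indices, seg_indptr) in the CSR-style
    layout described above.
    """
    # Stable sort so tokens of the same adapter become logically contiguous.
    permutation = torch.argsort(weight_index, stable=True)
    sorted_idx = weight_index[permutation]

    seg_weight_indices, seg_indptr = [], [0]
    start = 0
    for end in range(1, len(sorted_idx) + 1):
        # Close a segment at a group boundary or when chunk_size is reached.
        boundary = end == len(sorted_idx) or sorted_idx[end] != sorted_idx[start]
        if end - start == chunk_size or boundary:
            seg_weight_indices.append(int(sorted_idx[start]))
            seg_indptr.append(end)
            start = end
    return permutation, seg_weight_indices, seg_indptr

# Example from the text: adapter IDs [1,1,0,1,0].
perm, seg_w, indptr = build_segments(torch.tensor([1, 1, 0, 1, 0]), chunk_size=2)
# perm == [2, 4, 0, 1, 3]; the slot-1 group of 3 tokens is split into 2 + 1.
```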
The comments in this file are very detailed: https://github.com/ChangyiYang/sglang-changyi/blob/main/python/sglang/srt/lora/backend/chunked_backend.py
After metadata is ready, kernels run. For example, run_lora_a_sgemm calls chunked_sgmv_lora_shrink_forward with x, weights, and prepared metadata. Conceptually, reorder + chunking enables one kernel pipeline to execute mixed-adapter LoRA workloads efficiently.
Kernel entry (for interested readers): https://github.com/ChangyiYang/sglang-changyi/blob/main/python/sglang/srt/lora/triton_ops/chunked_sgmv_shrink.py#L122
LoRA Manager + Memory Pool
So far we have acted as if all LoRA weights were already sitting in one ready-to-use tensor (i.e., the `A_buffer` in a LoRA layer or `weights` in the LoRA backend). In reality, multi-LoRA serving may involve hundreds or thousands of adapters, while each batch typically uses only a small subset. So we keep most adapters on CPU and only load the active ones to GPU. That is what the LoRA Manager + Memory Pool handle.
LoRA Memory Pool
At initialization, LoRA Manager preloads adapters specified in --lora-paths to CPU and also performs module replacement.
Memory Pool (https://github.com/ChangyiYang/sglang-changyi/blob/main/python/sglang/srt/lora/mem_pool.py) handles CPU/GPU residency, slot assignment, and eviction.
LoRA Manager and Memory Pool initialize pool sizes based on max_loras_per_batch, max_lora_rank, and base-model hidden dimensions.
When preparing a sample batch:
- In non-overlap mode, `ForwardBatch` calls LoRA Manager's `fetch_new_loras` with the required LoRA IDs. In overlap mode (`--enable-lora-overlap-loading`), this fetch is mainly triggered earlier by the scheduler via `LoRAOverlapLoader` (details later), and the path in `forward_batch_info` is skipped.
- LoRA Manager calls Memory Pool's `prepare_lora_batch`.
- In `prepare_lora_batch`, Memory Pool allocates slots for new adapters on the GPU; if the pool is full, it evicts by policy (e.g., LRU). LoRA layers were bound to stable buffer references at init/update time (`a_buffer` and `b_buffer`), so this phase only updates slot contents; no layer pointer rebinding is needed.
- Now that the needed adapter weights are on the GPU, the LoRA layer calls the backend for compute.
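Slot assignment with LRU eviction can be modeled in a few lines. This toy sketch only tracks occupancy; the real pool also copies adapter weights into preallocated GPU buffers:

```python
from collections import OrderedDict

class LoRASlotPool:
    """Toy model of GPU slot assignment with LRU eviction."""

    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self.slots = OrderedDict()   # adapter_id -> slot index, in LRU order

    def prepare(self, adapter_id: str) -> int:
        if adapter_id in self.slots:
            self.slots.move_to_end(adapter_id)        # refresh LRU position
            return self.slots[adapter_id]
        if len(self.slots) < self.num_slots:
            slot = len(self.slots)                     # free slot available
        else:
            _, slot = self.slots.popitem(last=False)   # evict least-recent
        self.slots[adapter_id] = slot
        return slot

pool = LoRASlotPool(num_slots=2)
pool.prepare("a")
pool.prepare("b")
pool.prepare("a")            # touch "a" so "b" becomes least-recent
slot = pool.prepare("c")     # pool is full: evicts "b", reuses its slot
```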
One subtle difference vs base-model weights: LoRA weights on GPU are not owned by a specific nn.Module parameter. LoRA layers typically just hold tensor references into pool-managed buffers, which keeps memory management flexible (where weights live, when they are swapped/evicted, and how they are updated).
LoRA Pre-Fetch (Overlap Loading)
Now the obvious concern: CPU->GPU LoRA loading can land on the inference critical path and increase latency.
SGLang provides --enable-lora-overlap-loading (core class is LoRAOverlapLoader: https://github.com/ChangyiYang/sglang-changyi/blob/main/python/sglang/srt/lora/lora_overlap_loader.py).
The idea:
- SGLang already overlaps CPU scheduling with GPU execution (previous batch still running while next batch is prepared).
- During this overlap window, start LoRA adapter migration early.
Operationally:
- In the scheduler's `_get_new_batch_prefill_raw`, while scanning the waiting queue, call `try_overlap_load_lora` to check each adapter's status.
- If an adapter is `LOADING`, the request is skipped for this round; if `LOADED`, the request can keep participating in this round's batching.
- If an adapter is `NOT_LOADED` and `validate_lora_batch(new_lora_set)` passes (capacity + pinned constraints), trigger async loading. The pool may still evict even when no empty slot exists.
- The overlap loader uses a dedicated CUDA stream for loading and records an event to track completion.
With this design, first encounters of NOT_LOADED adapters are usually skipped once and trigger async loading. But if the same adapter has already been loaded by earlier requests, the request may enter batch immediately. Net effect: load time is often overlapped with previous GPU compute, reducing (not absolutely eliminating) impact on critical path.
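The state transitions can be sketched as a small state machine. The class and method names here are illustrative; the real loader runs the copy on a dedicated CUDA stream and marks an adapter `LOADED` when the recorded event completes, instead of the explicit `finish_load` call below:

```python
from enum import Enum, auto

class AdapterState(Enum):
    NOT_LOADED = auto()
    LOADING = auto()
    LOADED = auto()

class OverlapLoaderSketch:
    """Toy version of the scheduling-side adapter state machine."""

    def __init__(self):
        self.state = {}

    def try_overlap_load(self, adapter_id: str) -> bool:
        """Return True if the request may join this round's batch."""
        st = self.state.get(adapter_id, AdapterState.NOT_LOADED)
        if st is AdapterState.LOADED:
            return True
        if st is AdapterState.NOT_LOADED:
            # capacity/pinned checks (validate_lora_batch) would go here
            self.state[adapter_id] = AdapterState.LOADING  # kick off async copy
        return False            # LOADING: skip this round, retry next round

    def finish_load(self, adapter_id: str):
        # Stand-in for the CUDA event signaling that the H2D copy is done.
        self.state[adapter_id] = AdapterState.LOADED

loader = OverlapLoaderSketch()
assert loader.try_overlap_load("x") is False   # first sight: triggers load
loader.finish_load("x")                        # async copy completes
assert loader.try_overlap_load("x") is True    # now it batches immediately
```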
Great design.
Summary
One-line role summary for each component:
- LoRA Adapter: loads LoRA weights to CPU and organizes them as adapter objects.
- LoRA Layer: wraps base layers and injects LoRA compute into forward.
- LoRA Backend: owns kernel implementation and per-batch metadata execution semantics.
- LoRA Manager: orchestrates lifecycle and scheduling-level adapter selection.
- Memory Pool: manages adapter residency, slot assignment, loading, and eviction across CPU/GPU.
- LoRA Overlap Loader: asynchronously preloads adapters on a dedicated stream to reduce critical-path loading cost.
Overall, SGLang’s LoRA design is elegant because it cleanly decouples concerns: memory management via Memory Pool, compute via Backend, and orchestration via Manager. The result is strong reuse and practical multi-LoRA serving.
That said, one clear challenge remains: to support LoRA for a model family, you need LoRA-aware wrappers for every relevant base-layer type. This is one reason SGLang has not yet fully landed MoE LoRA support; MoE is also more complex due to fused kernels. Progress is here (near completion): https://github.com/sgl-project/sglang/pull/14105
Finally, hats off to the infra folks once again!