I recently started looking into LoRA + RL, and as a side project I dug into how SGLang implements LoRA serving. This post is a concise walkthrough.
TL;DR
At a high level, SGLang’s LoRA support looks like this (source: https://arxiv.org/abs/2311.03285): it separates base-model computation from LoRA-adapter computation, computes them independently, and adds them together. This enables one base model to serve multiple adapters.
To batch requests that use different adapters, SGLang uses SGMV (Segmented Gather Matrix-Vector Multiplication) (source: https://arxiv.org/pdf/2310.18547), so a mixed-adapter batch can still be handled efficiently with shared kernels.
SGMV overview
For multi-LoRA serving, SGLang keeps a LoRA pool in main memory (see figure below). It only loads adapters needed by the current workload into VRAM; the rest stay in CPU memory. To reduce loading overhead, you can enable overlap loading and prefetch adapters while the previous batch is still running.
LoRA Memory Pool
That is the short version. Now let’s break it down in more detail.
LoRA Repository
If you want background on base-model weight loading, see my other blog.
Let’s first look at what a LoRA repo contains. Using Qwen3-4B LoRA as an example, you typically get:
adapter_config.json
adapter_model.safetensors
README.md
In adapter_config.json, below are the fields most relevant to SGLang:
{
  "base_model_name_or_path": "Qwen/Qwen3-4B",  # base model name
  "lora_alpha": 32,                            # effective scaling is alpha / r
  "peft_type": "LORA",                         # must be "LORA"
  "r": 8,                                      # LoRA rank
  "target_modules": [
    "q_proj",
    "v_proj"
  ]                                            # which modules to apply LoRA to
}
Now let's check what is inside the safetensors file. LoRA follows:
W' = W + scale * (B @ A), where scale = lora_alpha / r.
So LoRA params always come in A/B pairs, e.g.:
base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight shape=(8, 2560) # 8 is lora rank (r)
base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight shape=(4096, 8)
And the corresponding base-model weight in Qwen3-4B is:
model.layers.0.self_attn.q_proj.weight shape (4096, 2560)
The shapes are compatible.
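As a quick sanity check, the shapes line up in code too. A minimal torch sketch using the numbers above (dummy weights, just to verify the shape algebra):

```python
import torch

# Shapes from the adapter inspected above (Qwen3-4B q_proj).
r, in_dim, out_dim = 8, 2560, 4096
lora_alpha = 32

W = torch.zeros(out_dim, in_dim)   # base q_proj weight, (4096, 2560)
A = torch.randn(r, in_dim)         # lora_A, (8, 2560)
B = torch.randn(out_dim, r)        # lora_B, (4096, 8)

scale = lora_alpha / r             # effective scaling: 32 / 8 = 4.0
delta = scale * (B @ A)            # (4096, 8) @ (8, 2560) -> (4096, 2560)

assert delta.shape == W.shape      # the LoRA update matches the base weight
W_merged = W + delta               # W' = W + scale * (B @ A)
```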
LoRA Module Discovery and Replacement
To load LoRA weights, SGLang mainly does two things:
- Load adapter weights to CPU and organize them (via the `LoRAAdapter` class).
- Replace matching base-model modules with LoRA-aware wrappers (core logic in `init_lora_modules` in `lora_manager.py`).
A key detail: SGLang normalizes module names first. For fused implementations (like Qwen3 attention), q_proj/k_proj/v_proj are normalized to qkv_proj, and gate_proj/up_proj to gate_up_proj. So replacement usually happens on fused module names.
Using the same example (q_proj and v_proj in LoRA config), SGLang does:
- Load LoRA adapter weights to CPU.
- Normalize target module names to fused granularity (e.g., `qkv_proj`).
- Normalize LoRA weights (`_normalize_weights`) to match the fused base-model structure:
  - Find all `q_proj` LoRA items, e.g. `base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight`.
  - Resolve `k_proj`/`v_proj` by name; if `k_proj` is missing, fill it with zeros.
  - Concatenate `q`/`k`/`v` along the output dimension into `qkv_proj`, e.g. `base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.weight`.
  - Drop the old `q_proj`/`k_proj`/`v_proj` entries and keep the fused outputs.
After these two normalization steps (module names + weights), LoRA weight granularity matches base-model module granularity, enabling stable module replacement and loading.
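The fusing step for `lora_A` can be sketched as follows. This is a hypothetical helper (`fuse_qkv_lora_a` and its arguments are my names, not SGLang's); it only illustrates the zero-fill and concatenation described above:

```python
import torch

def fuse_qkv_lora_a(weights: dict, layer: int, r: int, in_dim: int) -> torch.Tensor:
    """Concatenate q/k/v lora_A weights for one layer into a qkv_proj entry.

    Missing projections (k_proj here, since the adapter only targets
    q_proj/v_proj) are zero-filled so shapes stay consistent.
    """
    prefix = f"base_model.model.model.layers.{layer}.self_attn"
    parts = []
    for proj in ("q_proj", "k_proj", "v_proj"):
        key = f"{prefix}.{proj}.lora_A.weight"
        parts.append(weights.get(key, torch.zeros(r, in_dim)))
    # Each part is (r, in_dim); stack along the output (rank) dimension.
    return torch.cat(parts, dim=0)          # (3 * r, in_dim)

# Adapter from the example: only q_proj and v_proj are present.
r, in_dim = 8, 2560
weights = {
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight": torch.randn(r, in_dim),
    "base_model.model.model.layers.0.self_attn.v_proj.lora_A.weight": torch.randn(r, in_dim),
}
fused = fuse_qkv_lora_a(weights, layer=0, r=r, in_dim=in_dim)
assert fused.shape == (3 * r, in_dim)       # the k_proj rows are all zeros
```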
For example, in Qwen3, base weight model.layers.0.self_attn.qkv_proj.weight corresponds to QKVParallelLinear, and its corresponding LoRA wrapper is QKVParallelLinearWithLoRA (see get_lora_layer mapping in lora/layers.py).
LoRA Layers
So what exactly is different between a replaced LoRA layer and a normal layer?
At a high level, a LoRA layer is a wrapper that adds LoRA capabilities to a base layer.
Take QKVParallelLinearWithLoRA as an example (code: https://github.com/ChangyiYang/sglang-changyi/blob/main/python/sglang/srt/lora/layers.py#L421):
def __init__(
    self,
    base_layer: QKVParallelLinear,    # wrapped base target
    lora_backend: BaseLoRABackend,    # backend that runs LoRA compute
) -> None:
It is connected to LoRA memory pool buffers via set_lora_info:
def set_lora_info(
    self,
    A_buffer: torch.Tensor,
    B_buffer: torch.Tensor,
):
    self.set_lora = True
    self.A_buffer = A_buffer
    self.B_buffer = B_buffer
The key runtime path is forward (inherited from ColumnParallelLinearWithLoRA):
- Run base-layer computation.
- Run LoRA computation through backend and add it to base output.
Exactly like the first diagram.
SGMV overview
Important detail: a layer instance is not bound to one specific adapter. Multiple adapters (say 1/2/3 in the diagram) all reside in the layer's A/B buffer pool, and per-batch routing is handled by the compute backend.
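The wrapper pattern can be sketched in a few lines of torch. This is a simplified stand-in: the real layer delegates the LoRA half to a backend kernel and reads multi-adapter buffers, while this toy version binds a single A/B pair and uses plain matmuls:

```python
import torch
import torch.nn as nn

class LinearWithLoRA(nn.Module):
    """Minimal sketch of the wrapper pattern: base output + LoRA delta."""

    def __init__(self, base_layer: nn.Linear):
        super().__init__()
        self.base_layer = base_layer
        self.set_lora = False
        self.A_buffer = None   # bound later via set_lora_info
        self.B_buffer = None
        self.scaling = 1.0

    def set_lora_info(self, A_buffer, B_buffer, scaling):
        # Simplified: the real set_lora_info binds pool-managed buffers.
        self.set_lora = True
        self.A_buffer = A_buffer
        self.B_buffer = B_buffer
        self.scaling = scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base_layer(x)                 # 1. base-layer computation
        if self.set_lora:
            lora_a = x @ self.A_buffer.T         # shrink: (s, r)
            out = out + self.scaling * (lora_a @ self.B_buffer.T)  # expand + add
        return out

torch.manual_seed(0)
base = nn.Linear(16, 32, bias=False)
layer = LinearWithLoRA(base)
x = torch.randn(4, 16)

out_no_lora = layer(x)                           # no adapter bound yet
layer.set_lora_info(A_buffer=torch.randn(8, 16),
                    B_buffer=torch.zeros(32, 8),
                    scaling=4.0)
out_zero_b = layer(x)                            # B == 0, so delta is zero
```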
LoRA layers also provide TP-specific support (e.g., rank-local slicing rules). Skipped here; see https://github.com/ChangyiYang/sglang-changyi/blob/main/python/sglang/srt/lora/layers.py#L467-L495
LoRA Backend
LoRA backend is where actual LoRA compute happens. Start from the base class (BaseLoRABackend) and a minimal method:
def run_lora_a_sgemm(
    self,
    x: torch.Tensor,
    weights: torch.Tensor,
    *args,
    **kwargs,
) -> torch.Tensor:
This computes the LoRA-A part (x @ A^T).
`x` has shape `(s, input_dim)`, where `s` is the total flattened token count in the current batch. `weights` has shape `(num_lora, c * r, input_dim)`.
`num_lora` corresponds to `max_loras_per_batch` (the number of pool slots per batch).
For `r`: the tensor capacity is allocated with `max_lora_rank`, but each adapter has its own real rank (the `r` in its config), and at runtime `lora_ranks` (see the function below) applies the effective rank (`r <= max_lora_rank`).
`c` is a stack multiplier (e.g., 2 for `gate_up`, 3 for `qkv`). There are also specialized fused paths like `run_qkv_lora`.
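A naive reference for this shrink step might look like this (my own sketch with `c = 1`, in the spirit of the torch reference backend; the function name and signature are illustrative, not SGLang's API):

```python
import torch

def lora_a_ref(x, weights, seg_indptr, seg_weight_indices, lora_ranks):
    """Naive per-segment reference for the LoRA-A "shrink" step.

    x:       (s, input_dim)  flattened tokens
    weights: (num_lora, max_r, input_dim)  stacked A matrices (c = 1)
    Each segment [seg_indptr[i], seg_indptr[i+1]) uses one LoRA slot.
    """
    s, input_dim = x.shape
    out = torch.zeros(s, weights.shape[1], dtype=x.dtype)
    for i in range(len(seg_indptr) - 1):
        lo, hi = seg_indptr[i], seg_indptr[i + 1]
        slot = seg_weight_indices[i]
        # Only the adapter's real rank rows are meaningful; the rest of
        # the max_lora_rank capacity stays zero.
        r = lora_ranks[slot]
        out[lo:hi, :r] = x[lo:hi] @ weights[slot, :r, :].T
    return out

# Tiny example: 4 tokens, 2 segments; slot 0 has rank 1, slot 1 has rank 2.
x = torch.ones(4, 3)
weights = torch.ones(2, 2, 3)
out = lora_a_ref(x, weights,
                 seg_indptr=[0, 2, 4],
                 seg_weight_indices=[0, 1],
                 lora_ranks=[1, 2])
```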
Besides compute methods (e.g., run_lora_b_sgemm), backend also needs routing metadata via prepare_lora_batch:
def prepare_lora_batch(
    self,
    forward_batch: ForwardBatch,   # sequence lengths, mode, etc.
    weight_indices: list[int],     # which LoRA slot each sequence uses
    lora_ranks: list[int],         # rank per LoRA slot (real rank)
    scalings: list[float],         # scaling per LoRA slot
    use_cuda_graph: bool,
):
So per forward pass, the backend first prepares batch metadata, then runs compute; one set of batch metadata can thus be reused across different compute calls.
SGLang provides multiple backend implementations:
- `torch_backend`: naive/reference torch implementation
- `chunked_backend`: default backend (explained below)
- `triton_backend.py`: high-performance Triton implementation
- `ascend_backend.py`: Ascend NPU implementation
This backend abstraction cleanly decouples compute implementation from scheduling/memory logic.
Now let’s zoom into default ChunkedSgmvLoRABackend, based on SGMV (https://arxiv.org/abs/2310.18547). Starting from prepare_lora_batch, it does:
- Determine `chunk_size`. The goal is not simply "one kernel compute per adapter"; it is to avoid very long segments (e.g., 90% of tokens using one adapter) causing imbalance and long-tail latency. Chunking long segments improves parallelism and stabilizes throughput/latency.
- Logical reordering (group by adapter). At the token level, it computes a `weight_index` per token, then `argsort`s to get a `permutation`, so tokens for the same adapter become logically contiguous. Example: adapter IDs `[1,1,0,1,0]` become `[0,0,1,1,1]`, with `permutation=[2,4,0,1,3]` (`permutation[logical_idx] = physical_idx`). This is not a physical tensor reorder; it is an index mapping used by the kernels for indirect reads/writes.
- Build segment metadata (`seg_weight_indices`, `seg_indptr`) and pack `batch_info`. A segment is a group of tokens using one LoRA slot. Chunking happens here: long groups are split by `chunk_size`. `seg_indptr` stores CSR-style segment boundaries; `seg_weight_indices` stores the LoRA slot per segment.
- Copy rank/scaling and metadata to GPU. CPU tensors use `pin_memory=True` and GPU copies use `non_blocking=True` for efficient asynchronous H2D transfer.
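The reorder + chunking steps can be sketched as follows (a hypothetical `build_segments` helper, not SGLang's exact code):

```python
import torch

def build_segments(weight_index: torch.Tensor, chunk_size: int):
    """Group tokens by adapter slot, then split long groups by chunk_size.

    Returns (permutation, seg_weight_indices, seg_indptr) in the CSR-style
    layout described above.
    """
    # Stable sort so tokens of the same adapter become logically contiguous.
    permutation = torch.argsort(weight_index, stable=True)
    sorted_idx = weight_index[permutation]

    seg_weight_indices, seg_indptr = [], [0]
    start = 0
    for end in range(1, len(sorted_idx) + 1):
        # Close a segment at a group boundary or when chunk_size is reached.
        boundary = end == len(sorted_idx) or sorted_idx[end] != sorted_idx[start]
        if end - start == chunk_size or boundary:
            seg_weight_indices.append(int(sorted_idx[start]))
            seg_indptr.append(end)
            start = end
    return permutation, seg_weight_indices, seg_indptr

# Example from the text: adapter IDs [1,1,0,1,0].
perm, seg_w, indptr = build_segments(torch.tensor([1, 1, 0, 1, 0]), chunk_size=2)
# perm == [2, 4, 0, 1, 3]; the slot-1 group of 3 tokens is split into 2 + 1.
```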
The comments in this file are very detailed: https://github.com/ChangyiYang/sglang-changyi/blob/main/python/sglang/srt/lora/backend/chunked_backend.py
After metadata is ready, kernels run. For example, run_lora_a_sgemm calls chunked_sgmv_lora_shrink_forward with x, weights, and prepared metadata. Conceptually, reorder + chunking enables one kernel pipeline to execute mixed-adapter LoRA workloads efficiently.
Kernel entry (for interested readers): https://github.com/ChangyiYang/sglang-changyi/blob/main/python/sglang/srt/lora/triton_ops/chunked_sgmv_shrink.py#L122
LoRA Manager + Memory Pool
So far we have acted as if all LoRA weights were already sitting in one ready-to-use tensor (i.e., the `A_buffer` in a LoRA layer or `weights` in the LoRA backend). In reality, multi-LoRA serving may involve hundreds or thousands of adapters, while each batch typically uses only a small subset. So we keep most adapters on CPU and only load the active ones to GPU. That is what the LoRA Manager + Memory Pool handle.
LoRA Memory Pool
At initialization, LoRA Manager preloads adapters specified in --lora-paths to CPU and also performs module replacement.
Memory Pool (https://github.com/ChangyiYang/sglang-changyi/blob/main/python/sglang/srt/lora/mem_pool.py) handles CPU/GPU residency, slot assignment, and eviction.
LoRA Manager and Memory Pool initialize pool sizes based on max_loras_per_batch, max_lora_rank, and base-model hidden dimensions.
When preparing a sample batch:
- In non-overlap mode, `ForwardBatch` calls LoRA Manager's `fetch_new_loras` with the required LoRA IDs. In overlap mode (`--enable-lora-overlap-loading`), this fetch is mainly triggered earlier by the scheduler via `LoRAOverlapLoader` (details later), and the path in `forward_batch_info` is skipped.
- LoRA Manager calls Memory Pool's `prepare_lora_batch`.
- In `prepare_lora_batch`, Memory Pool allocates slots for new adapters on the GPU; if the pool is full, it evicts by policy (e.g., LRU). LoRA layers were bound to stable buffer references at init/update time (`a_buffer` and `b_buffer`), so this phase only updates slot contents; no layer pointer rebinding is needed.
- Now that the needed adapter weights are on the GPU, the LoRA layer calls the backend for compute.
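Slot assignment with LRU eviction can be modeled in a few lines. This toy sketch only tracks occupancy; the real pool also copies adapter weights into preallocated GPU buffers:

```python
from collections import OrderedDict

class LoRASlotPool:
    """Toy model of GPU slot assignment with LRU eviction."""

    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self.slots = OrderedDict()   # adapter_id -> slot index, in LRU order

    def prepare(self, adapter_id: str) -> int:
        if adapter_id in self.slots:
            self.slots.move_to_end(adapter_id)        # refresh LRU position
            return self.slots[adapter_id]
        if len(self.slots) < self.num_slots:
            slot = len(self.slots)                     # free slot available
        else:
            _, slot = self.slots.popitem(last=False)   # evict least-recent
        self.slots[adapter_id] = slot
        return slot

pool = LoRASlotPool(num_slots=2)
pool.prepare("a")
pool.prepare("b")
pool.prepare("a")            # touch "a" so "b" becomes least-recent
slot = pool.prepare("c")     # pool is full: evicts "b", reuses its slot
```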
One subtle difference vs base-model weights: LoRA weights on GPU are not owned by a specific nn.Module parameter. LoRA layers typically just hold tensor references into pool-managed buffers, which keeps memory management flexible (where weights live, when they are swapped/evicted, and how they are updated).
LoRA Pre-Fetch (Overlap Loading)
Now the obvious concern: CPU->GPU LoRA loading can land on the inference critical path and increase latency.
SGLang provides --enable-lora-overlap-loading (core class is LoRAOverlapLoader: https://github.com/ChangyiYang/sglang-changyi/blob/main/python/sglang/srt/lora/lora_overlap_loader.py).
The idea:
- SGLang already overlaps CPU scheduling with GPU execution (previous batch still running while next batch is prepared).
- During this overlap window, start LoRA adapter migration early.
Operationally:
- In the scheduler's `_get_new_batch_prefill_raw`, while scanning the waiting queue, call `try_overlap_load_lora` to check each adapter's status.
- If an adapter is `LOADING`, the request is skipped for this round; if `LOADED`, the request can keep participating in this round's batching.
- If an adapter is `NOT_LOADED` and `validate_lora_batch(new_lora_set)` passes (capacity + pinned constraints), trigger async loading. The pool may still evict even when no empty slot exists.
- The overlap loader uses a dedicated CUDA stream for loading and records an event to track completion.
With this design, first encounters of NOT_LOADED adapters are usually skipped once and trigger async loading. But if the same adapter has already been loaded by earlier requests, the request may enter batch immediately. Net effect: load time is often overlapped with previous GPU compute, reducing (not absolutely eliminating) impact on critical path.
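The state transitions can be sketched as a small state machine. The class and method names here are illustrative; the real loader runs the copy on a dedicated CUDA stream and marks an adapter `LOADED` when the recorded event completes, instead of the explicit `finish_load` call below:

```python
from enum import Enum, auto

class AdapterState(Enum):
    NOT_LOADED = auto()
    LOADING = auto()
    LOADED = auto()

class OverlapLoaderSketch:
    """Toy version of the scheduling-side adapter state machine."""

    def __init__(self):
        self.state = {}

    def try_overlap_load(self, adapter_id: str) -> bool:
        """Return True if the request may join this round's batch."""
        st = self.state.get(adapter_id, AdapterState.NOT_LOADED)
        if st is AdapterState.LOADED:
            return True
        if st is AdapterState.NOT_LOADED:
            # capacity/pinned checks (validate_lora_batch) would go here
            self.state[adapter_id] = AdapterState.LOADING  # kick off async copy
        return False            # LOADING: skip this round, retry next round

    def finish_load(self, adapter_id: str):
        # Stand-in for the CUDA event signaling that the H2D copy is done.
        self.state[adapter_id] = AdapterState.LOADED

loader = OverlapLoaderSketch()
assert loader.try_overlap_load("x") is False   # first sight: triggers load
loader.finish_load("x")                        # async copy completes
assert loader.try_overlap_load("x") is True    # now it batches immediately
```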
Great design.
Summary
One-line role summary for each component:
- LoRA Adapter: loads LoRA weights to CPU and organizes them as adapter objects.
- LoRA Layer: wraps base layers and injects LoRA compute into forward.
- LoRA Backend: owns kernel implementation and per-batch metadata execution semantics.
- LoRA Manager: orchestrates lifecycle and scheduling-level adapter selection.
- Memory Pool: manages adapter residency, slot assignment, loading, and eviction across CPU/GPU.
- LoRA Overlap Loader: asynchronously preloads adapters on a dedicated stream to reduce critical-path loading cost.
Overall, SGLang’s LoRA design is elegant because it cleanly decouples concerns: memory management via Memory Pool, compute via Backend, and orchestration via Manager. The result is strong reuse and practical multi-LoRA serving.
That said, one clear challenge remains: to support LoRA for a model family, you need LoRA-aware wrappers for every relevant base-layer type. This is one reason SGLang has not yet fully landed MoE LoRA support; MoE is also more complex due to fused kernels. Progress is here (near completion): https://github.com/sgl-project/sglang/pull/14105
Finally, hats off to the infra folks once again!