[Tech] How Does SGLang Actually Load Model Weights?
When you spin up an LLM with SGLang, you get a satisfying sequence: progress bars flying, safetensors loading, and then, boom, the model starts generating tokens. But have you ever wondered what's actually happening under the hood? From allocating GPU memory, to reading weights off disk, to wiring everything together so the model can actually run inference, there's a lot going on. Specifically, questions like:

- How does SGLang know which module to use to load a given model?
- How does it know which weight in a safetensors file maps to which module?
- Even if it knows the module, how does it know which parameter inside that module a weight belongs to?
- A single module (like an MLP layer) gets reused across many layers, so how does each instance know which weights are "its own"?

I recently had to dig into all of this while adding support for a new model in SGLang, so let's walk through it together. ...