vLLM

High-throughput, memory-efficient inference engine for LLMs

About

vLLM is an inference and serving engine for large language models (LLMs). It uses PagedAttention to manage the attention key-value cache in paged blocks, reducing memory waste and enabling high-throughput generation. It supports a wide range of open-source models on hardware platforms including NVIDIA GPUs, AMD GPUs, and Apple Silicon.
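
As a quick illustration, the snippet below runs offline batched generation through vLLM's Python entry point. It is a minimal sketch in the style of the library's quickstart, using the `LLM` and `SamplingParams` classes; the model name is just an example, and any supported Hugging Face model ID can be substituted.

```python
# Minimal sketch of offline batched inference with vLLM's Python API.
# The model ID below is illustrative; swap in any supported model.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Explain PagedAttention in one sentence:",
]

# Sampling settings for generation: temperature, nucleus sampling, length cap.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Loading the model allocates the paged KV-cache blocks that PagedAttention manages.
llm = LLM(model="facebook/opt-125m")

# generate() batches and schedules all prompts together for high throughput.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For serving rather than offline use, vLLM also ships an OpenAI-compatible HTTP server (e.g., `vllm serve <model>` in recent releases), which exposes the same engine over a REST API.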