vLLM
High-throughput, memory-efficient inference engine for LLMs
About
vLLM is an inference and serving engine for large language models that uses PagedAttention to manage attention key-value (KV) cache memory efficiently. It supports a wide range of open-source models across hardware platforms including NVIDIA GPUs, AMD GPUs, and Apple silicon.
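To make the workflow concrete, below is a minimal offline-inference sketch using vLLM's Python API, following the quickstart pattern from the vLLM docs. The model name facebook/opt-125m is only an example; any supported model identifier can be substituted.

```python
from vllm import LLM, SamplingParams

# Example prompts to generate completions for.
prompts = [
    "The capital of France is",
    "PagedAttention improves LLM serving by",
]

# Sampling configuration; values here are illustrative.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# The LLM class loads the model and manages KV-cache memory
# with PagedAttention under the hood.
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in one batched call.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For online serving, the same engine can also be exposed as an OpenAI-compatible HTTP server via the `vllm serve` command.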