vLLM Docker#

Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for inference and serving of large language models (LLMs).

Docker#

GPU#

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-v0.1
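
Once the container is up, it serves an OpenAI-compatible API on port 8000. Below is a minimal sketch of a request using the openai Python client (pip install openai); since Mistral-7B-v0.1 is a base model rather than an instruct model, the plain completions endpoint is used, and the prompt is just an illustration:

from openai import OpenAI

# The server does not verify the API key by default; any placeholder works
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)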

CPU#

docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
docker run -it \
             --rm \
             --network=host \
             --cpuset-cpus=<cpu-id-list, optional> \
             --cpuset-mems=<memory-node, optional> \
             vllm-cpu-env
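
With --network=host the server is reachable on localhost:8000, assuming the CPU image's entrypoint launches the same OpenAI-compatible server as the GPU image. A quick liveness check, sketched with the requests library:

import requests

# List the models the server is currently serving (host/port follow the run command above)
resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])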

Windows#

docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host -v /d/models:/models vllm/vllm-openai --model /models/Qwen/Qwen2.5-0.5B --served-model-name Qwen/Qwen2.5-0.5B

pip#

python -m venv venv
# macOS/Linux
source venv/bin/activate
# Windows
venv\Scripts\activate
pip install vllm
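
With vLLM installed in the venv, you can also run offline inference directly, without the HTTP server. A minimal sketch using the LLM entrypoint (the model name here is only an example):

from vllm import LLM, SamplingParams

# Downloads the model from the Hugging Face Hub on first use
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)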

API#

FastAPI: the OpenAI-compatible server (vllm.entrypoints.openai.api_server, which the Docker images above run) is built on FastAPI and exposes routes such as /v1/models, /v1/completions, and /v1/chat/completions.
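
A sketch of a chat request against the /v1/chat/completions route, assuming the server was started with a chat-capable model (the instruct model name below is an assumption, not something started above; it must match the server's --model or --served-model-name):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The model name must match what the server was launched with
chat = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "What is vLLM?"}],
)
print(chat.choices[0].message.content)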

Runtime Environment#

Architecture#

LLMEngine architecture diagram (from the vLLM docs): https://docs.vllm.ai/en/latest/_images/llm_engine.excalidraw.png

References#