Reference: https://github.com/vllm-project/vllm

# 1. Introduction to vLLM

Large language models (LLMs) promise to fundamentally change how we use AI across every industry. In practice, however, deploying these models is challenging, and serving them can be surprisingly slow even on expensive hardware.

vLLM is an open-source library for fast LLM inference and serving. It is built around PagedAttention, a new attention algorithm that efficiently manages the attention keys and values (the KV cache).

Equipped with PagedAttention, vLLM sets a new standard for LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any changes to the model architecture.

# 2. How PagedAttention Works

The performance of LLM serving is bottlenecked by memory. During autoregressive decoding, all the tokens fed into the LLM produce attention key and value tensors, and these tensors are kept in GPU memory in order to generate the next tokens. The cached key and value tensors are commonly referred to as the KV cache. The KV cache is:

• Large: in LLaMA-13B, the KV cache of a single sequence can take up to 1.7 GB (see the back-of-the-envelope estimate after this list).
• Dynamic: its size depends on the sequence length, which is highly variable and unpredictable. Managing the KV cache efficiently is therefore a significant challenge, and existing systems waste 60% to 80% of this memory due to fragmentation and over-reservation.
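
To see where a figure like 1.7 GB comes from, here is a rough back-of-the-envelope estimate. The layer count and hidden size below come from the public LLaMA-13B configuration; the fp16 dtype and the 2048-token sequence length are assumptions made for the sake of the arithmetic, so treat the result as an order-of-magnitude check rather than an exact accounting.

```python
# Rough KV-cache size estimate for one LLaMA-13B sequence (illustrative).
num_layers = 40        # transformer layers in LLaMA-13B
hidden_size = 5120     # model dimension (= num_heads * head_dim)
bytes_per_elem = 2     # fp16
seq_len = 2048         # assumed maximum sequence length

# Every token stores one key vector and one value vector per layer.
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_elem
kv_bytes_per_seq = kv_bytes_per_token * seq_len

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")   # 800 KiB
print(f"KV cache per sequence: {kv_bytes_per_seq / 1e9:.2f} GB")    # ~1.7 GB
```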

To address this problem, vLLM introduces PagedAttention, an attention algorithm inspired by the classic idea of virtual memory and paging in operating systems. Unlike traditional attention algorithms, PagedAttention allows contiguous keys and values to be stored in non-contiguous memory. Concretely, PagedAttention partitions each sequence's KV cache into blocks, where each block holds the keys and values of a fixed number of tokens. During the attention computation, the PagedAttention kernel identifies and fetches these blocks efficiently.

Because the blocks do not need to be contiguous in memory, keys and values can be managed in a much more flexible way, just like virtual memory in an operating system: think of blocks as pages, tokens as bytes, and sequences as processes. A sequence's contiguous logical blocks are mapped to non-contiguous physical blocks through a block table, and physical blocks are allocated on demand as new tokens are generated.
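
The mapping can be pictured with a small sketch. The code below is not vLLM's actual block manager; the `BlockTable` and `Allocator` classes and the block size of 16 are illustrative assumptions that only demonstrate how logical blocks map to arbitrary physical blocks, with new physical blocks allocated on demand.

```python
# Illustrative block-table bookkeeping in the spirit of PagedAttention.
BLOCK_SIZE = 16  # tokens per KV block (assumed value)

class Allocator:
    """Hands out free physical block ids from a fixed GPU block pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()  # any free block will do; no contiguity required

class BlockTable:
    """Maps a sequence's logical block index -> physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.physical_blocks = []   # entry i holds the physical id of logical block i
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the last one is full,
        # so waste is confined to the tail of the final block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.physical_blocks.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = Allocator(num_blocks=1024)
seq = BlockTable(allocator)
for _ in range(40):              # decode 40 tokens
    seq.append_token()
print(seq.physical_blocks)       # 3 physical block ids drawn from the free pool
```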

With PagedAttention, memory waste only occurs in the last block of a sequence. In practice this yields near-optimal memory usage, with less than 4% waste. The gain in memory efficiency pays off directly: the system can batch more sequences together, GPU utilization rises, and throughput increases significantly.

PagedAttention has another key advantage: efficient memory sharing. In parallel sampling, for example, multiple output sequences are generated from the same prompt. In this case, the computation and memory for the prompt can be shared across the output sequences.

PagedAttention enables this sharing naturally through its block table. Just as processes share physical pages, different sequences can share blocks by mapping their logical blocks to the same physical block. To make sharing safe, PagedAttention tracks the reference count of each physical block and implements a copy-on-write mechanism.
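
Continuing the toy block table above, the sketch below illustrates the reference-counting and copy-on-write idea: forking a sequence from a shared prompt only copies the logical block table and bumps the reference counts, and a physical block is duplicated only when a sequence wants to write into a block that other sequences still reference. This is a simplified illustration of the mechanism, not vLLM's actual implementation.

```python
# Toy reference counting and copy-on-write over shared KV blocks (illustration only).
ref_count = {}                      # physical block id -> number of sequences using it

def fork(block_table):
    """Share all physical blocks of a prompt with a new output sequence."""
    for block in block_table:
        ref_count[block] = ref_count.get(block, 1) + 1
    return list(block_table)        # the child gets its own logical table, same physical blocks

def write_last_block(block_table, allocate):
    """Copy-on-write: duplicate the last block before appending to it if it is shared."""
    last = block_table[-1]
    if ref_count.get(last, 1) > 1:
        new_block = allocate()      # grab a fresh physical block
        # ... copy the KV contents of `last` into `new_block` on the GPU ...
        ref_count[last] -= 1
        ref_count[new_block] = 1
        block_table[-1] = new_block
    return block_table

# Two samples generated from the same prompt share its blocks and
# diverge only by copying the block they write into.
parent = [7, 8, 9]                  # physical blocks holding the prompt's KV cache
child = fork(parent)
free_ids = iter(range(100, 200))
child = write_last_block(child, allocate=lambda: next(free_ids))
print(parent, child)                # [7, 8, 9] [7, 8, 100]
```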

PagedAttention's memory sharing greatly reduces the memory overhead of complex sampling algorithms such as parallel sampling and beam search, cutting their memory usage by up to 55%. This can translate into up to a 2.2x improvement in throughput, making these sampling methods practical in LLM serving.

PagedAttention is the core technology behind vLLM, an inference and serving engine for large language models that supports a wide range of models with high performance and an easy-to-use interface.

# 3. Getting Started with vLLM

Install vLLM with the following command:

pip install vllm

vLLM can be used for both offline inference and online serving. For offline inference, import vLLM in a Python script and use the LLM class:

from vllm import LLM, SamplingParams

# A batch of example prompts to complete.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Sampling settings: temperature plus nucleus (top-p) sampling.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Load the model; the first run downloads the weights from the HuggingFace Hub.
llm = LLM(model="facebook/opt-125m")
# Generate completions for all prompts in a single batch.
outputs = llm.generate(prompts, sampling_params)
# Each output carries the original prompt and its generated continuations.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Running the script produces output like the following:

(vllm) ember@ember-Victus-by-HP-Laptop:~/project/python/1_vllm$ python main.py 
INFO 10-30 17:14:44 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 10-30 17:14:45 model_runner.py:1056] Starting to load model facebook/opt-125m...
INFO 10-30 17:14:45 weight_utils.py:243] Using model weights format ['*.bin']
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
/home/ember/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/weight_utils.py:425: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.61it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.61it/s]
INFO 10-30 17:14:46 model_runner.py:1067] Loading model weights took 0.2389 GB
INFO 10-30 17:14:46 gpu_executor.py:122] # GPU blocks: 4774, # CPU blocks: 7281
INFO 10-30 17:14:46 gpu_executor.py:126] Maximum concurrency for 2048 tokens per request: 37.30x
INFO 10-30 17:14:48 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-30 17:14:48 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 10-30 17:14:58 model_runner.py:1523] Graph capturing finished in 10 secs.
Processed prompts: 100%|████| 4/4 [00:00<00:00, 29.75it/s, est. speed input: 193.42 toks/s, output: 476.10 toks/s]
Prompt: 'Hello, my name is', Generated text: " Joel. I'm a 24 year old software developer and I'm looking for a"
Prompt: 'The president of the United States is', Generated text: ', in my opinion, the worst person ever to be president.\n> In'
Prompt: 'The capital of France is', Generated text: ' wrong because a country is full of self-doubts about its own future'
Prompt: 'The future of AI is', Generated text: ' in AI and a lot of it is just making sure everyone has an AI that'
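
The script above covers offline inference; for the online serving mentioned earlier, vLLM also provides an OpenAI-compatible HTTP server, commonly launched with `python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m` (recent versions also offer `vllm serve facebook/opt-125m`). The client below is a minimal sketch that assumes such a server is already listening on the default port 8000; check the vLLM documentation for the exact flags and defaults of your version.

```python
# Query a running vLLM OpenAI-compatible server (assumed at localhost:8000).
import json
import urllib.request

payload = {
    "model": "facebook/opt-125m",
    "prompt": "Hello, my name is",
    "max_tokens": 16,
    "temperature": 0.8,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# Print the first completion returned by the server.
print(result["choices"][0]["text"])
```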