KV Cache Core Concepts

Why is a KV cache needed?

bytes_per_token = 2 × n_layers × n_kv_heads × head_dim × bytes_per_elem

Example: Llama3-70B (GQA, n_layers=80, n_kv_heads=8, head_dim=128, bf16 = 2 B per element) → 2 × 80 × 8 × 128 × 2 = 327,680 B = 320 KiB/token
128K-token context → about 40 GiB per session → paging, compression, and eviction become necessary
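
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function names and the `llama3_70b` dictionary are made up for illustration, and the shape values are the ones quoted in the example, not read from any model config.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """KV bytes stored per token: one K and one V vector (factor 2) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_bytes_per_sequence(context_len, **shape):
    """KV bytes held by a single sequence of context_len tokens."""
    return context_len * kv_bytes_per_token(**shape)


# Shape values taken from the example above (assumed, not loaded from a config).
llama3_70b = dict(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

per_token = kv_bytes_per_token(**llama3_70b)
per_session = kv_bytes_per_sequence(128 * 1024, **llama3_70b)

print(per_token, "B per token =", per_token / 1024, "KiB")
print(per_session / 1024**3, "GiB per 128K-token session")
```

Swapping in n_kv_heads=64 (the same model without GQA, as a hypothetical) multiplies both numbers by 8, which is why grouped-query attention matters so much for cache size.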

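Phase     Tokens processed        Bottleneck      KV behavior
-------   ---------------------   -------------   ------------------------------------
Prefill   T_input (all at once)   Compute-bound   Writes the KV for all input tokens in one pass
Decode    1 per step              Memory-bound    Reads the full KV and appends 1 row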
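
The difference between the two phases can be made concrete by counting KV rows instead of storing real tensors. The sketch below is a toy model: `ToyKVCache` and its counters are hypothetical names, and it assumes prefill only writes to the cache while each decode step appends its own row and then reads the whole cache.

```python
class ToyKVCache:
    """Counts KV rows written and read; it stores no real tensors."""

    def __init__(self):
        self.rows = 0          # tokens currently held in the cache
        self.rows_written = 0  # cumulative KV rows written
        self.rows_read = 0     # cumulative KV rows read by decode steps

    def prefill(self, t_input):
        # All prompt tokens are processed in one pass and written together.
        self.rows += t_input
        self.rows_written += t_input

    def decode_step(self):
        # The new token appends its own K/V row, then attends over the full cache.
        self.rows += 1
        self.rows_written += 1
        self.rows_read += self.rows


cache = ToyKVCache()
cache.prefill(t_input=4)
for _ in range(3):
    cache.decode_step()

print(cache.rows, cache.rows_written, cache.rows_read)
```

Writes grow linearly with the number of tokens, while decode reads grow with the sum of all sequence lengths seen so far. Each decode step does very little arithmetic per byte it reads, which is consistent with the memory-bound label in the table.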

Request arrives → allocate blocks → Prefill fills → Decode appends → request ends → free blocks
                                                                     ↑ the prefix may be kept in cache and reused
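
One way to picture this lifecycle is a small block pool with a prefix index. The sketch below is illustrative only and is not the API of any particular serving framework: `BlockPool`, `BLOCK_SIZE`, and all method names are hypothetical. It assumes only full blocks are shared, that a block is keyed by the entire token prefix it completes, and that idle cached blocks are evicted only when no free block is left.

```python
BLOCK_SIZE = 4  # tokens per block (hypothetical value)


class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refs = {}       # block id -> number of live requests using it
        self.by_prefix = {}  # tuple of prefix tokens -> block id (full blocks only)
        self.key_of = {}     # block id -> its prefix key, if it is cached

    def _take(self):
        if not self.free:
            # Evict one cached block that no live request is using.
            for blk, key in list(self.key_of.items()):
                if self.refs.get(blk, 0) == 0:
                    del self.key_of[blk]
                    del self.by_prefix[key]
                    self.free.append(blk)
                    break
            else:
                raise MemoryError("no free or evictable KV blocks")
        return self.free.pop()

    def allocate(self, prompt):
        """Prefill: return (block ids, number of prefix-cache hits)."""
        blocks, hits = [], 0
        for start in range(0, len(prompt), BLOCK_SIZE):
            end = start + BLOCK_SIZE
            key = tuple(prompt[:end]) if end <= len(prompt) else None
            blk = self.by_prefix.get(key) if key is not None else None
            if blk is None:
                blk = self._take()
                if key is not None:  # a partial last block stays private
                    self.by_prefix[key] = blk
                    self.key_of[blk] = key
            else:
                hits += 1
            self.refs[blk] = self.refs.get(blk, 0) + 1
            blocks.append(blk)
        return blocks, hits

    def append(self, blocks, seq_len):
        """Decode: take a new block only when the last one is already full."""
        if seq_len % BLOCK_SIZE == 0:
            blk = self._take()
            self.refs[blk] = 1
            blocks.append(blk)

    def release(self, blocks):
        """Request end: free private blocks, keep cached prefix blocks around."""
        for blk in blocks:
            self.refs[blk] -= 1
            if self.refs[blk] == 0 and blk not in self.key_of:
                self.free.append(blk)


pool = BlockPool(num_blocks=8)
system_prompt = list(range(8))  # 8 tokens = 2 full blocks

a_blocks, a_hits = pool.allocate(system_prompt + [100, 101])
pool.append(a_blocks, seq_len=10)  # last block still has room, nothing allocated
pool.release(a_blocks)

b_blocks, b_hits = pool.allocate(system_prompt + [200])
print(a_hits, b_hits, a_blocks[:2] == b_blocks[:2])
```

The second request shares the two blocks that hold the common prefix, so only its last, partial block needs fresh prefill work. Keying each block by the whole prefix rather than by its own four tokens keeps a hit valid only when everything before it matches too.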
