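Technical Notes

This is the main body of the knowledge base, aimed at understanding why things are designed the way they are. Where the math dictionary answers "what is the formula", these notes answer "why is it engineered this way, what does it cost, and what are the alternatives".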

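Suggested reading order

Goal / Suggested starting point / What to read next

Get the main inference path working first: KV Cache -> Inference -> Serving

Fill in system design first: Distributed -> Frameworks -> Serving

Fill in training and alignment first: RL Infra -> Frontier

Go from formula to implementation first: then cross-check against src/attention/*.py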

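attention/formula-to-code-walkthrough.md: maps Attention / GQA / RoPE / RMSNorm / FlashAttention to the repository code, section by section.
kv-cache/formula-to-code-walkthrough.md: maps KV capacity accounting, PagedAttention, quantization, and eviction to the source code, section by section.
kv-compression/formula-to-code-walkthrough.md: maps KV quantization, its error, and the H2O / SnapKV selection rules to the compression source code.
kv-eviction/formula-to-code-walkthrough.md: maps the scoring functions of LRU / LFU / Fair quota to the eviction-policy source code.
serving/formula-to-code-walkthrough.md: maps TTFT / TPOT / Goodput / scheduling policies to the metrics and scheduling code.
serving/queueing-slo-formula-to-code-walkthrough.md: maps Little's law, M/M/1, Erlang C, and M/G/1 to the queueing simulation code.
distributed/moe-formula-to-code-walkthrough.md: maps the MoE router, capacity, drop rate, and All-to-All to the simulator.
attention/mha-vs-gqa-full-derivation.md: why GQA can substantially reduce Decode bandwidth.
attention/mha-vs-mla-full-derivation.md: MLA's matrix absorption and latent-space compression.
attention/mha-vs-dsa-full-derivation.md: DSA's sparse selection and two-stage structure.
attention/mha-vs-linear-attention-full-derivation.md: how linear attention compresses its state to a constant size.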

Last updated 13 days ago