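Code Implementation
This directory holds minimal runnable implementations. Its layout follows a "formulas first, code follows" principle: wherever possible, each core script can be explained directly by a page of mathematical derivation or by a topic walkthrough.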
Recommended reading order
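Start with the walkthroughs in notes/ to establish the formulas and tensor shapes.
Then come back here and read the minimal implementations, matching variable names and function boundaries to the formulas.
Finally, run the unit tests in tests/ to confirm that the formulas really hold in the code.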
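To make the last step concrete, here is a minimal sketch of a test that checks a formula against code. It uses RMSNorm, written as y = x / sqrt(mean(x^2) + eps) * gain over the last axis. The names `rms_norm` and `test_rms_norm_matches_formula` are hypothetical and chosen only for illustration. They are not taken from src/attention/rope_rmsnorm.py or tests/, whose actual function names and signatures may differ.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Hypothetical reference: y = x / sqrt(mean(x^2) + eps) * gain, over the last axis."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def test_rms_norm_matches_formula():
    rng = np.random.default_rng(0)
    x = rng.standard_normal((2, 5, 8))  # assumed shape: (batch, seq_len, hidden)
    gain = np.ones(8)
    y = rms_norm(x, gain)
    # Normalization must not change the tensor shape.
    assert y.shape == x.shape
    # With unit gain, every normalized vector has root-mean-square close to 1.
    out_rms = np.sqrt(np.mean(y * y, axis=-1))
    assert np.allclose(out_rms, 1.0, atol=1e-4)

test_rms_norm_matches_formula()
```

The test asserts a property that follows directly from the formula rather than comparing against stored numbers, which keeps it readable next to the derivation.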
Key source map
Minimal NumPy implementations of MHA / MQA / GQA
python src/attention/mha_gqa.py
RoPE cache, rotation, and RMSNorm
python src/attention/rope_rmsnorm.py
Online softmax simulation for FlashAttention
python src/attention/flash_attn_sim.py
Block allocator, paged KV cache, and copy-on-write
python src/kv_cache/core.py
Symmetric / asymmetric per-channel quantization of the KV cache
python src/kv_cache/compression/quantizer.py
H2O / SnapKV-style token selection and compression ratio
python src/kv_cache/compression/sparsifier.py
LRU, LFU, and fair-quota eviction
python src/kv_cache/eviction/policies.py
Continuous batching and decode-priority scheduling
python src/simulators/scheduler.py
TTFT, TPOT, goodput, and batch utilization
python src/simulators/serving_metrics.py
Little's law, M/M/1, Erlang C, M/G/1, and working backward from an SLO
python src/simulators/queueing_slo.py
MoE router, capacity, dispatch, and drop rate
python src/simulators/moe_routing.py
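As one illustration of the kind of minimal implementation the map above points to, the online softmax behind FlashAttention can be sketched for a single query vector as follows. This is a sketch under stated assumptions, not the contents of src/attention/flash_attn_sim.py: the function names `attention_reference` and `attention_online` and the `block_size` parameter are hypothetical. The idea is to keep a running max, a running normalizer, and an unnormalized output, and to rescale all three whenever a new block raises the max.

```python
import numpy as np

def attention_reference(q, k, v):
    """Plain softmax attention for one query: softmax(k q / sqrt(d)) v."""
    scores = k @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    return (weights / weights.sum()) @ v

def attention_online(q, k, v, block_size=4):
    """Same result, but keys and values are consumed block by block."""
    d = q.shape[0]
    running_max = -np.inf        # largest score seen so far
    running_sum = 0.0            # sum of exp(score - running_max) so far
    acc = np.zeros(v.shape[1])   # unnormalized weighted sum of values
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = k_blk @ q / np.sqrt(d)
        new_max = max(running_max, scores.max())
        # Rescale the old statistics to the new max before adding this block.
        scale = np.exp(running_max - new_max)
        p = np.exp(scores - new_max)
        running_sum = running_sum * scale + p.sum()
        acc = acc * scale + p @ v_blk
        running_max = new_max
    return acc / running_sum

rng = np.random.default_rng(0)
q = rng.standard_normal(8)         # (head_dim,)
k = rng.standard_normal((10, 8))   # (num_keys, head_dim)
v = rng.standard_normal((10, 3))   # (num_keys, value_dim)
assert np.allclose(attention_online(q, k, v), attention_reference(q, k, v))
```

The blockwise version never materializes the full score vector, yet it matches the reference up to floating-point error, which is the property a simulation of this kind is meant to demonstrate.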
Shortest paths from each topic back to the source
Attention: start with ../notes/attention/formula-to-code-walkthrough.md, then read attention/.
KV Cache: start with ../notes/kv-cache/formula-to-code-walkthrough.md, then read kv_cache/.
KV Compression: start with ../notes/kv-compression/formula-to-code-walkthrough.md, then read kv_cache/compression/.
KV Eviction: start with ../notes/kv-eviction/formula-to-code-walkthrough.md, then read kv_cache/eviction/.
Serving: start with ../notes/serving/formula-to-code-walkthrough.md and ../notes/serving/queueing-slo-formula-to-code-walkthrough.md, then read simulators/.
MoE: start with ../notes/distributed/moe-formula-to-code-walkthrough.md, then read simulators/moe_routing.py.