Capacity Planning
Core Formulas
1. Memory per request
KV_mem = 2 × n_layers × n_kv_heads × head_dim × dtype_bytes × seq_len
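As a rough illustration of the KV_mem formula, here is a minimal Python sketch. The helper name kv_cache_bytes is hypothetical. The architecture values are those of Llama 3 70B (80 layers, 8 KV heads under grouped-query attention, head dimension 128); the FP16 cache (2 bytes per element) and the 4096-token sequence length are assumptions chosen only for the example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, dtype_bytes, seq_len):
    """KV-cache size in bytes for one request.

    The leading 2 accounts for storing both keys and values for every
    layer, KV head, and token position.
    """
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len


# Llama 3 70B shape; FP16 cache and seq_len=4096 are assumed for illustration.
per_token = kv_cache_bytes(80, 8, 128, 2, 1)
per_request = kv_cache_bytes(80, 8, 128, 2, 4096)

print(f"per token:   {per_token / 2**10:.0f} KiB")
print(f"per request: {per_request / 2**30:.2f} GiB")
```

Under these assumptions the cache costs 320 KiB per token and 1.25 GiB per 4096-token request, which is why seq_len usually dominates the per-request budget.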
Model_mem = N_params × dtype_bytes
2. Maximum concurrency per GPU
max_concurrent = floor((GPU_mem - Model_mem - OS_overhead) / KV_mem_per_request)
3. Throughput estimation
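throughput = batch_size / TPOT = batch_size × output_tokens_per_second
(throughput is in tokens per second; TPOT is the time per output token, and output_tokens_per_second = 1 / TPOT is the decode rate of a single request)
4. Number of GPUs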
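N_gpu = target_QPS × avg_latency / batch_size
      = target_QPS / per_gpu_throughput
(per_gpu_throughput here is in requests per second, i.e. batch_size / avg_latency; round N_gpu up to a whole GPU)
Example: a Llama3-70B service at 1000 QPS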
Step | Calculation
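The four formulas can be chained into one calculation. The sketch below is illustrative only: the function names are hypothetical, and every input except the Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128, about 70B parameters) is an assumption rather than a value from this page. In particular it assumes 4-bit weights (0.5 bytes per parameter) so the model fits on a single 80 GB GPU, since FP16 weights would need about 140 GB; an FP16 KV cache; 2048-token sequences; 5 GB of overhead; a TPOT of 50 ms; and 200 output tokens per request with prefill time ignored.

```python
import math


def kv_bytes_per_request(n_layers, n_kv_heads, head_dim, dtype_bytes, seq_len):
    # Step 1: keys and values (factor 2) for every layer, KV head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len


def max_concurrent(gpu_mem, model_mem, overhead, kv_mem):
    # Step 2: whole requests that fit in the memory left after weights and overhead.
    return int((gpu_mem - model_mem - overhead) // kv_mem)


def gpus_needed(target_qps, avg_latency_s, batch_size):
    # Step 4: requests in flight (QPS x latency) divided by requests per GPU.
    return math.ceil(target_qps * avg_latency_s / batch_size)


# All values below are assumptions for illustration (see the paragraph above).
kv_mem = kv_bytes_per_request(80, 8, 128, dtype_bytes=2, seq_len=2048)
model_mem = 70e9 * 0.5            # 4-bit weights
batch = max_concurrent(gpu_mem=80e9, model_mem=model_mem, overhead=5e9, kv_mem=kv_mem)

tpot_s = 0.05                     # assumed time per output token
tokens_per_s = batch / tpot_s     # Step 3: decode throughput per GPU
avg_latency_s = 200 * tpot_s      # assumed 200 output tokens, prefill ignored

n_gpu = gpus_needed(target_qps=1000, avg_latency_s=avg_latency_s, batch_size=batch)
print(batch, round(tokens_per_s), n_gpu)
```

With these assumed inputs each GPU holds 59 concurrent requests and the service needs 170 GPUs. The result is very sensitive to seq_len, weight precision, and latency, so real planning should substitute measured values.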
Capacity planning checklist
One-sentence interview answer