SC25 gLLM

gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling

Several parallelism methods

Trying to eliminate pipeline bubbles:

There are currently two kinds of imbalance in LLM inference:
Inter-stage imbalance
inter-stage dependency, where a stage cannot begin computation
until the preceding stage completes

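For example, take two GPUs working as a pipeline. In example 1, GPU1 computes micro-batch 1 and passes it to GPU2; GPU2 then sits idle, waiting for GPU1 to finish micro-batch 2. In example 2, GPU2 has computed micro-batch 1, but GPU1 cannot start micro-batch 3 until the result of 1 has been passed back.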
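Example 1 above can be reproduced with a small simulation. This is only a sketch for these notes, not code from gLLM: the function name `simulate_pipeline` and the stage times are made up for illustration. A stage starts a micro-batch once the previous stage has finished it and the stage itself is free.

```python
def simulate_pipeline(stage_times):
    """Simulate a forward-only pipeline.

    stage_times[s][m] is the (hypothetical) time stage s spends on
    micro-batch m. Returns (finish, idle), where finish[s][m] is the
    completion time and idle[s] is the time stage s spent waiting
    between consecutive micro-batches (warm-up is not counted).
    """
    num_stages = len(stage_times)
    num_batches = len(stage_times[0])
    finish = [[0.0] * num_batches for _ in range(num_stages)]
    idle = [0.0] * num_stages

    for m in range(num_batches):
        for s in range(num_stages):
            # Input arrives when the previous stage finishes this micro-batch.
            prev_stage_done = finish[s - 1][m] if s > 0 else 0.0
            # The stage is free when it finishes its previous micro-batch.
            stage_free_at = finish[s][m - 1] if m > 0 else 0.0
            start = max(prev_stage_done, stage_free_at)
            if m > 0:
                idle[s] += start - stage_free_at  # the bubble on this stage
            finish[s][m] = start + stage_times[s][m]
    return finish, idle


# GPU1 needs 1 unit for micro-batch 1 but 3 units for micro-batch 2;
# GPU2 needs 1 unit for each (made-up numbers).
finish, idle = simulate_pipeline([[1, 3], [1, 1]])
print("idle time per stage:", idle)
```

With these numbers GPU2 finishes micro-batch 1 at time 2 but cannot start micro-batch 2 until time 4, so it idles for 2 units while GPU1 never waits.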

Inter-batch imbalance
where the number of concurrent micro-batches is limited by the pipeline depth.
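A rough way to see why the number of in-flight micro-batches matters: in autoregressive decoding, a micro-batch can re-enter the first stage only after its previous step has left the last stage. The sketch below is my own toy model, not the paper's scheduler; `decode_utilization` and the uniform `step_time` are assumptions. It measures how busy the stages are for a given number of micro-batches.

```python
def decode_utilization(num_stages, num_microbatches, iterations, step_time=1.0):
    """Fraction of time the stages are busy in a toy decode loop.

    Each micro-batch runs `iterations` decode steps; every step passes
    through all stages in order, and a step can start only after the
    same micro-batch's previous step has left the last stage.
    All stage times are equal to `step_time` (a simplifying assumption).
    """
    stage_free = [0.0] * num_stages       # when each stage becomes free
    ready = [0.0] * num_microbatches      # when each micro-batch may start its next step
    busy = 0.0

    for _ in range(iterations):
        for m in range(num_microbatches):
            t = ready[m]
            for s in range(num_stages):
                start = max(t, stage_free[s])
                t = start + step_time
                stage_free[s] = t
                busy += step_time
            ready[m] = t                  # next step must wait for this result

    makespan = max(stage_free)
    return busy / (num_stages * makespan)


# Two stages: one micro-batch versus two micro-batches in flight.
print(decode_utilization(num_stages=2, num_microbatches=1, iterations=4))
print(decode_utilization(num_stages=2, num_microbatches=2, iterations=4))
```

With a single micro-batch, only one of the two stages is ever busy, so utilization is one half. With as many micro-batches as stages, the stages overlap and utilization is much higher, although any difference in per-step time would reintroduce waiting like that in example 2.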

PP (pipeline parallelism): stages only need to pass their result values on to the next stage.