Author: Junhui He, Email: junhuihe.hjh@outlook.com
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- DP
- Each GPU maintains a complete replica of the weights, gradients, and optimizer state. Throughout these notes, n is the parameter count, d the DP degree, and x the activation footprint. A minimal sketch follows this block.
- Memory:
- Weight: n
- Activation: x
- Gradient: n
- Full precision weight: n
- First-order momentum: n
- Second-order momentum: n
- Communication:
- Gradient: 2n (all-reduce)
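A minimal PyTorch sketch of the plain-DP gradient step above (my own illustration, not the paper's code). It assumes `torch.distributed` is already initialized (e.g. via `torchrun`) and that every rank holds an identical replica of `model`; `allreduce_gradients` is a hypothetical helper name.

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all d replicas after backward().

    A ring all-reduce moves roughly 2n elements per rank, matching the
    2n communication figure above.
    """
    d = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(d)  # sum -> average
```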
- ZeRO-1:
- Each GPU keeps a full replica of the weights and gradients, while the optimizer state is partitioned across all GPUs. Note that the optimizer state is sharded tensor-wise (TP-style) rather than layer-wise (MP-style), so that every rank moves the same volume during the reduce-scatter; see the sketch after this block.
- Memory:
- Weight: n
- Activation: x
- Gradient: n
- Full precision weight: n/d
- First-order momentum: n/d
- Second-order momentum: n/d
- Communication:
- Gradient: n (reduce-scatter)
- Updated weight: n (all-gather)
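A minimal ZeRO-1 step sketch (my own illustration, assuming a recent PyTorch 2.x). It assumes the fp16 weights and gradients are flattened into 1-D buffers `flat_params`/`flat_grads` whose length is a multiple of the world size, and a hypothetical `opt_step` callback that applies Adam to this rank's fp32 master shard; all names are mine.

```python
import torch
import torch.distributed as dist

def zero1_step(flat_params, flat_grads, master_shard, opt_step):
    d = dist.get_world_size()
    shard_len = flat_grads.numel() // d
    # 1) Reduce-scatter: each rank receives the summed gradient for its
    #    own 1/d slice of the flat buffer (~n volume per rank).
    grad_shard = torch.empty(shard_len, dtype=flat_grads.dtype,
                             device=flat_grads.device)
    dist.reduce_scatter_tensor(grad_shard, flat_grads)
    grad_shard.div_(d)
    # 2) Local optimizer step on the fp32 master shard; the optimizer
    #    states (master weights, two momenta) are each n/d per rank.
    opt_step(master_shard, grad_shard.float())
    # 3) All-gather the updated fp16 shards back into the full weight
    #    buffer on every rank (~n volume per rank).
    dist.all_gather_into_tensor(flat_params,
                                master_shard.to(flat_params.dtype))
```

Flattening everything into one buffer is what makes the tensor-wise (TP-style) split even, which is why every rank moves the same volume.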
- ZeRO-2:
- Each GPU keeps only a full replica of the weights; as soon as a gradient tensor is computed during backward, it is reduced to the GPU that owns the corresponding optimizer-state shard and freed everywhere else (sketch after this block).
- Memory:
- Weight: n
- Activation: x
- Gradient: n/d
- Full precision weight: n/d
- First-order momentum: n/d
- Second-order momentum: n/d
- Communication:
- Gradient: n (per-tensor reduce-scatter, issued as gradients become ready)
- Updated weight: n (all-gather)
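A minimal sketch of ZeRO-2's eager gradient handoff (again my own illustration). It assigns each whole parameter tensor to one owner rank via a hypothetical round-robin scheme; real implementations reduce-scatter flat buckets instead, but the memory effect is the same: non-owners free gradients immediately, keeping ~n/d gradient memory per rank.

```python
import torch
import torch.distributed as dist

def reduce_grads_to_owners(model: torch.nn.Module) -> None:
    d = dist.get_world_size()
    rank = dist.get_rank()
    for i, p in enumerate(model.parameters()):
        if p.grad is None:
            continue
        owner = i % d  # hypothetical round-robin ownership
        # Reduce this tensor's gradient onto its owner rank.
        dist.reduce(p.grad, dst=owner, op=dist.ReduceOp.SUM)
        if rank == owner:
            p.grad.div_(d)  # keep the averaged gradient for the local shard
        else:
            p.grad = None   # free it: non-owners never need this gradient
```

In practice the reduction runs inside backward hooks (e.g. `Tensor.register_post_accumulate_grad_hook`) so it overlaps with the rest of backward, which is what "sent immediately after computation" means above.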
- ZeRO-3:
- The weights themselves are partitioned TP-style across the machines as well; an all-gather re-materializes each weight from its shards on every machine during the forward pass, and again during backward (sketch after this block).
- Memory:
- Weight: n/d
- Activation: x
- Gradient: n/d
- Full precision weight: n/d
- First-order momentum: n/d
- Second-order momentum: n/d
- Communication:
- Weight: 2n (all-gather, once in forward and once in backward)
- Gradient: n (reduce-scatter); the updated weights stay sharded, so no extra broadcast is needed
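A minimal ZeRO-3 forward sketch (illustrative, with hypothetical helper names): each rank persistently stores only a 1/d shard of every weight and re-materializes the full tensor just in time.

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist

def gather_full_weight(shard: torch.Tensor) -> torch.Tensor:
    """All-gather a full flat weight from its per-rank shards (~n volume
    per pass; ZeRO-3 pays this once in forward and again in backward)."""
    d = dist.get_world_size()
    full = torch.empty(shard.numel() * d, dtype=shard.dtype,
                       device=shard.device)
    dist.all_gather_into_tensor(full, shard)
    return full

def sharded_linear(x, weight_shard, out_features, in_features):
    # Rebuild the full weight just in time, use it, then drop it so the
    # persistent footprint stays at n/d (the slice trims any padding).
    w = gather_full_weight(weight_shard)[: out_features * in_features]
    w = w.view(out_features, in_features)
    y = F.linear(x, w)
    del w  # the full copy is transient
    return y
```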
- ZeRO-R:
- When tensor parallelism is in use, each activation checkpoint is partitioned so that every slice lives on exactly one machine of the TP group, eliminating the redundant replicas (loose sketch below).
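A loose sketch of this partitioned activation checkpointing (not the paper's code), assuming `tp_group` is the tensor-parallel process group, the saved activation is identical on every rank of that group before partitioning, and its element count divides evenly by the group size; all names are mine.

```python
import torch
import torch.distributed as dist

def partition_activation(act: torch.Tensor, tp_group):
    """Keep only this rank's 1/t slice of an otherwise replicated checkpoint."""
    t = dist.get_world_size(tp_group)
    rank = dist.get_rank(tp_group)
    return act.reshape(-1).chunk(t)[rank].clone()  # clone lets `act` be freed

def gather_activation(slice_, shape, tp_group):
    """Rebuild the full activation from its slices when recomputation needs it."""
    t = dist.get_world_size(tp_group)
    full = torch.empty(slice_.numel() * t, dtype=slice_.dtype,
                       device=slice_.device)
    dist.all_gather_into_tensor(full, slice_, group=tp_group)
    return full.view(shape)
```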