
Distributed Training Research

LLM Training

  • NSDI 24, MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
  • ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
  • Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
  • Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
  • GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Gradient Compression

Model Parallelism

In-network Aggregation

See this post for details.

Asynchronous Training

  • Developing a Loss Prediction-based Asynchronous Stochastic Gradient Descent Algorithm for Distributed Training of Deep Neural Networks: this paper introduces the asynchronous update pattern. The parameter server apparently has no concrete threshold for deciding when to perform an asynchronous update, so the paper proposes an algorithm to compensate for the lost gradient value.
  • ICDCS 19, Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning: proposes an adaptive method for determining the staleness bound (the number of rounds a worker may lag behind).
  • Staleness-aware Async-SGD for Distributed Deep Learning: adjusts the learning rate according to gradient staleness in asynchronous distributed training (see the staleness-scaling sketch after this list).
  • Pathways: Asynchronous Distributed Dataflow for ML: PATHWAYS uses a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane.
  • Accelerating Distributed Reinforcement Learning with In-Switch Computing: this paper uses programmable switches to optimize asynchronous training; in principle the approach could be transferred to deep neural network training. It guarantees convergence only by explicitly setting a staleness bound.
  • FedLesScan: Mitigating Stragglers in Serverless Federated Learning
  • Communication-Efficient Federated Deep Learning With Layerwise Asynchronous Model Update and Temporally Weighted Aggregation: different layers of a model are updated at different frequencies; in general, shallow-layer parameters change more often than deep-layer parameters, so this paper trains asynchronously according to each layer's update frequency. It also proposes a temporally weighted aggregation strategy that exploits previously aggregated local models.
  • AMPNet: Asynchronous Model-Parallel Training for Dynamic Neural Networks
  • INT Lab, FedSA: A Semi-Asynchronous Federated Learning Mechanism in Heterogeneous Edge Computing: in federated learning, node heterogeneity, skewed data distributions, and limited network resources make synchronous training pay a heavy synchronization cost, so recent work turns to asynchronous training. This paper decides how many workers' gradients (k) the parameter server must receive in each round before updating the global model, and chooses k based on the heterogeneity of the edge data distribution (see the first-k aggregation sketch after this list).
  • INT Lab, Adaptive Asynchronous Federated Learning in Resource-Constrained Edge Computing: based on real-time system state such as network resources, this paper determines the fraction of workers to aggregate in each asynchronous training round. Similar to the previous work; since M is determined from global information, it seems of limited reference value.
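
The staleness-handling ideas above (an explicit or adaptive staleness bound, and a learning rate that shrinks with staleness) can be made concrete with a small sketch. This is only a minimal in-process illustration, not any of the papers' implementations; BASE_LR, apply_gradient, and the toy update stream are all assumptions.

    # Minimal sketch of staleness-aware asynchronous SGD (illustrative only).
    # A single in-process "parameter server" receives gradients computed
    # against older model versions and divides the learning rate by each
    # gradient's staleness before applying it.
    import numpy as np

    BASE_LR = 0.1          # hypothetical base learning rate
    model = np.zeros(4)    # toy model parameters
    version = 0            # current global model version on the server

    def apply_gradient(grad, computed_at_version):
        """Apply a possibly stale gradient with a staleness-scaled step size."""
        global model, version
        staleness = max(1, version - computed_at_version + 1)
        lr = BASE_LR / staleness      # older gradients take smaller steps
        model -= lr * grad
        version += 1
        return lr

    # Simulated asynchronous workers: each gradient may be based on an old version.
    for grad, seen_version in [(np.ones(4), 0), (np.ones(4), 0), (np.ones(4), 2)]:
        lr_used = apply_gradient(grad, seen_version)
        print(f"applied update with lr={lr_used:.3f}, model now at version {version}")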

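A similarly rough sketch of the first-k (semi-asynchronous) aggregation described in the FedSA entry above: the server updates the global model as soon as the first K of N client updates arrive. This is not the paper's algorithm; arrival order is simulated with random latencies, K is a fixed constant rather than derived from data heterogeneity, and local_training stands in for real client training.

    # Toy sketch of semi-asynchronous (first-k) aggregation, FedSA-style.
    # Assumptions: K is fixed (FedSA derives it from data heterogeneity),
    # client latency is random, and local_training performs fake local work.
    import random
    import numpy as np

    N_CLIENTS = 5
    K = 3                      # hypothetical: aggregate after the first K arrivals
    global_model = np.zeros(4)

    def local_training(model):
        """Stand-in for real local training: returns (simulated latency, update)."""
        latency = random.uniform(0.1, 2.0)
        return latency, model + 0.1 * np.random.randn(4)

    for rnd in range(3):
        # Run every client, then keep only the K whose updates "arrive" first;
        # the remaining stragglers are simply ignored in this round.
        results = [(c, *local_training(global_model)) for c in range(N_CLIENTS)]
        results.sort(key=lambda r: r[1])         # order by simulated latency
        fastest = results[:K]
        global_model = np.mean([upd for _, _, upd in fastest], axis=0)
        print(f"round {rnd}: aggregated clients {[c for c, _, _ in fastest]}")
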
Straggler Problem

GPU Scheduling

Resource Fragmentation Problem

— Mar 18, 2024