Load Balancing for AI Training Workloads
- URL: http://arxiv.org/abs/2507.21372v1
- Date: Mon, 28 Jul 2025 22:34:18 GMT
- Title: Load Balancing for AI Training Workloads
- Authors: Sarah McClure, Sylvia Ratnasamy, Scott Shenker
- Abstract summary: We investigate the performance of various load balancing algorithms for large-scale AI training workloads that are running on dedicated infrastructure. The performance of load balancing depends on both the congestion control and loss recovery algorithms, so our evaluation also sheds light on the appropriate choices for those designs as well.
- Score: 4.6874900353446325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the performance of various load balancing algorithms for large-scale AI training workloads that are running on dedicated infrastructure. The performance of load balancing depends on both the congestion control and loss recovery algorithms, so our evaluation also sheds light on the appropriate choices for those designs as well.
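The abstract does not name specific algorithms; purely as an illustration of the design space it studies, the sketch below (the flow sizes, link count, and hashing scheme are all assumptions, not the paper's setup) contrasts per-flow ECMP-style hashing with per-packet round-robin spraying and reports the resulting link-load imbalance.

```python
import random

def ecmp_flow_hash(flows, n_links):
    """Per-flow hashing (ECMP-like): each flow is pinned to one link."""
    loads = [0] * n_links
    for flow_id, size in flows:
        loads[hash(flow_id) % n_links] += size
    return loads

def packet_spray(flows, n_links):
    """Per-packet round-robin spraying: each unit of a flow may take a different link."""
    loads = [0] * n_links
    nxt = 0
    for _, size in flows:
        for _ in range(size):
            loads[nxt] += 1
            nxt = (nxt + 1) % n_links
    return loads

if __name__ == "__main__":
    random.seed(0)
    # A few large "elephant" flows, typical of collective traffic in training jobs.
    flows = [(f"flow{i}", random.choice([1, 1, 1, 100])) for i in range(32)]
    n_links = 8
    for name, algo in [("ECMP hash", ecmp_flow_hash), ("packet spray", packet_spray)]:
        loads = algo(flows, n_links)
        print(f"{name:12s} max/mean link load = {max(loads) / (sum(loads) / n_links):.2f}")
```

With heavy-tailed flow sizes, per-flow hashing tends to leave some links far above the mean, while spraying keeps link loads within one unit of each other; the trade-off (reordering, loss recovery) is exactly what couples the load balancer to the transport design.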
Related papers
- Scalability Optimization in Cloud-Based AI Inference Services: Strategies for Real-Time Load Balancing and Automated Scaling [1.3689475854650441]
This study proposes a comprehensive scalability optimization framework for cloud AI inference services. The proposed model is a hybrid approach that combines reinforcement learning for adaptive load distribution and deep neural networks for accurate demand forecasting. Experimental results demonstrate that the proposed model enhances load balancing efficiency by 35 and reduces response delay by 28.
arXiv Detail & Related papers (2025-04-16T04:00:04Z)
- DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development. We present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents. We evaluate our approach on SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z)
- Reinforcement Learning-Based Adaptive Load Balancing for Dynamic Cloud Environments [0.0]
We propose a novel adaptive load balancing framework using Reinforcement Learning (RL) to address these challenges.
Our framework is designed to dynamically reallocate tasks to minimize latency and ensure balanced resource usage across servers.
Experimental results show that the proposed RL-based load balancer outperforms traditional algorithms in terms of response time, resource utilization, and adaptability to changing workloads.
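The abstract gives no algorithmic details; as a minimal illustration of RL-driven task placement (tabular Q-learning here is an assumption for exposition, not the paper's method), the toy below learns to route incoming tasks to the less-loaded of two servers.

```python
import random

def simulate(steps=5000, eps=0.1, alpha=0.1, seed=0):
    """Tabular Q-learning toy: route each incoming task to one of two servers.
    State: index of the currently less-loaded server. Action: chosen server.
    Reward: negative load of the chosen server (a stand-in for latency)."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    loads = [0.0, 0.0]
    for _ in range(steps):
        state = 0 if loads[0] <= loads[1] else 1
        if rng.random() < eps:
            action = rng.randrange(2)          # explore
        else:
            action = max((0, 1), key=lambda a: q[(state, a)])  # exploit
        loads[action] += 1.0                   # task lands on the chosen server
        reward = -loads[action]
        q[(state, action)] += alpha * (reward - q[(state, action)])
        loads = [l * 0.9 for l in loads]       # servers drain work over time
    return q

q = simulate()
policy = {s: max((0, 1), key=lambda a: q[(s, a)]) for s in (0, 1)}
print("learned policy:", policy)  # typically prefers the less-loaded server
```

The reward signal alone pushes the policy toward balance, which is the appeal of RL here: no hand-tuned thresholds, only feedback from observed load.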
arXiv Detail & Related papers (2024-09-07T19:40:48Z)
- Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts [13.413587367600444]
An unbalanced expert load will lead to routing collapse or increased computational overhead.
We propose Loss-Free Balancing, featured by an auxiliary-loss-free load balancing strategy.
We validate the performance of Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens.
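The abstract does not spell out the mechanism; the sketch below follows the general shape of an auxiliary-loss-free scheme as I understand it: a per-expert bias is added to routing scores for top-k selection only, and nudged down for overloaded experts and up for underloaded ones. The update rule, learning rate, and all names here are illustrative assumptions.

```python
import numpy as np

def route_with_bias(scores, bias, k):
    """Select top-k experts per token using biased scores; the bias steers
    selection only and is not part of the gating weights used downstream."""
    biased = scores + bias
    return np.argsort(-biased, axis=1)[:, :k]

def update_bias(bias, topk, n_experts, lr=0.01):
    """Nudge bias: overloaded experts become less attractive, underloaded more."""
    counts = np.bincount(topk.ravel(), minlength=n_experts)
    return bias - lr * np.sign(counts - counts.mean())

# Illustrative loop: routing scores persistently skewed toward expert 0,
# which would collapse routing without the corrective bias.
rng = np.random.default_rng(0)
n_tokens, n_experts, k = 1024, 8, 2
bias = np.zeros(n_experts)
for step in range(200):
    scores = rng.normal(size=(n_tokens, n_experts))
    scores[:, 0] += 1.0
    topk = route_with_bias(scores, bias, k)
    bias = update_bias(bias, topk, n_experts)
print("final per-expert load:", np.bincount(topk.ravel(), minlength=n_experts))
```

Because balance is enforced through selection bias rather than a loss term, there is no gradient interference with the language-modeling objective, which is the point of the "auxiliary-loss-free" framing.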
arXiv Detail & Related papers (2024-08-28T09:31:09Z)
- OmniBal: Towards Fast Instruction-Tuning for Vision-Language Models via Omniverse Computation Balance [65.48009829137824]
Large-scale 3D parallel training on vision-language instruction-tuning models leads to an imbalanced computation load across different devices. We rebalance the computational load from data, model, and memory perspectives, achieving more balanced computation across devices. Our method's efficacy and generalizability are further validated across various models and datasets.
arXiv Detail & Related papers (2024-07-30T12:02:58Z)
- Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations [62.132347451049455]
Scale has become a main ingredient in obtaining strong machine learning models.
In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule.
We show that weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales.
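The weight-averaging idea can be sketched as a running mean of checkpoints along the training trajectory, in the spirit of stochastic weight averaging; the exact scheme in the paper may differ, and this minimal version is only illustrative.

```python
import numpy as np

class WeightAverager:
    """Running average of model weights collected along the training
    trajectory. Uses the incremental-mean update so no checkpoint history
    needs to be stored."""
    def __init__(self):
        self.avg = None
        self.n = 0

    def update(self, weights):
        self.n += 1
        w = np.asarray(weights, dtype=float)
        if self.avg is None:
            self.avg = w.copy()
        else:
            self.avg += (w - self.avg) / self.n  # incremental mean

avg = WeightAverager()
for w in ([1.0, 3.0], [3.0, 5.0], [5.0, 7.0]):
    avg.update(w)
print(avg.avg)  # arithmetic mean of the three checkpoints
```

The appeal is that the averaged model costs nothing extra to train: the averaging runs alongside the normal optimizer updates.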
arXiv Detail & Related papers (2024-05-28T17:33:54Z)
- Switchable Decision: Dynamic Neural Generation Networks [98.61113699324429]
We propose a switchable decision to accelerate inference by dynamically assigning resources for each data instance.
Our method benefits from less cost during inference while keeping the same accuracy.
arXiv Detail & Related papers (2024-05-07T17:44:54Z)
- Overcoming Recency Bias of Normalization Statistics in Continual Learning: Balance and Adaptation [67.77048565738728]
Continual learning involves learning a sequence of tasks and balancing their knowledge appropriately.
We propose Adaptive Balance of BN (AdaB$2$N), which incorporates appropriately a Bayesian-based strategy to adapt task-wise contributions.
Our approach achieves significant performance gains across a wide range of benchmarks.
arXiv Detail & Related papers (2023-10-13T04:50:40Z)
- Communication Load Balancing via Efficient Inverse Reinforcement Learning [13.052338083552863]
In this work, we tackle the communication load balancing problem from an inverse reinforcement learning (IRL) approach.
We infer a reward function from a set of demonstrations, and then learn a reinforcement learning load balancing policy with the inferred reward function.
Compared to classical RL-based solutions, the proposed solution can be more general and more suitable for real-world scenarios.
arXiv Detail & Related papers (2023-03-22T22:23:23Z)
- DL-DRL: A double-level deep reinforcement learning approach for large-scale task scheduling of multi-UAV [65.07776277630228]
We propose a double-level deep reinforcement learning (DL-DRL) approach based on a divide and conquer framework (DCF)
Particularly, we design an encoder-decoder structured policy network in our upper-level DRL model to allocate the tasks to different UAVs.
We also exploit another attention based policy network in our lower-level DRL model to construct the route for each UAV, with the objective to maximize the number of executed tasks.
arXiv Detail & Related papers (2022-08-04T04:35:53Z)
- Multi-Agent Reinforcement Learning for Network Load Balancing in Data Center [4.141301293112916]
This paper presents the network load balancing problem, a challenging real-world task for reinforcement learning methods.
The cooperative network load balancing task is formulated as a Dec-POMDP problem, which naturally induces the MARL methods.
To bridge the reality gap for applying learning-based methods, all methods are directly trained and evaluated on an emulation system.
arXiv Detail & Related papers (2022-01-27T18:47:59Z)
- Reinforced Workload Distribution Fairness [3.7384509727711923]
This paper proposes a distributed reinforcement learning mechanism to improve the fairness of the workload distribution achieved by a load balancer, with no active load-balancer state monitoring and only limited network observations.
Preliminary results show promise in RL-based load balancing algorithms, and identify additional challenges and future research directions.
arXiv Detail & Related papers (2021-10-29T07:51:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.