DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems
- URL: http://arxiv.org/abs/2510.08522v1
- Date: Thu, 09 Oct 2025 17:48:24 GMT
- Title: DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems
- Authors: Yuanjun Dai, Keqiang He, An Wang
- Abstract summary: Existing batch size selection approaches rely on static allocation or simplistic heuristics that fail to adapt to heterogeneous, dynamic computing environments. We present DYNAMIX, a reinforcement learning framework that formulates batch size optimization as a sequential decision-making problem using Proximal Policy Optimization (PPO). Our approach employs a multi-dimensional state representation encompassing network-level metrics, system-level resource utilization, and training statistical efficiency indicators to enable informed decision-making across diverse computational resources.
- Score: 2.472349172396126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing batch size selection approaches in distributed machine learning rely on static allocation or simplistic heuristics that fail to adapt to heterogeneous, dynamic computing environments. We present DYNAMIX, a reinforcement learning framework that formulates batch size optimization as a sequential decision-making problem using Proximal Policy Optimization (PPO). Our approach employs a multi-dimensional state representation encompassing network-level metrics, system-level resource utilization, and training statistical efficiency indicators to enable informed decision-making across diverse computational resources. Our approach eliminates the need for explicit system modeling while integrating seamlessly with existing distributed training frameworks. Through evaluations across diverse workloads, hardware configurations, and network conditions, DYNAMIX achieves up to 6.3% improvement in the final model accuracy and 46% reduction in the total training time. Our scalability experiments demonstrate that DYNAMIX maintains the best performance as cluster size increases to 32 nodes, while policy transfer experiments show that learned policies generalize effectively across related model architectures.
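The formulation described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the `WorkerState` fields, the `ACTIONS` scaling factors, and the batch size bounds are all hypothetical, since the abstract names only the three categories of state (network-level, system-level, statistical efficiency). The last function is the standard per-sample PPO clipped surrogate term.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical action set: multiplicative changes to the batch size.
ACTIONS = (0.5, 1.0, 2.0)


@dataclass
class WorkerState:
    """Illustrative observation; field names are assumptions, not from the paper."""
    network_throughput_mbps: float  # network-level metric
    gpu_utilization: float          # system-level resource utilization, in [0, 1]
    gradient_noise: float           # stand-in for a statistical efficiency indicator

    def as_vector(self) -> List[float]:
        """Flatten the observation into the vector a policy would consume."""
        return [self.network_throughput_mbps, self.gpu_utilization, self.gradient_noise]


def apply_action(batch_size: int, action_index: int, lo: int = 16, hi: int = 1024) -> int:
    """Scale the batch size by the chosen factor and clamp it to [lo, hi]."""
    scaled = int(batch_size * ACTIONS[action_index])
    return max(lo, min(hi, scaled))


def ppo_clipped_term(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

In a sequential setting, a policy would read `as_vector()` at each decision step, pick an index into `ACTIONS`, and have `apply_action` produce the next batch size; the clipping in `ppo_clipped_term` limits how far a single policy update can move away from the previous policy.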
Related papers
- Resource-Aware Aggregation and Sparsification in Heterogeneous Ensemble Federated Learning [0.9176056742068811]
Federated learning (FL) enables distributed training with private client data. Current ensemble-based FL methods fall short in capturing diversity of model predictions. We propose SHEFL, a global ensemble-based FL framework suited for clients with diverse computational capacities.
arXiv Detail & Related papers (2025-08-12T01:40:46Z)
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
- LAPSO: A Unified Optimization View for Learning-Augmented Power System Operations [3.754570687412345]
This paper proposes a holistic framework of Learning-Augmented Power System Operations (LAPSO). LAPSO is centered on the operation stage and aims to break the boundary between temporally siloed power system tasks. A dedicated Python package, lapso, is introduced to automatically augment existing power system optimization models with learnable components.
arXiv Detail & Related papers (2025-05-08T13:00:24Z)
- Integrating Personalized Federated Learning with Control Systems for Enhanced Performance [0.0]
This paper introduces a novel framework that amalgamates personalized federated learning with robust control systems. Our approach harnesses personalized algorithms that adapt to the unique characteristics of each client's data. We demonstrate that our integrated system outperforms standard federated learning models in terms of accuracy and learning speed.
arXiv Detail & Related papers (2025-01-27T01:52:15Z)
- Adaptive Layer Splitting for Wireless LLM Inference in Edge Computing: A Model-Based Reinforcement Learning Approach [18.153641696306707]
This study introduces a framework taking inspiration from model-based reinforcement learning (MBRL) to determine the optimal splitting point across the edge and user equipment (UE).
By incorporating a reward surrogate model, our approach significantly reduces the computational cost of frequent performance evaluations.
arXiv Detail & Related papers (2024-06-03T09:41:42Z)
- Context-Aware Orchestration of Energy-Efficient Gossip Learning Schemes [8.382766344930157]
We present a distributed training approach based on the combination of Gossip Learning with adaptive optimization of the learning process.
We propose a data-driven approach to OGL management that relies on optimizing in real-time for each node.
Results suggest that our approach is highly efficient and effective in a broad spectrum of network scenarios.
arXiv Detail & Related papers (2024-04-18T09:17:46Z)
- Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
- Federated Conditional Stochastic Optimization [110.513884892319]
Conditional optimization has found applications in a wide range of machine learning tasks, such as invariant learning tasks, AUPRC, and AML.
This paper proposes algorithms for distributed federated learning.
arXiv Detail & Related papers (2023-10-04T01:47:37Z)
- Distributionally Robust Model-based Reinforcement Learning with Large State Spaces [55.14361269378122]
Three major challenges in reinforcement learning are the complex dynamical systems with large state spaces, the costly data acquisition processes, and the deviation of real-world dynamics from the training environment deployment.
We study distributionally robust Markov decision processes with continuous state spaces under the widely used Kullback-Leibler, chi-square, and total variation uncertainty sets.
We propose a model-based approach that utilizes Gaussian Processes and the maximum variance reduction algorithm to efficiently learn multi-output nominal transition dynamics.
arXiv Detail & Related papers (2023-09-05T13:42:11Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training [42.514897110537596]
Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train.
Designing such clusters to maximize both performance and utilization, and so amortize their steep cost, is a challenging task.
We introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training.
arXiv Detail & Related papers (2022-11-30T00:32:37Z)
- Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning [89.31889875864599]
We propose an efficient model-based reinforcement learning algorithm for learning in multi-agent systems.
Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for MFC.
We provide a practical parametrization of the core optimization problem.
arXiv Detail & Related papers (2021-07-08T18:01:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.