The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution
- URL: http://arxiv.org/abs/2501.12407v4
- Date: Sun, 16 Feb 2025 21:42:09 GMT
- Title: The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution
- Authors: Frank Sifei Luan, Ziming Mao, Ron Yifeng Wang, Charlotte Lin, Amog Kamsetty, Hao Chen, Cheng Su, Balaji Veeramani, Scott Lee, SangBin Cho, Clark Zinzow, Eric Liang, Ion Stoica, Stephanie Wang
- Abstract summary: We introduce the streaming batch model, a hybrid of the two models that enables efficient and fault-tolerant heterogeneous execution. We present Ray Data, an implementation of the streaming batch model that improves throughput on heterogeneous batch inference pipelines by 3--8$\times$ compared to traditional batch and stream processing systems.
- Score: 20.926218346718482
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While ML model training and inference are both GPU-intensive, CPU-based data processing is often the bottleneck. Distributed data processing systems based on the batch or stream processing models assume homogeneous resource requirements. They excel at CPU-based computation but either under-utilize heterogeneous resources or impose high overheads on failure and reconfiguration. We introduce the streaming batch model, a hybrid of the two models that enables efficient and fault-tolerant heterogeneous execution. The key idea is to execute one partition at a time to allow lineage-based recovery with dynamic resource allocation. This enables memory-efficient pipelining across heterogeneous resources, similar to stream processing, but also offers the elasticity and fault tolerance properties of batch processing. We present Ray Data, an implementation of the streaming batch model that improves throughput on heterogeneous batch inference pipelines by 3--8$\times$ compared to traditional batch and stream processing systems. When training Stable Diffusion, Ray Data matches the throughput of single-node ML data loaders while additionally leveraging distributed heterogeneous clusters to further improve training throughput by 31%.
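To make the execution model concrete, below is a minimal sketch of a heterogeneous CPU-plus-GPU batch inference pipeline written against Ray Data's public map_batches API. It is illustrative only: the input path, the toy identity model, and the batch-size/concurrency settings are assumptions rather than values from the paper; the point is that the CPU preprocessing stage and the GPU inference stage stream partitions concurrently instead of materializing the whole dataset.

```python
# Hedged sketch (not from the paper): a heterogeneous batch inference pipeline
# using Ray Data. Paths, model, and tuning parameters are placeholders.
import numpy as np
import ray
import torch

ds = ray.data.read_images("s3://example-bucket/images/")  # hypothetical input path

def preprocess(batch: dict) -> dict:
    # CPU stage: normalize pixels. Ray Data streams one partition (block) at a
    # time through this operator. Assumes all images share a common shape.
    batch["image"] = batch["image"].astype(np.float32) / 255.0
    return batch

class Infer:
    def __init__(self):
        # Stateful GPU stage: the model is loaded once per actor replica
        # (a toy identity model stands in for a real network here).
        self.model = torch.nn.Identity().cuda()

    def __call__(self, batch: dict) -> dict:
        x = torch.as_tensor(batch["image"], device="cuda")
        with torch.no_grad():
            batch["output"] = self.model(x).cpu().numpy()
        return batch

preds = (
    ds.map_batches(preprocess, batch_size=256)          # CPU tasks
      .map_batches(Infer, batch_size=64, num_gpus=1,    # pool of GPU actors
                   concurrency=4)
)
preds.write_parquet("/tmp/predictions")  # placeholder output path
```

Because each stage processes one partition at a time, a lost partition can be recomputed from lineage without restarting the whole pipeline, and the CPU and GPU pools can be resized independently.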
Related papers
- StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs).
StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks.
Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z) - OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training [17.215899004049778]
We present OVERLORD, an industrial-grade distributed data loading architecture with three innovations.
OVERLORD achieves: (1) 4.5x end-to-end training throughput improvement; (2) a minimum 3.6x reduction in CPU memory usage.
arXiv Detail & Related papers (2025-04-14T03:31:22Z) - HybridFlow: A Flexible and Efficient RLHF Framework [13.80577212781375]
Reinforcement Learning from Human Feedback is widely used in Large Language Model (LLM) alignment.
Traditional RL can be modeled as a dataflow, where each node represents the computation of a neural network (NN).
We propose HybridFlow, which combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow.
arXiv Detail & Related papers (2024-09-28T06:20:03Z) - High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions that enable learning a communication-efficient distributed logistic regression model.
Our experiments demonstrate a large improvement in accuracy over prior distributed algorithms, requiring only a few distributed update steps.
arXiv Detail & Related papers (2024-07-08T19:34:39Z) - Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z) - Efficient Parallel Reinforcement Learning Framework using the Reactor Model [2.190190313041532]
Reinforcement Learning (RL) frameworks are essential for mapping RL workloads to multiple computational resources.
Existing frameworks, such as Ray, do not manage this orchestration efficiently.
We propose a solution based on the reactor model, which enforces a fixed communication pattern among a set of actors.
arXiv Detail & Related papers (2023-12-07T21:19:57Z) - In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching [1.047192732651018]
Current techniques for distributed model training mostly assume that clusters consist of servers with constant resource availability.
We develop a dynamic technique for distributed data-parallel training that adjusts the mini-batch sizes on each worker based on availability and throughput.
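As a rough illustration of that idea (a sketch of the general principle, not the paper's actual algorithm), one simple policy keeps the global batch size fixed and splits it across workers in proportion to each worker's measured throughput; the worker names and numbers below are hypothetical.

```python
# Hedged sketch of throughput-proportional mini-batch sizing.
from typing import Dict

def assign_batch_sizes(throughput: Dict[str, float], global_batch: int) -> Dict[str, int]:
    """Split a fixed global batch across workers in proportion to their
    measured samples/sec, so slower workers receive smaller mini-batches."""
    total = sum(throughput.values())
    sizes = {w: int(round(global_batch * t / total)) for w, t in throughput.items()}
    # Fix rounding drift so per-worker sizes still sum to the global batch.
    fastest = max(throughput, key=throughput.get)
    sizes[fastest] += global_batch - sum(sizes.values())
    return sizes

# Example: worker-b is roughly half as fast, so it gets roughly half the samples.
print(assign_batch_sizes({"worker-a": 1000.0, "worker-b": 480.0, "worker-c": 950.0}, 1024))
```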
arXiv Detail & Related papers (2023-05-20T15:33:06Z) - Latent Iterative Refinement for Modular Source Separation [44.78689915209527]
Traditional source separation approaches train deep neural network models end-to-end with all the data available at once.
We argue that we can significantly increase resource efficiency during both training and inference stages.
arXiv Detail & Related papers (2022-11-22T00:02:57Z) - Dual Generator Offline Reinforcement Learning [90.05278061564198]
In offline RL, constraining the learned policy to remain close to the data is essential.
In practice, GAN-based offline RL methods have not performed as well as alternative approaches.
We show that not only does having two generators enable an effective GAN-based offline RL method, but also approximates a support constraint.
arXiv Detail & Related papers (2022-11-02T20:25:18Z) - Drone Referring Localization: An Efficient Heterogeneous Spatial Feature Interaction Method For UAV Self-Localization [22.94589565476653]
We propose an efficient heterogeneous spatial feature interaction method, termed Drone Referring Localization (DRL).
Unlike conventional methods that treat different data sources in isolation, DRL facilitates the learnable interaction of heterogeneous features.
Compared to traditional IR methods, DRL achieves superior localization accuracy (MA@20 +9.4%) while significantly reducing computational time (to 1/7) and storage overhead (to 2/3).
arXiv Detail & Related papers (2022-08-13T03:25:50Z) - Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis in massive Internet of Things (IoT) based intelligent and ubiquitous computing.
With fast-growing applications and data volumes, distributed learning is a promising emerging paradigm, since sharing or aggregating data is often impractical or inefficient.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average 1.52x speedup across six different models over the state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism when training large models, e.g., Megatron-LM and Turing-NLG.
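The "redundant recomputing" half of such a strategy corresponds to activation checkpointing, where intermediate activations are discarded in the forward pass and recomputed during backward. Below is a minimal PyTorch sketch of that building block (illustrative only, not KARMA's implementation); the layer sizes are arbitrary.

```python
# Hedged sketch of activation recomputation (gradient checkpointing) in PyTorch;
# shows only the recompute building block, not KARMA's out-of-core system.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self, dim: int = 1024, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Activations inside `block` are not stored; they are recomputed
            # during the backward pass, trading extra FLOPs for less memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
out = model(torch.randn(32, 1024, requires_grad=True))
out.sum().backward()
```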
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.