VirtualFlow: Decoupling Deep Learning Models from the Underlying
Hardware
- URL: http://arxiv.org/abs/2009.09523v2
- Date: Tue, 11 May 2021 20:35:46 GMT
- Title: VirtualFlow: Decoupling Deep Learning Models from the Underlying
Hardware
- Authors: Andrew Or, Haoyu Zhang, Michael J. Freedman
- Abstract summary: State-of-the-art deep learning systems tightly couple the model with the underlying hardware.
We propose VirtualFlow to decouple the model from the hardware.
In each step of training or inference, the batch of input data is split across virtual nodes instead of hardware accelerators.
- Score: 9.461227523454188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art deep learning systems such as TensorFlow and PyTorch tightly
couple the model with the underlying hardware. This coupling requires the user
to modify application logic in order to run the same job across a different set
of resources, thereby limiting the choice of hardware for a given workload and
potentially forcing the user to forgo more efficient hardware configurations.
We propose VirtualFlow, a system leveraging a novel abstraction called
virtual node processing to decouple the model from the hardware. In each step
of training or inference, the batch of input data is split across virtual nodes
instead of hardware accelerators (e.g. GPUs and TPUs). Mapping multiple virtual
nodes to each accelerator and processing them sequentially effectively time
slices the batch, thereby allowing users to reduce the memory requirement of
their workloads and mimic large batch sizes on small clusters.
Using this technique, VirtualFlow enables many new use cases, such as
reproducing training results across different hardware, resource elasticity,
and heterogeneous training. In our evaluation, our implementation of
VirtualFlow for TensorFlow achieved strong convergence guarantees across
different hardware with out-of-the-box hyperparameters, up to 48% lower job
completion times with resource elasticity, and up to 42% higher throughput with
heterogeneous training.
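The time-slicing described in the abstract behaves like gradient accumulation: splitting a batch across virtual nodes and processing them sequentially yields the same gradient as processing the full batch at once, so the update is independent of how virtual nodes map to accelerators. A minimal numpy sketch of this idea (not VirtualFlow's actual API) for a linear least-squares model:

```python
import numpy as np

def full_batch_grad(w, X, y):
    # Gradient of the mean squared error 0.5 * ||Xw - y||^2 / n over the whole batch.
    n = X.shape[0]
    return X.T @ (X @ w - y) / n

def virtual_node_grad(w, X, y, num_virtual_nodes):
    # Split the batch across virtual nodes and process them sequentially,
    # accumulating gradients: this time-slices the batch on a single device.
    n = X.shape[0]
    grad = np.zeros_like(w)
    for Xv, yv in zip(np.array_split(X, num_virtual_nodes),
                      np.array_split(y, num_virtual_nodes)):
        # Each virtual node contributes its slice's (unnormalized) gradient.
        grad += Xv.T @ (Xv @ w - yv)
    return grad / n

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = rng.normal(size=32)
w = rng.normal(size=4)

g_full = full_batch_grad(w, X, y)
g_virtual = virtual_node_grad(w, X, y, num_virtual_nodes=8)
print(np.allclose(g_full, g_virtual))  # prints True
```

Because the accumulated gradient is identical for any number of virtual nodes, the same batch size can run on one accelerator or many, which is what enables reproducing results across hardware with out-of-the-box hyperparameters.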
Related papers
- TensorSocket: Shared Data Loading for Deep Learning Training [0.0]
Deep learning training is a repetitive and resource-intensive process.
TensorSocket enables simultaneous training processes to share the same data loader.
Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing and increases training throughput by up to 100%.
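The core idea of shared data loading can be illustrated with stdlib tools alone: one loader produces each batch once, and every co-located training process consumes the same stream instead of re-reading and re-transforming the data. A hypothetical sketch (not TensorSocket's actual interface):

```python
import itertools

def loader(dataset, batch_size):
    # Load (and, in practice, augment) each batch exactly once.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]

def broadcast(batches, n_consumers):
    # Fan a single batch stream out to n consumers (cf. itertools.tee).
    return itertools.tee(batches, n_consumers)

data = list(range(10))
train_a, train_b = broadcast(loader(data, batch_size=4), 2)
print(list(train_a) == list(train_b))  # prints True: both trainers see identical batches
```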
arXiv Detail & Related papers (2024-09-27T13:39:47Z) - Pipette: Automatic Fine-grained Large Language Model Training Configurator for Real-World Clusters [5.190794062263327]
Training large language models (LLMs) is known to be challenging because of the huge computational and memory capacity requirements.
We propose Pipette, an automatic fine-grained LLM training configurator for real-world clusters.
arXiv Detail & Related papers (2024-05-28T11:59:44Z) - Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training such sizable models.
This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
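The summary above describes training each partition against synthetic intermediate labels so that no activations or gradients need to flow between the devices hosting different segments. A hypothetical numpy sketch with two linear segments (the paper's model and label-generation scheme may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 8))        # inputs
y = rng.normal(size=(64, 2))        # final targets
Z_synth = rng.normal(size=(64, 4))  # synthetic intermediate labels

# Segment 1 (conceptually on "GPU 0"): fit X -> Z_synth by least squares.
W1, *_ = np.linalg.lstsq(X, Z_synth, rcond=None)
# Segment 2 (conceptually on "GPU 1"): fit Z_synth -> y, independently.
W2, *_ = np.linalg.lstsq(Z_synth, y, rcond=None)

# Inference chains the segments; only Z_synth was ever communicated,
# and the two fits could have run concurrently on separate devices.
pred = (X @ W1) @ W2
print(pred.shape)  # prints (64, 2)
```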
arXiv Detail & Related papers (2024-03-17T13:06:29Z) - Green AI: A Preliminary Empirical Study on Energy Consumption in DL
Models Across Different Runtime Infrastructures [56.200335252600354]
It is common practice to deploy pre-trained models on environments distinct from their native development settings.
This led to the introduction of interchange formats such as ONNX, which serve as standard formats for deploying models across different runtime infrastructures.
arXiv Detail & Related papers (2024-02-21T09:18:44Z) - FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems [61.335229621081346]
Federated Learning (FL) has become a viable technique for realizing privacy-enhancing distributed deep learning on the network edge.
In this paper, we propose FLEdge, which complements existing FL benchmarks by enabling a systematic evaluation of client capabilities.
arXiv Detail & Related papers (2023-06-08T13:11:20Z) - PARTIME: Scalable and Parallel Processing Over Time with Deep Neural
Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z) - Accelerating GAN training using highly parallel hardware on public cloud [0.3694429692322631]
This work explores different types of cloud services to train a Generative Adversarial Network (GAN) in a parallel environment.
We parallelize the training process on multiple GPUs and Google Tensor Processing Units (TPUs).
Linear speed-up of the training process is obtained, while retaining most of the performance in terms of physics results.
arXiv Detail & Related papers (2021-11-08T16:59:15Z) - OneFlow: Redesign the Distributed Deep Learning Framework from Scratch [17.798586916628174]
OneFlow is a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model.
SBP enables much easier programming of data parallelism and model parallelism than existing frameworks.
OneFlow outperforms many well-known customized libraries built on top of the state-of-the-art frameworks.
arXiv Detail & Related papers (2021-10-28T11:32:14Z) - Accelerating Training and Inference of Graph Neural Networks with Fast
Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
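Neighborhood sampling, as summarized above, bounds per-batch computation and data movement by drawing at most a fixed number of neighbors per seed node instead of using each node's full neighborhood. A hypothetical sketch of the sampling step (the paper's performance-engineered sampler is far more elaborate):

```python
import random

def sample_neighborhood(adj, seeds, fanout, rng):
    # For each seed node, sample at most `fanout` of its neighbors,
    # so per-batch work scales with fanout rather than with node degree.
    sampled = {}
    for v in seeds:
        nbrs = adj.get(v, [])
        k = min(fanout, len(nbrs))
        sampled[v] = rng.sample(nbrs, k)
    return sampled

adj = {0: [1, 2, 3, 4], 1: [0], 2: [0, 3]}
rng = random.Random(0)
sub = sample_neighborhood(adj, seeds=[0, 2], fanout=2, rng=rng)
print(all(len(n) <= 2 for n in sub.values()))  # prints True
```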
arXiv Detail & Related papers (2021-10-16T02:41:35Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
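The "redundant recomputing" half of the strategy above trades compute for memory: store activations only at checkpoint layers, and recompute the intermediate ones from the nearest checkpoint when the backward pass needs them. A minimal sketch of that mechanism (not KARMA's implementation), using a stand-in elementwise layer:

```python
import numpy as np

def layer(x):
    # Stand-in for any deterministic layer.
    return np.tanh(x)

def forward_checkpointed(x, n_layers, every):
    # Keep only every `every`-th activation instead of all of them.
    ckpts = {0: x}
    for i in range(n_layers):
        x = layer(x)
        if (i + 1) % every == 0:
            ckpts[i + 1] = x
    return x, ckpts

def recompute(ckpts, upto, every):
    # Recompute the activation after layer `upto` from the nearest checkpoint.
    start = (upto // every) * every
    x = ckpts[start]
    for _ in range(start, upto):
        x = layer(x)
    return x

x0 = np.linspace(-1.0, 1.0, 5)
out, ckpts = forward_checkpointed(x0, n_layers=8, every=4)

# Recomputation reproduces the activation that was never stored.
full = x0
for _ in range(6):
    full = layer(full)
print(np.allclose(recompute(ckpts, 6, every=4), full))  # prints True
```

Memory drops from O(n_layers) stored activations to O(n_layers / every) checkpoints, at the cost of re-running at most `every - 1` layers per recomputation.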
arXiv Detail & Related papers (2020-08-26T07:24:34Z) - Neural Network Compression Framework for fast model inference [59.65531492759006]
We present a new framework for neural network compression with fine-tuning, which we call the Neural Network Compression Framework (NNCF).
It leverages recent advances in network compression methods and implements several of them, such as sparsity, quantization, and binarization.
The framework can be used with the training samples supplied with it, or as a standalone package that can be seamlessly integrated into existing training code.
arXiv Detail & Related papers (2020-02-20T11:24:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.