TensorSocket: Shared Data Loading for Deep Learning Training
- URL: http://arxiv.org/abs/2409.18749v1
- Date: Fri, 27 Sep 2024 13:39:47 GMT
- Title: TensorSocket: Shared Data Loading for Deep Learning Training
- Authors: Ties Robroek, Neil Kim Nielsen, Pınar Tözün
- Abstract summary: Deep learning training is a repetitive and resource-intensive process.
TensorSocket enables simultaneous training processes to share the same data loader.
Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing and increases training throughput by up to $100\%$.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training deep learning models is a repetitive and resource-intensive process. Data scientists often train several models before landing on a set of parameters (e.g., hyper-parameter tuning), a model architecture (e.g., neural architecture search), among other things, that yields the highest accuracy. The computational efficiency of these training tasks depends highly on how well we can supply the training process with training data. The repetitive nature of these tasks results in the same data processing pipelines running over and over, exacerbating the need for and costs of computational resources. In this paper, we present TensorSocket to reduce the computational needs of deep learning training by enabling simultaneous training processes to share the same data loader. TensorSocket mitigates CPU-side bottlenecks in cases where the collocated training workloads have high throughput on GPU but are held back by lower data-loading throughput on CPU. TensorSocket achieves this by reducing redundant computations across collocated training processes and leveraging modern GPU-GPU interconnects. We demonstrate the hardware- and pipeline-agnostic nature of TensorSocket and evaluate it using a variety of training scenarios. Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing, increases training throughput by up to $100\%$, and, when utilizing cloud instances, achieves cost savings of $50\%$ by reducing the hardware resource needs on the CPU side. Furthermore, TensorSocket outperforms state-of-the-art solutions for shared data loading such as CoorDL and Joader; it is easier to use, maintain, and deploy, and it matches or exceeds the throughput of other solutions while requiring fewer CPU resources.
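The core mechanism described above, one shared data loader producing each batch once and feeding several collocated training processes, can be pictured with a short sketch. The snippet below is a hypothetical, CPU-only illustration of that producer/consumer pattern built from a plain PyTorch DataLoader and multiprocessing queues; it is not the TensorSocket implementation or API, and every name, size, and model in it is a placeholder.

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset


def trainer(q, name):
    """A collocated training process consuming batches from the shared loader."""
    model = torch.nn.Linear(8, 2)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    while (batch := q.get()) is not None:  # None is the end-of-stream sentinel
        x, y = batch
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"{name}: finished")


if __name__ == "__main__":
    # One bounded queue per collocated trainer.
    queues = [mp.Queue(maxsize=4) for _ in range(2)]
    trainers = [mp.Process(target=trainer, args=(q, f"trainer-{i}"))
                for i, q in enumerate(queues)]
    for p in trainers:
        p.start()

    # The "shared loader" lives here: each batch is produced exactly once
    # and broadcast to every trainer instead of being re-loaded per process.
    dataset = TensorDataset(torch.randn(320, 8), torch.randint(0, 2, (320,)))
    for batch in DataLoader(dataset, batch_size=32):
        for q in queues:
            q.put(batch)
    for q in queues:
        q.put(None)

    for p in trainers:
        p.join()
```

TensorSocket itself additionally keeps shared batches close to the GPUs and leverages modern GPU-GPU interconnects, which this CPU-only sketch does not attempt to model.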
Related papers
- Efficient Tabular Data Preprocessing of ML Pipelines [9.23424733090734]
Data preprocessing pipelines are a crucial component of Machine Learning (ML) training.
Piper is a hardware accelerator for data preprocessing; the authors prototype it on FPGAs and demonstrate its potential for training pipelines of commercial recommender systems.
Piper achieves a $4.7\sim71.3\times$ latency speedup over a 128-core CPU server and outperforms a data-center GPU by $4.8\sim20.3\times$ when using binary input.
arXiv Detail & Related papers (2024-09-23T11:07:57Z) - Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training such sizable models.
This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z) - Efficient Asynchronous Federated Learning with Sparsification and
Quantization [55.6801207905772]
Federated Learning (FL) is attracting more and more attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices during the whole process of the model training.
We propose TEASQ-Fed, which lets edge devices asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z) - Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z) - FFCV: Accelerating Training by Removing Data Bottlenecks [84.89623507733963]
We present FFCV, a library for easy and fast machine learning model training.
It speeds up model training by eliminating (often subtle) data bottlenecks from the training process.
Detailed installation instructions, documentation, and Slack support channel are available at https://ffcv.io/.
arXiv Detail & Related papers (2023-06-21T19:06:41Z) - PARTIME: Scalable and Parallel Processing Over Time with Deep Neural
Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample as soon as it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z) - Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning
Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain a 3x to 13x throughput increase compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z) - Scheduling Optimization Techniques for Neural Network Training [3.1617796705744547]
This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training.
We show that GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can commonly be improved by applying ooo backprop.
arXiv Detail & Related papers (2021-10-03T05:45:06Z) - Reservoir Stack Machines [77.12475691708838]
Memory-augmented neural networks equip a recurrent neural network with an explicit memory to support tasks that require information storage.
We introduce the reservoir stack machine, a model which can provably recognize all deterministic context-free languages.
Our results show that the reservoir stack machine achieves zero error, even on test sequences longer than the training data.
arXiv Detail & Related papers (2021-05-04T16:50:40Z) - Importance of Data Loading Pipeline in Training Deep Neural Networks [2.127049691404299]
In large models, data loading takes a significant portion of the overall training time.
We evaluate binary data formats to accelerate data reading and NVIDIA DALI to accelerate data augmentation; a minimal DALI sketch follows after this list.
Our study shows improvement on the order of 20% to 40% if such dedicated tools are used.
arXiv Detail & Related papers (2020-04-21T14:19:48Z) - Characterizing and Modeling Distributed Training with Transient Cloud
GPU Servers [6.56704851092678]
We analyze distributed training performance under diverse cluster configurations using CM-DARE.
Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers.
We also demonstrate the feasibility of predicting training speed and overhead using regression-based models.
arXiv Detail & Related papers (2020-04-07T01:49:58Z)
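As noted in the data-loading entry above, tools such as NVIDIA DALI move image decoding and augmentation onto the GPU to keep training fed with data. Below is a minimal, hypothetical DALI pipeline sketch in that spirit; it is not taken from any of the papers listed here, and the dataset path, image size, and normalization constants are placeholder assumptions (a directory of JPEGs with one sub-folder per class is assumed for `fn.readers.file`).

```python
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator


@pipeline_def
def train_pipeline(data_dir):
    # Read encoded JPEGs and integer labels from disk.
    jpegs, labels = fn.readers.file(
        file_root=data_dir, random_shuffle=True, name="Reader")
    # Decode with the "mixed" backend: CPU parsing, GPU decoding.
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    # GPU-side augmentation: resize, then crop/mirror/normalize.
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels.gpu()


pipe = train_pipeline(
    data_dir="/data/train", batch_size=64, num_threads=4, device_id=0)
pipe.build()
loader = DALIGenericIterator(pipe, ["data", "label"], reader_name="Reader")

for batch in loader:
    images, labels = batch[0]["data"], batch[0]["label"]
    # ... forward/backward pass of the training loop would go here ...
    break
```

Pairing such a GPU-side pipeline with a binary on-disk format, as that paper compares, removes much of the CPU-bound decoding work from the critical path.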
This list is automatically generated from the titles and abstracts of the papers on this site.