Importance of Data Loading Pipeline in Training Deep Neural Networks
- URL: http://arxiv.org/abs/2005.02130v1
- Date: Tue, 21 Apr 2020 14:19:48 GMT
- Title: Importance of Data Loading Pipeline in Training Deep Neural Networks
- Authors: Mahdi Zolnouri and Xinlin Li and Vahid Partovi Nia
- Abstract summary: In large models, the time spent loading data accounts for a significant portion of training time.
We compare two tools: a binary data format to accelerate data reading, and NVIDIA DALI to accelerate data augmentation.
Our study shows improvements on the order of 20% to 40% when such dedicated tools are used.
- Score: 2.127049691404299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large-scale deep neural networks is a time-consuming
operation, often requiring many GPUs to accelerate it. In large models, the time
spent loading data accounts for a significant portion of total training time. As GPU
servers are typically expensive, tricks that can save training time are
valuable. Slow training is observed especially in real-world applications where
exhaustive data augmentation operations are required. Data augmentation
techniques include padding, rotation, adding noise, downsampling, upsampling,
etc. These additional operations increase the need to build an
efficient data loading pipeline, and to explore existing tools to speed up
training time. We focus on comparing two main tools designed for this task,
namely a binary data format to accelerate data reading, and NVIDIA DALI to
accelerate data augmentation. Our study shows improvements on the order of 20%
to 40% when such dedicated tools are used.
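To make the two tools concrete, the sketches below illustrate the general idea only; file names, paths, and parameter values are placeholders rather than the exact setup benchmarked in the paper. The first sketch packs a folder of images into one binary file plus an offset index, so training seeks into a single large file instead of opening thousands of small ones (the paper's specific binary format is not reproduced here).

```python
# Minimal sketch of a packed binary image dataset (illustrative, not the paper's exact format).
import io

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


def pack_images(file_paths, labels, out_prefix):
    """Concatenate encoded image bytes into one .bin file and save an (offset, length, label) index."""
    records, cursor = [], 0
    with open(out_prefix + ".bin", "wb") as out:
        for path, label in zip(file_paths, labels):
            data = open(path, "rb").read()
            out.write(data)
            records.append((cursor, len(data), label))
            cursor += len(data)
    np.save(out_prefix + "_index.npy", np.asarray(records, dtype=np.int64))


class PackedImageDataset(Dataset):
    """Reads samples by seeking into the packed file, avoiding per-file filesystem overhead."""

    def __init__(self, out_prefix, transform=None):
        self.prefix = out_prefix
        self.index = np.load(out_prefix + "_index.npy")
        self.transform = transform
        self.bin_file = None  # opened lazily so each DataLoader worker gets its own handle

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        if self.bin_file is None:
            self.bin_file = open(self.prefix + ".bin", "rb")
        offset, length, label = self.index[i]
        self.bin_file.seek(offset)
        img = Image.open(io.BytesIO(self.bin_file.read(length))).convert("RGB")
        return (self.transform(img) if self.transform else img), int(label)
```

The second sketch offloads JPEG decoding and augmentation to the GPU with NVIDIA DALI. The operators used (fn.readers.file, fn.decoders.image, fn.random_resized_crop, fn.crop_mirror_normalize) are standard DALI operators, but the pipeline below is a minimal example, not the configuration evaluated in the paper.

```python
# Minimal NVIDIA DALI pipeline sketch: GPU-accelerated decoding and augmentation for PyTorch.
from nvidia.dali import fn, pipeline_def, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator


@pipeline_def
def train_pipe(data_dir):
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")          # nvJPEG decode on the GPU
    images = fn.random_resized_crop(images, size=(224, 224))   # augmentation runs on the GPU
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        mirror=fn.random.coin_flip(),
    )
    return images, labels.gpu()


pipe = train_pipe(data_dir="/path/to/train",  # placeholder dataset path
                  batch_size=128, num_threads=4, device_id=0)
pipe.build()
loader = DALIGenericIterator(pipe, ["data", "label"], reader_name="Reader")

for batch in loader:
    images, labels = batch[0]["data"], batch[0]["label"]
    # ... forward/backward pass here ...
```

In both cases the goal is the same: keep the GPU fed, either by cutting per-sample I/O cost or by moving decode and augmentation work off the CPU.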
Related papers
- DCP: Learning Accelerator Dataflow for Neural Network via Propagation [52.06154296196845]
This work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort.
DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives.
For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples.
arXiv Detail & Related papers (2024-10-09T05:16:44Z) - TensorSocket: Shared Data Loading for Deep Learning Training [0.0]
Deep learning training is a repetitive and resource-intensive process.
TensorSocket enables simultaneous training processes to share the same data loader.
Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing and increases training throughput by up to 100%.
arXiv Detail & Related papers (2024-09-27T13:39:47Z) - Efficient Asynchronous Federated Learning with Sparsification and
Quantization [55.6801207905772]
Federated Learning (FL) is attracting more and more attention to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices throughout the model training process.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z) - Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z) - InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep
Recommendation Models [3.7414278978078204]
Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems.
The systems challenges faced in this setting are unique; while typical deep learning training jobs are dominated by model execution, the most important factor in DLRM training performance is often online data ingestion.
arXiv Detail & Related papers (2023-08-13T18:28:56Z) - CiT: Curation in Training for Effective Vision-Language Data [84.77867625605053]
This paper presents Curation in Training (CiT), a vision-text learning algorithm that couples a data objective into training.
CiT automatically yields quality data to speed-up contrastive image-text training.
We observe that CiT can speed up training by over an order of magnitude, especially if the raw data size is large.
arXiv Detail & Related papers (2023-01-05T18:59:57Z) - Profiling and Improving the PyTorch Dataloader for high-latency Storage:
A Technical Report [0.7349727826230862]
This work focuses on the data loading pipeline in the PyTorch Framework.
We show that for classification tasks that involve loading many files, like images, the training wall-time can be significantly improved.
With our new, modified ConcurrentDataloader we can improve GPU utilization and reduce batch loading time by up to 12x (a generic DataLoader tuning sketch follows after this list).
arXiv Detail & Related papers (2022-11-09T14:16:30Z) - Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning
Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z) - Dynamic Network-Assisted D2D-Aided Coded Distributed Learning [59.29409589861241]
We propose a novel device-to-device (D2D)-aided coded federated learning method (D2D-CFL) for load balancing across devices.
We derive an optimal compression rate for achieving minimum processing time and establish its connection with the convergence time.
Our proposed method is beneficial for real-time collaborative applications, where the users continuously generate training data.
arXiv Detail & Related papers (2021-11-26T18:44:59Z) - Multi-node Bert-pretraining: Cost-efficient Approach [6.5998084177955425]
Large scale Transformer-based language models have brought about exciting leaps in state-of-the-art results for many Natural Language Processing (NLP) tasks.
With the advent of large-scale unsupervised datasets, training time is further extended due to the increased amount of data samples within a single training epoch.
We show that we are able to perform pre-training on BERT within a reasonable time budget (12 days) in an academic setting.
arXiv Detail & Related papers (2020-08-01T05:49:20Z)
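Several of the entries above (e.g. the PyTorch Dataloader report and the preprocessing-pipelines study) revolve around tuning the stock data loader before reaching for specialized tools. As a generic illustration only (the parameter values are placeholders and ConcurrentDataloader itself is not shown), the standard PyTorch knobs look like this:

```python
# Generic torch.utils.data.DataLoader tuning sketch (illustrative values, not from any single paper above).
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in practice this would be an image Dataset such as the packed one sketched earlier.
dataset = TensorDataset(torch.randn(512, 3, 224, 224), torch.randint(0, 1000, (512,)))

loader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=8,            # parallel worker processes for loading and augmentation
    pin_memory=True,          # page-locked host memory speeds up host-to-GPU copies
    prefetch_factor=4,        # batches prefetched per worker (requires num_workers > 0)
    persistent_workers=True,  # keep workers alive across epochs to avoid respawn cost
)

for images, labels in loader:
    pass  # training step would go here
```

Profiling usually amounts to sweeping num_workers and prefetch_factor while watching GPU utilization; only when the stock loader remains the bottleneck do binary formats or DALI pay off.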
This list is automatically generated from the titles and abstracts of the papers in this site.