Profiling and Improving the PyTorch Dataloader for high-latency Storage:
A Technical Report
- URL: http://arxiv.org/abs/2211.04908v1
- Date: Wed, 9 Nov 2022 14:16:30 GMT
- Title: Profiling and Improving the PyTorch Dataloader for high-latency Storage:
A Technical Report
- Authors: Ivan Svogor, Christian Eichenberger, Markus Spanring, Moritz Neun,
Michael Kopp
- Abstract summary: This work focuses on the data loading pipeline in the PyTorch Framework.
We show that for classification tasks that involve loading many files, like images, the training wall-time can be significantly improved.
With our new, modified ConcurrentDataloader we improve GPU utilization and significantly reduce batch loading time, by up to 12X.
- Score: 0.7349727826230862
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A growing number of Machine Learning Frameworks recently made Deep Learning
accessible to a wider audience of engineers, scientists, and practitioners, by
allowing straightforward use of complex neural network architectures and
algorithms. However, since deep learning is rapidly evolving, not only through
theoretical advancements but also with respect to hardware and software
engineering, ML frameworks often lose backward compatibility and introduce
technical debt that can lead to bottlenecks and sub-optimal resource
utilization. Moreover, the focus is in most cases not on deep learning
engineering, but rather on new models and theoretical advancements. In this
work, however, we focus on engineering, more specifically on the data loading
pipeline in the PyTorch Framework. We designed a series of benchmarks that
outline performance issues of certain steps in the data loading process. Our
findings show that for classification tasks that involve loading many files,
like images, the training wall-time can be significantly improved. With our
new, modified ConcurrentDataloader we can reach improvements in GPU utilization
and significantly reduce batch loading time, by up to 12X. This allows cloud-based,
S3-like object storage to be used for datasets while keeping training times
comparable to those achieved when datasets are stored on local drives.
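To make the engineering setting concrete, the following is a minimal sketch of the kind of baseline pipeline the report profiles: a classification dataset whose images live in S3-like object storage, read item by item through the stock PyTorch DataLoader. The bucket name, object keys, and labels are hypothetical placeholders, and this is not the authors' ConcurrentDataloader; it only illustrates the standard fetch path whose cost their modification reduces.

import io

import boto3
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms


class S3ImageDataset(Dataset):
    """Hypothetical dataset fetching one image per item from S3-like object storage."""

    def __init__(self, bucket, keys, labels, transform=None):
        self.bucket = bucket          # placeholder bucket name
        self.keys = keys              # one object key per image (placeholder)
        self.labels = labels
        self.transform = transform or transforms.Compose(
            [transforms.Resize((224, 224)), transforms.ToTensor()]
        )
        self._client = None           # created lazily so each worker process gets its own client

    def _s3(self):
        if self._client is None:
            self._client = boto3.client("s3")
        return self._client

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        # One blocking GET per item: on high-latency storage this round trip,
        # repeated for every image in a batch, dominates batch loading time.
        body = self._s3().get_object(Bucket=self.bucket, Key=self.keys[idx])["Body"].read()
        image = Image.open(io.BytesIO(body)).convert("RGB")
        return self.transform(image), self.labels[idx]


# Placeholder metadata for illustration only.
keys = ["train/cat_0001.jpg", "train/dog_0001.jpg"]
labels = [0, 1]

# The stock DataLoader hides some latency by running worker processes; the
# report's ConcurrentDataloader modifies this loading path further (see the
# paper for details), yielding the reported up-to-12X reduction in batch
# loading time.
loader = DataLoader(
    S3ImageDataset("my-training-bucket", keys, labels),
    batch_size=2,
    num_workers=2,
    pin_memory=True,
)

for images, targets in loader:
    images = images.to("cuda" if torch.cuda.is_available() else "cpu")
    # forward/backward pass would go here

This sketch deliberately keeps the per-item GET requests sequential within each worker, which is exactly the behavior the report's benchmarks expose as the bottleneck on high-latency storage.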
Related papers
- Bullion: A Column Store for Machine Learning [4.096087402737292]
This paper presents Bullion, a columnar storage system tailored for machine learning workloads.
Bullion addresses the complexities of data compliance, optimizes the encoding of long-sequence sparse features, efficiently manages wide-table projections, introduces feature quantization in storage, and provides a comprehensive cascading encoding framework.
Preliminary experimental results and theoretical analysis demonstrate Bullion's improved ability to deliver strong performance in the face of the unique demands of machine learning workloads.
arXiv Detail & Related papers (2024-04-13T05:01:54Z)
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
- Continual Learning with Transformers for Image Classification [12.028617058465333]
In computer vision, neural network models struggle to continually learn new concepts without forgetting what has been learnt in the past.
We develop a solution called Adaptive Distillation of Adapters (ADA) to perform continual learning.
We empirically demonstrate on different classification tasks that this method maintains a good predictive performance without retraining the model.
arXiv Detail & Related papers (2022-06-28T15:30:10Z)
- Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference [74.80730361332711]
Few-shot learning is an important and topical problem in computer vision.
We show that a simple transformer-based pipeline yields surprisingly good performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-15T02:55:58Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Standard machine learning lifecycle approaches do not supply the procedures and pipelines needed to actually deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
- Continuum: Simple Management of Complex Continual Learning Scenarios [1.52292571922932]
Continual learning is a machine learning sub-field specialized in settings with non-iid data.
Continual learning's challenge is to create algorithms able to learn an ever-growing amount of knowledge while dealing with data distribution drifts.
Small errors in data loaders have a critical impact on algorithm results.
arXiv Detail & Related papers (2021-02-11T20:29:13Z)
- Importance of Data Loading Pipeline in Training Deep Neural Networks [2.127049691404299]
In large models, the time spent loading data takes a significant portion of model training time.
We evaluate a binary data format to accelerate data reading and NVIDIA DALI to accelerate data augmentation.
Our study shows improvement on the order of 20% to 40% if such dedicated tools are used.
arXiv Detail & Related papers (2020-04-21T14:19:48Z)
- How to 0wn NAS in Your Spare Time [11.997555708723523]
We design an algorithm that reconstructs the key components of a novel deep learning system by exploiting a small amount of information leakage from a cache side-channel attack.
We demonstrate experimentally that we can reconstruct MalConv, a novel data pre-processing pipeline for malware detection, and ProxylessNAS CPU-NAS, a novel network architecture for ImageNet classification.
arXiv Detail & Related papers (2020-02-17T05:40:55Z)
- Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)