RINAS: Training with Dataset Shuffling Can Be General and Fast
- URL: http://arxiv.org/abs/2312.02368v1
- Date: Mon, 4 Dec 2023 21:50:08 GMT
- Title: RINAS: Training with Dataset Shuffling Can Be General and Fast
- Authors: Tianle Zhong, Jiechen Zhao, Xindi Guo, Qiang Su, Geoffrey Fox
- Abstract summary: RINAS is a data loading framework that addresses the performance bottleneck of loading global shuffled datasets.
We implement RINAS under the PyTorch framework for common dataset libraries HuggingFace and TorchVision.
Our experimental results show that RINAS improves the throughput of general language model training and vision model training by up to 59% and 89%, respectively.
- Score: 2.485503195398027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning datasets are expanding at an unprecedented pace, creating new
challenges for data processing in model training pipelines. A crucial aspect of
these pipelines is dataset shuffling, which significantly improves unbiased
learning and convergence accuracy by adhering to the principles of random
sampling. However, loading shuffled data for large datasets incurs significant
overhead in the deep learning pipeline and severely impacts the end-to-end
training throughput. To mitigate this, current deep learning systems often
resort to partial dataset shuffling, sacrificing global randomness to maintain
acceptable training throughput on large datasets, which leaves the efficiency
of global shuffling largely unexplored.
In this work, we present RINAS, a data loading framework that systematically
addresses the performance bottleneck of loading global shuffled datasets. Our
key contribution is to offer an intra-batch unordered data fetching approach,
which unleashes unexplored parallelism of data loading. We implement RINAS
under the PyTorch framework for common dataset libraries HuggingFace and
TorchVision. Our experimental results show that RINAS improves the throughput
of general language model training and vision model training by up to 59% and
89%, respectively.
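
The intra-batch unordered fetching idea described in the abstract can be illustrated with a minimal sketch. This is not the RINAS implementation: the helper names `fetch_sample`, `load_batch_unordered`, and `shuffled_batches` are hypothetical, the thread pool is an assumed stand-in for whatever parallel I/O mechanism the framework uses, and the random sleep only simulates variable storage latency. The sketch relies on the assumption that a mean-reduced batch loss does not depend on the order of samples inside a batch, so samples can be collected as they arrive while the batch membership produced by the global shuffle stays unchanged.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_sample(dataset, index):
    """Hypothetical per-sample loader; the sleep simulates variable storage latency."""
    time.sleep(random.uniform(0.0, 0.01))
    return dataset[index]


def load_batch_unordered(dataset, indices, max_workers=8):
    """Fetch all samples of one batch in parallel and collect them in completion order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_sample, dataset, i) for i in indices]
        # as_completed yields each future as soon as it finishes, so a slow
        # read does not block samples that are already available.
        return [future.result() for future in as_completed(futures)]


def shuffled_batches(dataset, batch_size, seed=0):
    """Shuffle all indices globally, then load each batch without intra-batch ordering."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield load_batch_unordered(dataset, indices[start:start + batch_size])
```

Calling `list(shuffled_batches(list(range(100)), 16))` would yield seven batches that together cover every sample exactly once; only the position of samples within each batch varies from run to run.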
Related papers
- Long-Tailed Recognition on Binary Networks by Calibrating A Pre-trained Model [18.58663937035378]
We address the combined challenge of learning long-tailed distributions using highly resource-efficient binary neural networks as backbones.
We propose a calibrate-and-distill framework that uses off-the-shelf pretrained full-precision models trained on balanced datasets to use as teachers for distillation.
To better generalize to various datasets, we propose a novel adversarial balancing among the terms in the objective function and an efficient multiresolution learning scheme.
arXiv Detail & Related papers (2024-03-30T08:37:19Z)
- Diffusion-based Neural Network Weights Generation [85.6725307453325]
We propose an efficient and adaptive transfer learning scheme through dataset-conditioned pretrained weights sampling.
Specifically, we use a latent diffusion model with a variational autoencoder that can reconstruct the neural network weights.
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
- Group Distributionally Robust Dataset Distillation with Risk Minimization [18.07189444450016]
We introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct DD.
We demonstrate its effective generalization and robustness across subgroups through numerical experiments.
arXiv Detail & Related papers (2024-02-07T09:03:04Z)
- Federated Learning with Projected Trajectory Regularization [65.6266768678291]
Federated learning enables joint training of machine learning models from distributed clients without sharing their local data.
One key challenge in federated learning is to handle non-identically distributed data across the clients.
We propose a novel federated learning framework with projected trajectory regularization (FedPTR) for tackling the data issue.
arXiv Detail & Related papers (2023-12-22T02:12:08Z)
- Integrating Local Real Data with Global Gradient Prototypes for Classifier Re-Balancing in Federated Long-Tailed Learning [60.41501515192088]
Federated Learning (FL) has become a popular distributed learning paradigm that involves multiple clients training a global model collaboratively.
The data samples usually follow a long-tailed distribution in the real world, and FL on the decentralized and long-tailed data yields a poorly-behaved global model.
In this work, we integrate the local real data with the global gradient prototypes to form the local balanced datasets.
arXiv Detail & Related papers (2023-01-25T03:18:10Z)
- Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset of synthetic samples such that models trained on it perform comparably to those trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- Data Selection for Efficient Model Update in Federated Learning [0.07614628596146598]
We propose to reduce the amount of local data that is needed to train a global model.
We do this by splitting the model into a lower part for generic feature extraction and an upper part that is more sensitive to the characteristics of the local data.
Our experiments show that less than 1% of the local data can transfer the characteristics of the client data to the global model.
arXiv Detail & Related papers (2021-11-05T14:07:06Z)
- A Data-Centric Approach for Training Deep Neural Networks with Less Data [1.9014535120129343]
This paper summarizes our winning submission to the "Data-Centric AI" competition.
We discuss some of the challenges that arise while training with a small dataset.
We propose a GAN-based solution for synthesizing new data points.
arXiv Detail & Related papers (2021-10-07T16:41:52Z)
- One Backward from Ten Forward, Subsampling for Large-Scale Deep Learning [35.0157090322113]
Large-scale machine learning systems are often continuously trained with enormous data from production environments.
The sheer volume of streaming data poses a significant challenge to real-time training subsystems and ad-hoc sampling is the standard practice.
We propose to record a constant amount of information per instance from these forward passes. The extra information measurably improves the selection of which data instances should participate in forward and backward passes.
arXiv Detail & Related papers (2021-04-27T11:29:02Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)