Stochastic Gradient Descent without Full Data Shuffle
- URL: http://arxiv.org/abs/2206.05830v1
- Date: Sun, 12 Jun 2022 20:04:31 GMT
- Title: Stochastic Gradient Descent without Full Data Shuffle
- Authors: Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli,
Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, Ce Zhang
- Abstract summary: CorgiPile is a hierarchical data shuffling strategy that avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle.
Our results show that CorgiPile achieves a convergence rate comparable to full-shuffle SGD for both deep learning and generalized linear models.
- Score: 65.97105896033815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stochastic gradient descent (SGD) is the cornerstone of modern machine
learning (ML) systems. Despite its computational efficiency, SGD requires
random data access that is inherently inefficient when implemented in systems
that rely on block-addressable secondary storage such as HDD and SSD, e.g.,
TensorFlow/PyTorch and in-DB ML systems over large files. To address this
impedance mismatch, various data shuffling strategies have been proposed to
balance the convergence rate of SGD (which favors randomness) and its I/O
performance (which favors sequential access).
In this paper, we first conduct a systematic empirical study on existing data
shuffling strategies, which reveals that all existing strategies have room for
improvement -- they all suffer in terms of I/O performance or convergence rate.
With this in mind, we propose a simple but novel hierarchical data shuffling
strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full
data shuffle while maintaining a convergence rate comparable to that of SGD with
a full shuffle. We provide a non-trivial theoretical analysis of
CorgiPile on its convergence behavior. We further integrate CorgiPile into
PyTorch by designing new parallel/distributed shuffle operators inside a new
CorgiPileDataSet API. We also integrate CorgiPile into PostgreSQL by
introducing three new physical operators with optimizations. Our experimental
results show that CorgiPile achieves a convergence rate comparable to that of
full-shuffle SGD for both deep learning and generalized linear models.
For deep learning models on the ImageNet dataset, CorgiPile is 1.5X faster than
PyTorch with a full data shuffle. For in-DB ML with linear models, CorgiPile is
1.6X-12.8X faster than two state-of-the-art in-DB ML systems, Apache MADlib and
Bismarck, on both HDD and SSD.
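To make the strategy concrete, below is a minimal Python sketch of a hierarchical block-then-buffer shuffle in the spirit of CorgiPile: block order is shuffled cheaply as metadata, a few blocks are read sequentially into an in-memory buffer, and tuples are shuffled only within that buffer. The function and parameter names are illustrative; this is not the paper's CorgiPileDataSet API or its PostgreSQL operators.

```python
import random

def corgipile_style_iterator(num_tuples, block_size, buffer_blocks, seed=0):
    """Two-level shuffle sketch: (1) shuffle the order of sequential blocks,
    (2) read `buffer_blocks` blocks at a time and shuffle tuples inside that
    in-memory buffer. Blocks themselves are read sequentially, so I/O stays
    mostly sequential even though the emitted tuple order is randomized."""
    rng = random.Random(seed)
    blocks = [list(range(start, min(start + block_size, num_tuples)))
              for start in range(0, num_tuples, block_size)]
    rng.shuffle(blocks)                      # block-level shuffle (metadata only, cheap)
    for i in range(0, len(blocks), buffer_blocks):
        buffer = [t for block in blocks[i:i + buffer_blocks] for t in block]
        rng.shuffle(buffer)                  # tuple-level shuffle inside the buffer
        for tuple_id in buffer:
            yield tuple_id                   # in practice: read the tuple at this offset

# Example: 10 tuples, blocks of 3, a 2-block buffer.
print(list(corgipile_style_iterator(10, block_size=3, buffer_blocks=2)))
```

The design keeps I/O largely sequential (whole blocks are read in order within each buffer) while still injecting randomness at both the block and the tuple level.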
Related papers
- OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via Omniverse Computation Balance [35.40320275366383]
Large-scale 3D parallel training on vision-language instruct-tuning models leads to an imbalanced computation load across different devices.
We rebalanced the computational loads from data, model, and memory perspectives to address this issue.
Our method's efficacy and generalizability were further demonstrated across various models and datasets.
arXiv Detail & Related papers (2024-07-30T12:02:58Z)
- Corgi^2: A Hybrid Offline-Online Approach To Storage-Aware Data Shuffling For SGD [5.691144886263981]
We introduce a novel partial data shuffling strategy for Stochastic Gradient Descent (SGD).
It combines an offline iteration of the CorgiPile method with a subsequent online iteration.
Our approach enjoys the best of both worlds: it performs similarly to SGD with random access (even for homogeneous data) without compromising the data access efficiency of CorgiPile.
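A hedged sketch of the hybrid idea follows, under the assumption that the offline step is a single CorgiPile-style pass that rewrites blocks to storage in partially shuffled order; all names are illustrative, not the authors' implementation.

```python
import random

def corgi2_style_offline_pass(blocks, buffer_blocks, seed=0):
    """Offline step (toy version): stream groups of blocks through a shuffled
    in-memory buffer and emit them as new blocks, so tuples are already partially
    mixed in storage before online training starts."""
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)                                   # shuffle block order (metadata only)
    new_blocks = []
    for i in range(0, len(order), buffer_blocks):
        buf = [t for b in order[i:i + buffer_blocks] for t in blocks[b]]
        rng.shuffle(buf)                                 # mix tuples across the buffered blocks
        chunk = max(len(buf) // buffer_blocks, 1)
        new_blocks += [buf[j:j + chunk] for j in range(0, len(buf), chunk)]
    return new_blocks                                    # would be written back to storage

# Online step: run a CorgiPile-style block/buffer shuffle (see the earlier sketch)
# over these reorganized blocks during training.
```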
arXiv Detail & Related papers (2023-09-04T14:49:27Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
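For context, the sketch below shows the classical column-row sampling (CRS) estimator of a matrix product that WTA-CRS builds on; it is the standard unbiased baseline, not the paper's winner-take-all variant.

```python
import numpy as np

def crs_matmul_estimate(a, b, k, rng=None):
    """Classic column-row sampling: draw k column/row index pairs with probability
    proportional to ||A[:, i]|| * ||B[i, :]|| and rescale each sampled rank-1 term
    so the sum is an unbiased estimate of A @ B."""
    rng = rng or np.random.default_rng(0)
    norms = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(a.shape[1], size=k, p=p)
    est = np.zeros((a.shape[0], b.shape[1]))
    for i in idx:
        est += np.outer(a[:, i], b[i, :]) / (k * p[i])   # unbiased rank-1 contribution
    return est

# The estimate converges to A @ B as k grows.
a, b = np.random.randn(64, 256), np.random.randn(256, 32)
approx = crs_matmul_estimate(a, b, k=128)
```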
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Okapi: Generalising Better by Making Statistical Matches Match [7.392460712829188]
Okapi is a simple, efficient, and general method for robust semi-supervised learning based on online statistical matching.
Our method uses a nearest-neighbours-based matching procedure to generate cross-domain views for a consistency loss.
We show that it is in fact possible to leverage additional unlabelled data to improve upon empirical risk minimisation.
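A rough sketch of how a nearest-neighbour match can feed a consistency loss is given below, assuming a hypothetical `model.encode` feature extractor; it illustrates the general idea rather than Okapi's exact objective.

```python
import torch
import torch.nn.functional as F

def nn_consistency_loss(model, target_x, source_x, source_feats):
    """Match each unlabelled target example to its nearest source example in
    feature space, then penalize disagreement between the model's predictions
    on the matched pair (a KL consistency term)."""
    with torch.no_grad():
        target_feats = model.encode(target_x)            # `encode` is a hypothetical feature head
        nn_idx = torch.cdist(target_feats, source_feats).argmin(dim=1)
        p_source = F.softmax(model(source_x[nn_idx]), dim=1)
    log_p_target = F.log_softmax(model(target_x), dim=1)
    return F.kl_div(log_p_target, p_source, reduction="batchmean")
```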
arXiv Detail & Related papers (2022-11-07T12:41:17Z)
- Few-Shot Non-Parametric Learning with Deep Latent Variable Model [50.746273235463754]
We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV).
NPC-LV is a learning framework for any dataset with abundant unlabeled data but very few labeled examples.
We show that NPC-LV outperforms supervised methods on all three datasets for image classification in the low-data regime.
arXiv Detail & Related papers (2022-06-23T09:35:03Z)
- Injecting Domain Adaptation with Learning-to-hash for Effective and Efficient Zero-shot Dense Retrieval [49.98615945702959]
We evaluate LTH and vector compression techniques for improving the downstream zero-shot retrieval accuracy of the TAS-B dense retriever.
Our results demonstrate that, unlike prior work, LTH strategies, when applied naively, can underperform the zero-shot TAS-B dense retriever by up to 14% nDCG@10 on average.
arXiv Detail & Related papers (2022-05-23T17:53:44Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- Superiority of Simplicity: A Lightweight Model for Network Device Workload Prediction [58.98112070128482]
We propose a lightweight solution for time series prediction based on historic observations.
It consists of a heterogeneous ensemble method composed of two models: a neural network and a mean predictor.
It achieves an overall $R^2$ score of 0.10 on the available FedCSIS 2020 challenge dataset.
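A toy illustration of such a two-model ensemble follows, assuming a simple weighted blend of the neural-network forecast and the historical mean; the weighting scheme is an assumption, not the paper's.

```python
import numpy as np

def ensemble_forecast(history, nn_prediction, weight=0.5):
    """Blend a neural-network forecast with a simple mean predictor computed
    from the historic observations (equal weighting here is an assumption)."""
    mean_prediction = float(np.mean(history))
    return weight * nn_prediction + (1.0 - weight) * mean_prediction

# Example with a hypothetical NN forecast of 4.8 for the next value.
print(ensemble_forecast([3.0, 4.0, 5.0], nn_prediction=4.8))
```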
arXiv Detail & Related papers (2020-07-07T15:44:16Z)
- Sliced Iterative Normalizing Flows [7.6146285961466]
We develop an iterative (greedy) deep learning (DL) algorithm which is able to transform an arbitrary probability distribution function (PDF) into the target PDF.
As special cases of this algorithm, we introduce two sliced iterative Normalizing Flow (SINF) models, which map from the data to the latent space (GIS) and vice versa.
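Below is a minimal sketch of the data-to-latent direction of such an iterative sliced transform, using random rotations and 1D marginal Gaussianization; the paper optimizes the slicing directions and keeps every step invertible, which this toy omits.

```python
import numpy as np
from scipy.stats import norm

def sliced_gaussianization(x, n_iters=50, seed=0):
    """Toy data-to-latent pass: repeatedly rotate the data with a random
    orthogonal matrix and Gaussianize each 1D marginal via its empirical CDF."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float).copy()
    n, d = x.shape
    for _ in range(n_iters):
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random slicing directions
        x = x @ q
        for j in range(d):
            ranks = np.argsort(np.argsort(x[:, j]))        # empirical CDF via ranks
            x[:, j] = norm.ppf((ranks + 0.5) / n)          # map marginal to N(0, 1)
    return x
```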
arXiv Detail & Related papers (2020-07-01T18:00:04Z)