Stochastic Optimization with Laggard Data Pipelines
- URL: http://arxiv.org/abs/2010.13639v1
- Date: Mon, 26 Oct 2020 14:55:31 GMT
- Title: Stochastic Optimization with Laggard Data Pipelines
- Authors: Naman Agarwal, Rohan Anil, Tomer Koren, Kunal Talwar, Cyril Zhang
- Abstract summary: We show that "dataechoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
- Score: 65.20044914532221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art optimization is steadily shifting towards massively parallel
pipelines with extremely large batch sizes. As a consequence, CPU-bound
preprocessing and disk/memory/network operations have emerged as new
performance bottlenecks, as opposed to hardware-accelerated gradient
computations. In this regime, a recently proposed approach is data echoing
(Choi et al., 2019), which takes repeated gradient steps on the same batch
while waiting for fresh data to arrive from upstream. We provide the first
convergence analyses of "data-echoed" extensions of common optimization
methods, showing that they exhibit provable improvements over their synchronous
counterparts. Specifically, we show that in convex optimization with stochastic
minibatches, data echoing affords speedups on the curvature-dominated part of
the convergence rate, while maintaining the optimal statistical rate.
Related papers
- Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models [33.911521719528686]
Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usages.
A promising approach is using Zeroth-Order (ZO) gradients, which estimates to replace First-Order (FO) gradients.
We introduce a novel layer-wise sparse computation and memory efficient ZO, named LeZO.
arXiv Detail & Related papers (2024-10-13T12:47:37Z) - Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates [1.9241821314180372]
One major shortcoming of backpropagation is the interlocking between the forward and backward phases of the algorithm.
We propose a method that parallelises SGD updates across the layers of a model by asynchronously updating them from multiple threads.
We show that this approach yields close to state-of-theart results while running up to 2.97x faster than Hogwild! scaled on multiple devices.
arXiv Detail & Related papers (2024-10-08T12:32:36Z) - UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z) - OptEx: Expediting First-Order Optimization with Approximately Parallelized Iterations [12.696136981847438]
We introduce first-order optimization expedited with approximately parallelized iterations (OptEx)
OptEx is the first framework that enhances the efficiency of FOO by leveraging parallel computing to mitigate its iterative bottleneck.
We provide theoretical guarantees for the reliability of our kernelized gradient estimation and the complexity of SGD-based OptEx.
arXiv Detail & Related papers (2024-02-18T02:19:02Z) - Ordering for Non-Replacement SGD [7.11967773739707]
We seek to find an ordering that can improve the convergence rates for the non-replacement form of the algorithm.
We develop optimal orderings for constant and decreasing step sizes for strongly convex and convex functions.
In addition, we are able to combine the ordering with mini-batch and further apply it to more complex neural networks.
arXiv Detail & Related papers (2023-06-28T00:46:58Z) - Performance Embeddings: A Similarity-based Approach to Automatic
Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z) - Sampling without Replacement Leads to Faster Rates in Finite-Sum Minimax
Optimization [12.794526174053134]
We analyze convergence rates of gradient algorithms for smooth finite-sum minimax optimization.
We show that, for many such algorithms, sampling points without replacement leads to faster convergence compared to sampling with replacement.
arXiv Detail & Related papers (2022-06-07T00:37:37Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous gradient descent (DASGD)
We show that DASGD converges to a global optimal implementation model under same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - Self-Tuning Stochastic Optimization with Curvature-Aware Gradient
Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in noisy gradient setting.
This is an interesting step for constructing self-tuning quadratics.
arXiv Detail & Related papers (2020-11-09T22:07:30Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.