Related papers: Stochastic Optimization with Laggard Data Pipelines

Stochastic Optimization with Laggard Data Pipelines

URL: http://arxiv.org/abs/2010.13639v1
Date: Mon, 26 Oct 2020 14:55:31 GMT
Title: Stochastic Optimization with Laggard Data Pipelines
Authors: Naman Agarwal, Rohan Anil, Tomer Koren, Kunal Talwar, Cyril Zhang
Abstract summary: We show that "dataechoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts. Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
Score: 65.20044914532221
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: State-of-the-art optimization is steadily shifting towards massively parallel pipelines with extremely large batch sizes. As a consequence, CPU-bound preprocessing and disk/memory/network operations have emerged as new performance bottlenecks, as opposed to hardware-accelerated gradient computations. In this regime, a recently proposed approach is data echoing (Choi et al., 2019), which takes repeated gradient steps on the same batch while waiting for fresh data to arrive from upstream. We provide the first convergence analyses of "data-echoed" extensions of common optimization methods, showing that they exhibit provable improvements over their synchronous counterparts. Specifically, we show that in convex optimization with stochastic minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.

Related papers

Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models [33.911521719528686]
Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usages. A promising approach is using Zeroth-Order (ZO) gradients, which estimates to replace First-Order (FO) gradients. We introduce a novel layer-wise sparse computation and memory efficient ZO, named LeZO.
arXiv Detail & Related papers (2024-10-13T12:47:37Z)
Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates [1.9241821314180372]
One major shortcoming of backpropagation is the interlocking between the forward and backward phases of the algorithm. We propose a method that parallelises SGD updates across the layers of a model by asynchronously updating them from multiple threads. We show that this approach yields close to state-of-theart results while running up to 2.97x faster than Hogwild! scaled on multiple devices.
arXiv Detail & Related papers (2024-10-08T12:32:36Z)
UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings. We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm. UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
OptEx: Expediting First-Order Optimization with Approximately Parallelized Iterations [12.696136981847438]
We introduce first-order optimization expedited with approximately parallelized iterations (OptEx) OptEx is the first framework that enhances the efficiency of FOO by leveraging parallel computing to mitigate its iterative bottleneck. We provide theoretical guarantees for the reliability of our kernelized gradient estimation and the complexity of SGD-based OptEx.
arXiv Detail & Related papers (2024-02-18T02:19:02Z)
Ordering for Non-Replacement SGD [7.11967773739707]
We seek to find an ordering that can improve the convergence rates for the non-replacement form of the algorithm. We develop optimal orderings for constant and decreasing step sizes for strongly convex and convex functions. In addition, we are able to combine the ordering with mini-batch and further apply it to more complex neural networks.
arXiv Detail & Related papers (2023-06-28T00:46:58Z)
Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications. We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
Sampling without Replacement Leads to Faster Rates in Finite-Sum Minimax Optimization [12.794526174053134]
We analyze convergence rates of gradient algorithms for smooth finite-sum minimax optimization. We show that, for many such algorithms, sampling points without replacement leads to faster convergence compared to sampling with replacement.
arXiv Detail & Related papers (2022-06-07T00:37:37Z)
Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous gradient descent (DASGD) We show that DASGD converges to a global optimal implementation model under same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z)
On the Utility of Gradient Compression in Distributed Training Systems [9.017890174185872]
We evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD. Surprisingly, we observe that due to computation overheads introduced by gradient compression, the net speedup over vanilla data-parallel training is marginal, if not negative.
arXiv Detail & Related papers (2021-02-28T15:58:45Z)
Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics. We prove that our model-based procedure converges in noisy gradient setting. This is an interesting step for constructing self-tuning quadratics.
arXiv Detail & Related papers (2020-11-09T22:07:30Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.