Accelerating AI Performance using Anderson Extrapolation on GPUs
- URL: http://arxiv.org/abs/2410.19460v1
- Date: Fri, 25 Oct 2024 10:45:17 GMT
- Title: Accelerating AI Performance using Anderson Extrapolation on GPUs
- Authors: Saleem Abdul Fattah Ahmed Al Dajani, David E. Keyes
- Abstract summary: We present a novel approach for accelerating AI performance by leveraging Anderson extrapolation.
By identifying the crossover point where a mixing penalty is incurred, the method focuses on reducing iterations to convergence.
We demonstrate significant improvements in both training and inference, motivated by scalability and efficiency extensions to the realm of high-performance computing.
- Score: 2.114333871769023
- Abstract: We present a novel approach for accelerating AI performance by leveraging Anderson extrapolation, a vector-to-vector mapping technique based on a window of historical iterations. By identifying the crossover point where a mixing penalty is incurred, the method focuses on reducing the number of iterations to convergence: fewer, more compute-intensive, but generally cacheable iterations, balancing speed and memory usage against accuracy and algorithmic stability, respectively. We demonstrate significant improvements in both training and inference, motivated by scalability and efficiency extensions to the realm of high-performance computing (HPC).
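To make the idea concrete, below is a minimal NumPy sketch of Anderson acceleration applied to a generic fixed-point iteration x = g(x); the window size m, damping parameter beta, and the toy map g are illustrative assumptions, not the authors' GPU implementation.

```python
import numpy as np

def anderson_fixed_point(g, x0, m=5, beta=1.0, tol=1e-10, max_iter=200):
    """Anderson acceleration for the fixed-point iteration x = g(x).

    A minimal Type-II sketch: keep a window of the last m iterate/image pairs,
    solve a small least-squares problem over residual differences, and form a
    mixed update with damping parameter beta (beta = 1 means no damping).
    """
    xs = [np.asarray(x0, dtype=float)]   # iterate history (the "window")
    gs = [g(xs[0])]                      # corresponding g(x) history
    for _ in range(max_iter):
        x, gx = xs[-1], gs[-1]
        f = gx - x                       # residual of the plain fixed-point step
        if np.linalg.norm(f) < tol:
            return x
        if len(xs) > 1:
            dX = np.column_stack([xs[i + 1] - xs[i] for i in range(len(xs) - 1)])
            dG = np.column_stack([gs[i + 1] - gs[i] for i in range(len(xs) - 1)])
            dF = dG - dX                 # differences of residuals over the window
            gamma, *_ = np.linalg.lstsq(dF, f, rcond=None)
            x_new = (1 - beta) * (x - dX @ gamma) + beta * (gx - dG @ gamma)
        else:
            x_new = (1 - beta) * x + beta * gx   # plain (damped) Picard step
        xs.append(x_new)
        gs.append(g(x_new))
        if len(xs) > m + 1:              # keep only m differences in the window
            xs.pop(0)
            gs.pop(0)
    return xs[-1]

# Toy usage: accelerate the contraction g(x) = cos(x); fixed point ~0.739085.
print(anderson_fixed_point(np.cos, np.array([1.0]), m=3))
```

With m = 1 this reduces to a plain damped Picard iteration; the small least-squares solve over the residual-difference window is the extra per-iteration cost that the abstract trades against a reduced number of iterations.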
Related papers
- Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment.
We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
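As a rough illustration of the cost structure (not the paper's method), here is a NumPy sketch of hard top-k key/value selection for a single query; SPARSEK itself uses a learned scoring network and a differentiable top-k mask operator, both of which are replaced by stand-ins here.

```python
import numpy as np

def topk_sparse_attention(q, K, V, scores, k=4):
    """Toy hard top-k attention for a single query.

    SPARSEK uses a learned scoring network and a *differentiable* top-k mask;
    here the per-key `scores` are given and the selection is a plain
    argpartition, so this only illustrates attending over k KV pairs
    instead of all of them.
    """
    idx = np.argpartition(scores, -k)[-k:]   # indices of the k highest-scoring keys
    logits = K[idx] @ q / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[idx]                        # attention output over the sparse set

# Usage with random data: 128 keys, head dimension 16, keep only 4 KV pairs.
rng = np.random.default_rng(0)
K, V = rng.normal(size=(128, 16)), rng.normal(size=(128, 16))
q = rng.normal(size=16)
scores = K @ q                               # stand-in for the scoring network
print(topk_sparse_attention(q, K, V, scores, k=4).shape)
```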
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
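For orientation, a minimal sketch of an AdaGrad-style server update applied to averaged client updates is shown below; the over-the-air aggregation channel (noise, fading) that the paper actually analyses is not modelled, and the function and parameter names are illustrative.

```python
import numpy as np

def fedadagrad_round(w, client_deltas, v, lr=0.1, eps=1e-8):
    """One server round of AdaGrad-style aggregation of client model updates.

    Client updates are simply averaged; `v` is the running sum of squared
    aggregated updates (the AdaGrad state) that yields the adaptive,
    per-coordinate step size.
    """
    delta = np.mean(client_deltas, axis=0)          # aggregate client updates
    v = v + delta ** 2                              # accumulate squared coordinates
    w = w + lr * delta / (np.sqrt(v) + eps)         # adaptive server step
    return w, v

# Toy usage: 3 clients, 5 parameters, 10 communication rounds.
rng = np.random.default_rng(1)
w, v = np.zeros(5), np.zeros(5)
for _ in range(10):
    deltas = [0.1 * rng.normal(size=5) for _ in range(3)]
    w, v = fedadagrad_round(w, deltas, v)
```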
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Ordering for Non-Replacement SGD [7.11967773739707]
We seek to find an ordering that can improve the convergence rates for the non-replacement form of the algorithm.
We develop optimal orderings for constant and decreasing step sizes for strongly convex and convex functions.
In addition, we are able to combine the ordering with mini-batch and further apply it to more complex neural networks.
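A minimal sketch of the non-replacement (ordered) SGD loop the paper studies is given below; the specific optimal orderings it derives are not reproduced, and the ordering is simply an input.

```python
import numpy as np

def nonreplacement_sgd_epoch(grads, w, order, lr=0.05):
    """One epoch of non-replacement SGD: every example is visited exactly once,
    in the given `order`.  The paper derives which orderings are best for
    (strongly) convex objectives; here the ordering is just an argument."""
    for i in order:
        w = w - lr * grads[i](w)      # per-example gradient step
    return w

# Toy usage: least squares on 4 examples, a few epochs in natural order.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
b = np.array([1.0, 2.0, 3.0, -1.0])
grads = [lambda w, a=a, y=y: 2.0 * a * (a @ w - y) for a, y in zip(A, b)]
w = np.zeros(2)
for _ in range(50):
    w = nonreplacement_sgd_epoch(grads, w, order=range(len(grads)))
print(w)   # approaches the least-squares solution [1, 2]
```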
arXiv Detail & Related papers (2023-06-28T00:46:58Z) - Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
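The underlying lookup can be pictured as a nearest-neighbour query in the learned embedding space; the sketch below uses hand-made feature vectors and a made-up parameter set purely to illustrate that similarity-based transfer.

```python
import numpy as np

def transfer_tuning(query_feats, known_feats, known_params):
    """Nearest-neighbour transfer of tuning parameters in an embedding space.

    The paper learns the embedding from performance data; here `known_feats`
    is a hand-made feature matrix and `known_params` the tuned configurations
    of previously seen codes, so only the similarity-based lookup is shown.
    """
    q = query_feats / np.linalg.norm(query_feats)
    X = known_feats / np.linalg.norm(known_feats, axis=1, keepdims=True)
    best = int(np.argmax(X @ q))        # most similar previously tuned code
    return known_params[best]

# Toy usage: three previously tuned kernels, each with a (tile, unroll) choice.
known_feats = np.array([[1.0, 0.1, 0.0],
                        [0.2, 1.0, 0.3],
                        [0.0, 0.2, 1.0]])
known_params = [{"tile": 32, "unroll": 4},
                {"tile": 64, "unroll": 2},
                {"tile": 16, "unroll": 8}]
print(transfer_tuning(np.array([0.9, 0.2, 0.1]), known_feats, known_params))
```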
arXiv Detail & Related papers (2023-03-14T15:51:35Z) - Accelerating Real-Time Coupled Cluster Methods with Single-Precision Arithmetic and Adaptive Numerical Integration [3.469636229370366]
We show that single-precision arithmetic reduces both the storage and multiplicative costs of the real-time simulation by approximately a factor of two.
Additional speedups of up to a factor of 14 in test simulations of water clusters are obtained via a straightforward GPU-based implementation.
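The storage half of that claim is easy to picture: holding the same tensor in float32 instead of float64 halves its footprint, as in the toy comparison below (nothing coupled-cluster specific is modelled).

```python
import numpy as np

# Storage comparison only: the amplitudes of a real coupled-cluster simulation
# are not modelled, just the effect of storing the same tensor in single precision.
rng = np.random.default_rng(2)
t_double = rng.normal(size=(40, 40, 40, 40))           # float64 tensor
t_single = t_double.astype(np.float32)                 # same data, single precision

print(t_double.nbytes / 1e6, "MB in double precision")
print(t_single.nbytes / 1e6, "MB in single precision") # half the storage
print("max cast error:", np.abs(t_double - t_single).max())
```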
arXiv Detail & Related papers (2022-05-10T21:21:49Z) - Accelerated Componentwise Gradient Boosting using Efficient Data Representation and Momentum-based Optimization [1.3159777131162964]
Componentwise boosting (CWB) builds on additive models as base learners to ensure interpretability.
One downside of CWB is its computational complexity in terms of memory and runtime.
We propose two techniques to overcome these issues without losing the properties of CWB.
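For context, a minimal sketch of the baseline componentwise (L2) boosting loop that such work accelerates is shown below, with one-feature linear base learners; the paper's efficient data representation and momentum-based optimization are not reproduced.

```python
import numpy as np

def componentwise_boosting(X, y, n_iter=200, lr=0.1):
    """Baseline componentwise (L2) boosting with one-feature linear base learners.

    Each iteration fits every single-feature least-squares learner to the
    current residual and adds only the best one, which keeps the model
    interpretable -- and is also what makes plain CWB slow on large data.
    """
    n, p = X.shape
    coef = np.zeros(p)
    intercept = y.mean()
    resid = y - intercept
    col_norm2 = np.einsum("ij,ij->j", X, X)            # <x_j, x_j> per feature
    for _ in range(n_iter):
        betas = X.T @ resid / col_norm2                # per-feature least-squares fit
        sse = ((resid[:, None] - X * betas) ** 2).sum(axis=0)
        j = int(np.argmin(sse))                        # best-fitting component
        coef[j] += lr * betas[j]
        resid -= lr * betas[j] * X[:, j]
    return intercept, coef

# Toy usage: two informative features out of ten.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 2] - 1.5 * X[:, 7] + 0.1 * rng.normal(size=200)
print(componentwise_boosting(X, y)[1].round(2))
```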
arXiv Detail & Related papers (2021-10-07T14:49:52Z) - Nesterov Accelerated ADMM for Fast Diffeomorphic Image Registration [63.15453821022452]
Recent deep learning-based approaches have achieved sub-second runtimes for diffeomorphic image registration (DiffIR).
We propose a simple iterative scheme that functionally composes intermediate non-stationary velocity fields.
We then propose a convex optimisation model that uses a regularisation term of arbitrary order to impose smoothness on these velocity fields.
arXiv Detail & Related papers (2021-09-26T19:56:45Z) - Fast and Robust Iterative Closest Point [32.42799285301607]
Iterative Closest Point (ICP) is a fundamental technique for rigid registration between two point sets.
Recent work such as Sparse ICP achieves robustness via sparsity optimization at the cost of computational speed.
We show that the classical point-to-point ICP can be treated as a majorization-minimization (MM) algorithm, and propose an Anderson acceleration approach to speed up its convergence.
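Below is a sketch of the point-to-point ICP map (brute-force correspondences plus the closed-form Kabsch/SVD rigid update) viewed as a fixed-point iteration; Anderson acceleration, as in the sketch after the abstract above, would be wrapped around repeated calls to this map. The correspondence search and setup are deliberately simplified.

```python
import numpy as np

def icp_step(R, t, src, dst):
    """One point-to-point ICP iteration: the fixed-point map being accelerated.

    Brute-force nearest neighbours plus the closed-form (Kabsch/SVD) rigid
    update; Anderson acceleration would operate on repeated applications of
    this map rather than on anything inside it.
    """
    moved = src @ R.T + t
    d2 = ((moved[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
    matched = dst[d2.argmin(axis=1)]                  # closest target point per source point
    mu_s, mu_m = src.mean(axis=0), matched.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - mu_s).T @ (matched - mu_m))
    R_new = Vt.T @ U.T
    if np.linalg.det(R_new) < 0:                      # guard against reflections
        Vt[-1] *= -1
        R_new = Vt.T @ U.T
    return R_new, mu_m - mu_s @ R_new.T

# Toy usage: the target is the source shifted by +0.2 in every coordinate.
rng = np.random.default_rng(4)
dst = rng.normal(size=(100, 3))
src = dst - 0.2
R, t = np.eye(3), np.zeros(3)
for _ in range(10):
    R, t = icp_step(R, t, src, dst)
print(np.round(t, 3))    # should approach [0.2, 0.2, 0.2]
```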
arXiv Detail & Related papers (2020-07-15T11:32:53Z) - Efficient Learning of Generative Models via Finite-Difference Score Matching [111.55998083406134]
We present a generic strategy to efficiently approximate any-order directional derivative with finite difference.
Our approximation only involves function evaluations, which can be executed in parallel, and no gradient computations.
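The core ingredient can be sketched in a few lines: a central finite difference approximates a directional derivative using two function evaluations and no gradients; the paper's any-order generalisation and its use inside score-matching objectives are not reproduced here.

```python
import numpy as np

def directional_derivative_fd(f, x, v, eps=1e-4):
    """Central-difference estimate of the directional derivative v . grad f(x).

    Only two function evaluations and no gradient computation; the two
    evaluations are independent and could be executed in parallel.
    """
    return (f(x + eps * v) - f(x - eps * v)) / (2.0 * eps)

# Toy check against the exact value for f(x) = ||x||^2, where grad f = 2x.
x = np.array([1.0, -2.0, 0.5])
v = np.array([0.3, 0.1, -0.2])
print(directional_derivative_fd(lambda z: (z ** 2).sum(), x, v), 2.0 * v @ x)
```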
arXiv Detail & Related papers (2020-07-07T10:05:01Z) - Differentiable Adaptive Computation Time for Visual Reasoning [4.7518908453572]
This paper presents DACT, a novel attention-based algorithm for achieving adaptive computation.
In particular, we study its application to the widely known MAC architecture.
We show that by increasing the maximum number of steps used, we surpass the accuracy of even our best non-adaptive MAC in the CLEVR dataset.
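As a rough picture of adaptive computation (ACT-style, not DACT's exact formulation on the MAC network), the sketch below mixes per-step outputs with halting probabilities so that the expected number of reasoning steps adapts to the input; all names and numbers are illustrative.

```python
import numpy as np

def adaptive_steps_output(step_outputs, halt_probs):
    """ACT-style mixture of per-step outputs weighted by halting probabilities.

    `step_outputs[t]` is the answer after reasoning step t and `halt_probs[t]`
    the probability of stopping there; the result is a weighted mixture, so the
    expected number of executed steps adapts to the input.
    """
    p_continue = 1.0
    out = np.zeros_like(step_outputs[0])
    expected_steps = 0.0
    for t, (o, h) in enumerate(zip(step_outputs, halt_probs), start=1):
        w = p_continue * h                  # probability of halting exactly at step t
        out += w * o
        expected_steps += w * t
        p_continue *= (1.0 - h)
    out += p_continue * step_outputs[-1]    # leftover mass goes to the last step
    expected_steps += p_continue * len(step_outputs)
    return out, expected_steps

# Toy usage: four steps, the model is already confident by step 2.
outs = [np.array([0.2, 0.8]), np.array([0.1, 0.9]),
        np.array([0.1, 0.9]), np.array([0.1, 0.9])]
print(adaptive_steps_output(outs, halt_probs=[0.1, 0.7, 0.9, 1.0]))
```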
arXiv Detail & Related papers (2020-04-27T13:20:23Z)