Towards Learning High-Precision Least Squares Algorithms with Sequence Models
- URL: http://arxiv.org/abs/2503.12295v1
- Date: Sat, 15 Mar 2025 23:25:11 GMT
- Title: Towards Learning High-Precision Least Squares Algorithms with Sequence Models
- Authors: Jerry Liu, Jessica Grogan, Owen Dugan, Ashish Rao, Simran Arora, Atri Rudra, Christopher RĂ©,
- Abstract summary: We show that softmax Transformers struggle to perform high-precision multiplications.<n>We identify limitations present in existing architectures and training procedures.<n>For the first time, we demonstrate the ability to train to near machine precision.
- Score: 42.217390215093516
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates whether sequence models can learn to perform numerical algorithms, e.g. gradient descent, on the fundamental problem of least squares. Our goal is to inherit two properties of standard algorithms from numerical analysis: (1) machine precision, i.e. we want to obtain solutions that are accurate to near floating point error, and (2) numerical generality, i.e. we want them to apply broadly across problem instances. We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain 100,000x lower MSE than standard Transformers trained end-to-end and they incur a 10,000x smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares.
Related papers
- Learning Algorithm Hyperparameters for Fast Parametric Convex Optimization [2.0403774954994858]
We introduce a machine-learning framework to learn the hyper Parameters sequence of first-order methods.
We show how to learn the hyper Parameters for several popular algorithms.
We showcase the effectiveness of our method with many examples, including ones from control, signal processing, and machine learning.
arXiv Detail & Related papers (2024-11-24T04:58:36Z) - Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems.
Such problems are encountered in medicine, physics, and machine learning.
We provide convergence guarantees for the proposed algorithm under both types of noise.
arXiv Detail & Related papers (2024-11-21T10:26:17Z) - Learning the Positions in CountSketch [49.57951567374372]
We consider sketching algorithms which first compress data by multiplication with a random sketch matrix, and then apply the sketch to quickly solve an optimization problem.
In this work, we propose the first learning-based algorithms that also optimize the locations of the non-zero entries.
arXiv Detail & Related papers (2023-06-11T07:28:35Z) - Fast Projected Newton-like Method for Precision Matrix Estimation under
Total Positivity [15.023842222803058]
Current algorithms are designed using the block coordinate descent method or the proximal point algorithm.
We propose a novel algorithm based on the two-metric projection method, incorporating a carefully designed search direction and variable partitioning scheme.
Experimental results on synthetic and real-world datasets demonstrate that our proposed algorithm provides a significant improvement in computational efficiency compared to the state-of-the-art methods.
arXiv Detail & Related papers (2021-12-03T14:39:10Z) - SHINE: SHaring the INverse Estimate from the forward pass for bi-level
optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - Effective Dimension Adaptive Sketching Methods for Faster Regularized
Least-Squares Optimization [56.05635751529922]
We propose a new randomized algorithm for solving L2-regularized least-squares problems based on sketching.
We consider two of the most popular random embeddings, namely, Gaussian embeddings and the Subsampled Randomized Hadamard Transform (SRHT)
arXiv Detail & Related papers (2020-06-10T15:00:09Z) - Linear time dynamic programming for the exact path of optimal models
selected from a finite set [0.38073142980732994]
We show that a dynamic programming algorithm can be used to compute the breakpoints in linear time.
Empirical results on changepoint detection problems demonstrate the improved accuracy and speed.
arXiv Detail & Related papers (2020-03-05T18:16:58Z) - Optimal Randomized First-Order Methods for Least-Squares Problems [56.05635751529922]
This class of algorithms encompasses several randomized methods among the fastest solvers for least-squares problems.
We focus on two classical embeddings, namely, Gaussian projections and subsampled Hadamard transforms.
Our resulting algorithm yields the best complexity known for solving least-squares problems with no condition number dependence.
arXiv Detail & Related papers (2020-02-21T17:45:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.