Stochastic Rounding Implicitly Regularizes Tall-and-Thin Matrices
- URL: http://arxiv.org/abs/2403.12278v3
- Date: Fri, 06 Dec 2024 19:10:07 GMT
- Title: Stochastic Rounding Implicitly Regularizes Tall-and-Thin Matrices
- Authors: Gregory Dexter, Christos Boutsikas, Linkai Ma, Ilse C. F. Ipsen, Petros Drineas,
- Abstract summary: We consider nearness rounding of real $mathbfA$ with many more rows than columns.<n>With high probability, the smallest singular value of a rounded matrix is well bounded away from zero.<n>We leverage powerful results in random matrix theory, and the idea that rounding errors do not concentrate in low-dimensional column spaces.
- Score: 5.324425600601921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by the popularity of stochastic rounding in the context of machine learning and the training of large-scale deep neural network models, we consider stochastic nearness rounding of real matrices $\mathbf{A}$ with many more rows than columns. We provide novel theoretical evidence, supported by extensive experimental evaluation that, with high probability, the smallest singular value of a stochastically rounded matrix is well bounded away from zero -- regardless of how close $\mathbf{A}$ is to being rank deficient and even if $\mathbf{A}$ is rank-deficient. In other words, stochastic rounding \textit{implicitly regularizes} tall and skinny matrices $\mathbf{A}$ so that the rounded version has full column rank. Our proofs leverage powerful results in random matrix theory, and the idea that stochastic rounding errors do not concentrate in low-dimensional column spaces.
Related papers
- Asymptotic behavior of eigenvalues of large rank perturbations of large random matrices [0.0]
The paper is concerned with deformed Wigner random matrices.<n>The spectrum of such matrices plays a key role in rigorous underpinning of the novel pruning technique.<n>In this paper we develop analysis for the case of growing rank.
arXiv Detail & Related papers (2025-07-16T12:29:23Z) - Provably learning a multi-head attention layer [55.2904547651831]
Multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Matrix Compression via Randomized Low Rank and Low Precision
Factorization [47.902465710511485]
Modern matrices can involve billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage.
We propose an algorithm that exploits this structure to obtain a low rank decomposition of any matrix $mathbfA$ as $mathbfLmathbfR$.
We empirically demonstrate the efficacy of our algorithm in image compression, nearest neighbor classification of image and text embeddings, and compressing the layers of LlaMa-$7$b.
arXiv Detail & Related papers (2023-10-17T06:56:57Z) - One-sided Matrix Completion from Two Observations Per Row [95.87811229292056]
We propose a natural algorithm that involves imputing the missing values of the matrix $XTX$.
We evaluate our algorithm on one-sided recovery of synthetic data and low-coverage genome sequencing.
arXiv Detail & Related papers (2023-06-06T22:35:16Z) - On the well-spread property and its relation to linear regression [4.619541348328937]
We show that consistent recovery of the parameter vector in a robust linear regression model is information-theoretically impossible.
We show that it is possible to efficiently certify whether a given $n$-by-$d$ matrix is well-spread if the number of observations is quadratic in the ambient dimension.
arXiv Detail & Related papers (2022-06-16T11:17:44Z) - Eigenvalue Distribution of Large Random Matrices Arising in Deep Neural
Networks: Orthogonal Case [1.6244541005112747]
The paper deals with the distribution of singular values of the input-output Jacobian of deep untrained neural networks in the limit of their infinite width.
It was claimed that in these cases the singular value distribution of the Jacobian in the limit of infinite width coincides with that of the analog of the Jacobian with special random but weight independent diagonal matrices.
arXiv Detail & Related papers (2022-01-12T16:33:47Z) - Test Set Sizing Via Random Matrix Theory [91.3755431537592]
This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression.
It defines "ideal" as satisfying the integrity metric, i.e. the empirical model error is the actual measurement noise.
This paper is the first to solve for the training and test size for any model in a way that is truly optimal.
arXiv Detail & Related papers (2021-12-11T13:18:33Z) - Sparse Factorization of Large Square Matrices [10.94053598642913]
In this paper, we propose to approximate a large square matrix with a product of sparse full-rank matrices.
In the approximation, our method needs only $N(log N)2$ non-zero numbers for an $Ntimes N$ full matrix.
We show that our method gives a better approximation when the approximated matrix is sparse and high-rank.
arXiv Detail & Related papers (2021-09-16T18:42:21Z) - Spectral properties of sample covariance matrices arising from random
matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $texttr(AR(z))$, for $R(z) = (frac1nXXT- zI_p)-1$ and $Ain mathcal M_p$ deterministic, have a standard deviation of order $O(|A|_* / sqrt n)$.
Here, we show that $|mathbb E[R(z)] - tilde R(z)|_F
arXiv Detail & Related papers (2021-09-06T14:21:43Z) - Unique sparse decomposition of low rank matrices [17.037882881652617]
We find a unique decomposition of a low rank matrixYin mathbbRrtimes n$.
We prove that up to some $Yin mathRrtimes n$ is a sparse-wise decomposition of $Xin mathbbRrtimes n$.
arXiv Detail & Related papers (2021-06-14T20:05:59Z) - Algebraic and geometric structures inside the Birkhoff polytope [0.0]
Birkhoff polytope $mathcalB_d$ consists of all bistochastic matrices of order $d$.
We prove that $mathcalL_d$ and $mathcalF_d$ are star-shaped with respect to the flat matrix.
arXiv Detail & Related papers (2021-01-27T09:51:24Z) - Linear-Sample Learning of Low-Rank Distributions [56.59844655107251]
We show that learning $ktimes k$, rank-$r$, matrices to normalized $L_1$ distance requires $Omega(frackrepsilon2)$ samples.
We propose an algorithm that uses $cal O(frackrepsilon2log2fracepsilon)$ samples, a number linear in the high dimension, and nearly linear in the matrices, typically low, rank proofs.
arXiv Detail & Related papers (2020-09-30T19:10:32Z) - Optimal Exact Matrix Completion Under new Parametrization [0.0]
We study the problem of exact completion for $m times n$ sized matrix of rank $r$ with the adaptive sampling method.
We propose matrix completion algorithms that exactly recovers the target matrix.
arXiv Detail & Related papers (2020-02-06T18:31:47Z) - Optimal Iterative Sketching with the Subsampled Randomized Hadamard
Transform [64.90148466525754]
We study the performance of iterative sketching for least-squares problems.
We show that the convergence rate for Haar and randomized Hadamard matrices are identical, andally improve upon random projections.
These techniques may be applied to other algorithms that employ randomized dimension reduction.
arXiv Detail & Related papers (2020-02-03T16:17:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.