Improved Generalization Bound and Learning of Sparsity Patterns for
Data-Driven Low-Rank Approximation
- URL: http://arxiv.org/abs/2209.08281v1
- Date: Sat, 17 Sep 2022 08:26:27 GMT
- Title: Improved Generalization Bound and Learning of Sparsity Patterns for
Data-Driven Low-Rank Approximation
- Authors: Shinsaku Sakaue, Taihei Oki
- Abstract summary: We present a better $\tilde{\mathrm{O}}(nsk)$ bound for rank-$k$ approximation.
We also prove that learning positions of non-zeros increases the fat shattering dimension only by $\mathrm{O}(ns\log n)$.
- Score: 15.191184049312467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning sketching matrices for fast and accurate low-rank approximation
(LRA) has gained increasing attention. Recently, Bartlett, Indyk, and Wagner
(COLT 2022) presented a generalization bound for the learning-based LRA.
Specifically, for rank-$k$ approximation using an $m \times n$ learned
sketching matrix with $s$ non-zeros in each column, they proved an
$\tilde{\mathrm{O}}(nsm)$ bound on the \emph{fat shattering dimension}
($\tilde{\mathrm{O}}$ hides logarithmic factors). We build on their work and
make two contributions.
1. We present a better $\tilde{\mathrm{O}}(nsk)$ bound ($k \le m$). En route
to obtaining the bound, we give a low-complexity \emph{Goldberg--Jerrum
algorithm} for computing pseudo-inverse matrices, which would be of independent
interest.
2. We alleviate an assumption of the previous study that the sparsity pattern
of sketching matrices is fixed. We prove that learning positions of non-zeros
increases the fat shattering dimension only by ${\mathrm{O}}(ns\log n)$. Also,
experiments confirm the practical benefit of learning sparsity patterns.
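The setting above can be made concrete with a small, self-contained sketch. The NumPy code below is an illustration under our own simplifying assumptions, not the authors' algorithm: it applies the standard sketch-and-solve pipeline (project $A$ onto the row space of $SA$, then truncate to rank $k$) with an $m \times n$ sketching matrix that has $s$ non-zeros per column, and it uses a naive greedy search over the non-zero positions as a stand-in for learning the sparsity pattern. All helper names (`make_sparse_sketch`, `sketch_lra_error`, `greedy_positions`) are hypothetical.

```python
# Toy illustration of sketch-based rank-k LRA with a sparse sketching matrix.
# This is a minimal sketch under our own assumptions, not the paper's method.
import numpy as np


def make_sparse_sketch(n, m, positions, values):
    """Build an m x n sketching matrix with s non-zeros per column.

    positions[j] lists the s rows that are non-zero in column j,
    and values[j] holds the corresponding s entries.
    """
    S = np.zeros((m, n))
    for j in range(n):
        S[positions[j], j] = values[j]
    return S


def sketch_lra_error(A, S, k):
    """Frobenius error of the rank-k approximation computed through the sketch S.

    Sketch-and-solve: project A (n x d) onto the row space of SA (m x d),
    then keep the best rank-k part of the projection.
    """
    SA = S @ A                                   # m x d sketch of A
    Q = np.linalg.qr(SA.T)[0]                    # d x m orthonormal basis of rowspace(SA)
    P = A @ Q                                    # project A onto that subspace (n x m)
    U, sig, Vt = np.linalg.svd(P, full_matrices=False)
    P_k = (U[:, :k] * sig[:k]) @ Vt[:k]          # best rank-k part of the projection
    return np.linalg.norm(A - P_k @ Q.T, ord="fro")


def greedy_positions(train, n, m, s, k, rng):
    """Toy 'learning' of non-zero positions: for each column, try a few random
    position sets on the training matrices and keep the best one.
    Values stay random signs, as in CountSketch."""
    values = [rng.choice([-1.0, 1.0], size=s) for _ in range(n)]
    positions = [rng.choice(m, size=s, replace=False) for _ in range(n)]

    def avg_err(pos):
        S = make_sparse_sketch(n, m, pos, values)
        return np.mean([sketch_lra_error(A, S, k) for A in train])

    best = avg_err(positions)
    for j in range(n):                           # one coordinate-wise greedy pass
        for _ in range(5):
            cand = positions.copy()
            cand[j] = rng.choice(m, size=s, replace=False)
            err = avg_err(cand)
            if err < best:
                positions, best = cand, err
    return positions, values


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, m, s, k = 40, 60, 10, 2, 5
    # Training matrices drawn from a shared low-rank-plus-noise distribution.
    B = rng.standard_normal((n, k))
    train = [B @ rng.standard_normal((k, d)) + 0.1 * rng.standard_normal((n, d))
             for _ in range(5)]

    fixed_pos = [rng.choice(m, size=s, replace=False) for _ in range(n)]
    fixed_val = [rng.choice([-1.0, 1.0], size=s) for _ in range(n)]
    S_fixed = make_sparse_sketch(n, m, fixed_pos, fixed_val)

    learned_pos, learned_val = greedy_positions(train, n, m, s, k, rng)
    S_learned = make_sparse_sketch(n, m, learned_pos, learned_val)

    A_test = B @ rng.standard_normal((k, d)) + 0.1 * rng.standard_normal((n, d))
    print("fixed pattern error:  ", sketch_lra_error(A_test, S_fixed, k))
    print("learned pattern error:", sketch_lra_error(A_test, S_learned, k))
```

In this toy setup, the greedily chosen pattern tends to give a smaller test error than a fixed random pattern on matrices from the same distribution, loosely mirroring the practical benefit of learning sparsity patterns reported in the abstract; the paper's actual learning procedure and guarantees are as stated above, not this greedy stand-in.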
Related papers
- The Communication Complexity of Approximating Matrix Rank [50.6867896228563]
We show that this problem has randomized communication complexity $\Omega(\frac{1}{k}\cdot n^2\log|\mathbb{F}|)$.
As an application, we obtain an $\Omega(\frac{1}{k}\cdot n^2\log|\mathbb{F}|)$ space lower bound for any streaming algorithm with $k$ passes.
arXiv Detail & Related papers (2024-10-26T06:21:42Z) - Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Provably learning a multi-head attention layer [55.2904547651831]
Multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Bottleneck Structure in Learned Features: Low-Dimension vs Regularity Tradeoff [12.351756386062291]
We formalize a balance between learning low-dimensional representations and minimizing complexity/irregularity in the feature maps.
For large depths, almost all hidden representations are approximately $R^{(0)}(f)$-dimensional, and almost all weight matrices $W_\ell$ have $R^{(0)}(f)$ singular values close to 1.
Interestingly, the use of large learning rates is required to guarantee an order $O(L)$ NTK, which in turn guarantees infinite-depth convergence of the representations of almost all layers.
arXiv Detail & Related papers (2023-05-30T13:06:26Z) - Randomized and Deterministic Attention Sparsification Algorithms for
Over-parameterized Feature Dimension [18.57735939471469]
We consider the sparsification of the attention problem.
For any super large feature dimension, we can reduce it down to the size nearly linear in length of sentence.
arXiv Detail & Related papers (2023-04-10T05:52:38Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Learning a Latent Simplex in Input-Sparsity Time [58.30321592603066]
We consider the problem of learning a latent $k$-vertex simplex $K\subset\mathbb{R}^{d\times n}$, given access to $A\in\mathbb{R}^{d\times n}$.
We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $A$.
arXiv Detail & Related papers (2021-05-17T16:40:48Z) - $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}(H)}{\Delta_{\min}}\log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z)