Solving Tensor Low Cycle Rank Approximation
- URL: http://arxiv.org/abs/2304.06594v1
- Date: Thu, 13 Apr 2023 15:00:50 GMT
- Title: Solving Tensor Low Cycle Rank Approximation
- Authors: Yichuan Deng, Yeqi Gao, Zhao Song
- Abstract summary: We formulate a particular tensor low rank approximation problem, which we call tensor cycle rank.
For the classical tensor rank, Tucker rank, and train rank, this has been well studied in [Song, Woodruff, Zhong SODA 2019].
In this paper, we generalize the ``rotation and sketch'' technique on page 186 of [Song, Woodruff, Zhong SODA 2019] and show an input sparsity time algorithm for cycle rank.
- Score: 15.090593955414137
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models have become ubiquitous in modern life, finding
applications in various domains such as natural language processing, language
translation, and speech recognition. Recently, a breakthrough work [Zhao,
Panigrahi, Ge, and Arora Arxiv 2023] explains the attention model from the
perspective of probabilistic context-free grammar (PCFG). One of the central
computational tasks in computing probabilities in a PCFG is a particular tensor
low rank approximation problem, which we call tensor cycle rank. Given an
$n \times n \times n$ third order tensor $A$, we say that $A$ has cycle
rank-$k$ if there exist three $n \times k^2$ size matrices $U, V$, and $W$
such that each entry satisfies
\begin{align*}
A_{a,b,c} = \sum_{i=1}^k \sum_{j=1}^k \sum_{l=1}^k U_{a,i+k(j-1)} \otimes V_{b, j + k(l-1)} \otimes W_{c, l + k(i-1)}
\end{align*}
for all $a \in [n], b \in [n], c \in [n]$. For the classical tensor rank,
Tucker rank, and train rank, this has been well studied in [Song, Woodruff,
Zhong SODA 2019]. In this paper, we generalize the ``rotation and sketch''
technique on page 186 of [Song, Woodruff, Zhong SODA 2019] and show an input
sparsity time algorithm for cycle rank.
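As a concrete illustration of the definition above (not part of the paper), the following NumPy sketch assembles an $n \times n \times n$ tensor of cycle rank at most $k$ from factor matrices $U, V, W$ of size $n \times k^2$. Entry-wise, the $\otimes$ in the formula is ordinary scalar multiplication, and the one-based column index $i + k(j-1)$ becomes $i + kj$ under zero-based indexing; the function name and the random-factor usage are illustrative assumptions.

```python
import numpy as np

def cycle_rank_tensor(U, V, W, k):
    """Assemble an n x n x n tensor of cycle rank at most k from factor
    matrices U, V, W of shape (n, k^2), following (zero-based indices)
    A[a, b, c] = sum_{i,j,l} U[a, i + k*j] * V[b, j + k*l] * W[c, l + k*i]."""
    n = U.shape[0]
    A = np.zeros((n, n, n))
    for i in range(k):
        for j in range(k):
            for l in range(k):
                # Each (i, j, l) term is a rank-1 tensor: the outer product of
                # one column from each factor, with cyclically shared indices.
                A += np.einsum('a,b,c->abc',
                               U[:, i + k * j],
                               V[:, j + k * l],
                               W[:, l + k * i])
    return A

# Usage: random factors give a tensor of cycle rank at most k.
n, k = 8, 2
rng = np.random.default_rng(0)
U, V, W = (rng.standard_normal((n, k * k)) for _ in range(3))
A = cycle_rank_tensor(U, V, W, k)
print(A.shape)  # (8, 8, 8)
```

Since the sum has $k^3$ rank-1 terms, a cycle rank-$k$ tensor has classical rank at most $k^3$; the point of the definition is the cyclic index sharing among $U$, $V$, and $W$.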
Related papers
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z)
- Sample-Efficient Linear Regression with Self-Selection Bias [7.605563562103568]
We consider the problem of linear regression with self-selection bias in the unknown-index setting.
We provide a novel and near optimally sample-efficient (in terms of $k$) algorithm to recover $\mathbf{w}_1, \ldots, \mathbf{w}_k$.
Our algorithm succeeds under significantly relaxed noise assumptions, and therefore also succeeds in the related setting of max-linear regression.
arXiv Detail & Related papers (2024-02-22T02:20:24Z)
- Structured Semidefinite Programming for Recovering Structured Preconditioners [41.28701750733703]
We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $\epsilon$-optimal diagonal preconditioner in time.
We attain our results via new algorithms for a class of semidefinite programs we call matrix-dictionary approximation SDPs.
arXiv Detail & Related papers (2023-10-27T16:54:29Z)
- Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^2}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ time.
arXiv Detail & Related papers (2023-08-18T17:52:12Z)
- Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension [18.57735939471469]
We consider the sparsification of the attention problem.
For any super-large feature dimension, we can reduce it down to a size nearly linear in the length of the sentence.
arXiv Detail & Related papers (2023-04-10T05:52:38Z)
- An Over-parameterized Exponential Regression [18.57735939471469]
Recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions.
We define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \to \mathbb{R}$ using an exponential activation function.
arXiv Detail & Related papers (2023-03-29T07:29:07Z)
- Average-Case Complexity of Tensor Decomposition for Low-Degree Polynomials [93.59919600451487]
"Statistical-computational gaps" occur in many statistical inference tasks.
We consider a model for random order-3 tensor decomposition where one component is slightly larger in norm than the rest.
We show that low-degree polynomials in the tensor entries can accurately estimate the largest component when $r \ll n^{3/2}$ but fail to do so when $r \gg n^{3/2}$.
arXiv Detail & Related papers (2022-11-10T00:40:37Z)
- Near-Linear Time and Fixed-Parameter Tractable Algorithms for Tensor Decompositions [51.19236668224547]
We study low rank approximation of tensors, focusing on the tensor train and Tucker decompositions.
For tensor train decomposition, we give a bicriteria $(1 + \epsilon)$-approximation algorithm with a small bicriteria rank and $O(q \cdot \mathrm{nnz}(A))$ running time.
In addition, we extend our algorithm to tensor networks with arbitrary graphs.
arXiv Detail & Related papers (2022-07-15T11:55:09Z)
- Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{opt} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z)
- Learning a Latent Simplex in Input-Sparsity Time [58.30321592603066]
We consider the problem of learning a latent $k$-vertex simplex $K \subset \mathbb{R}^{d \times n}$, given access to $A \in \mathbb{R}^{d \times n}$.
We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $A$.
arXiv Detail & Related papers (2021-05-17T16:40:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.