Solving Regularized Exp, Cosh and Sinh Regression Problems
- URL: http://arxiv.org/abs/2303.15725v2
- Date: Thu, 11 May 2023 00:22:34 GMT
- Title: Solving Regularized Exp, Cosh and Sinh Regression Problems
- Authors: Zhihang Li, Zhao Song, Tianyi Zhou
- Abstract summary: Attention computation is a fundamental task for large language models such as Transformer, GPT-4 and ChatGPT.
The straightforward method is to use the naive Newton's method.
- Score: 40.47799094316649
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In modern machine learning, attention computation is a fundamental task for
training large language models such as Transformer, GPT-4 and ChatGPT. In this
work, we study the exponential regression problem, which is inspired by the
softmax/exp unit in the attention mechanism of large language models. The
standard exponential regression problem is non-convex. We study a regularized
version of the exponential regression problem, which is convex, and use an
approximate Newton method to solve it in input sparsity time.
Formally, in this problem, one is given a matrix $A \in \mathbb{R}^{n \times
d}$, vectors $b \in \mathbb{R}^n$ and $w \in \mathbb{R}^n$, and one of the functions $\exp,
\cosh$ and $\sinh$, denoted $f$. The goal is to find the optimal $x$ that
minimizes $ 0.5 \| f(Ax) - b \|_2^2 + 0.5 \| \mathrm{diag}(w) A x \|_2^2$. The
straightforward method is to use the naive Newton's method. Let
$\mathrm{nnz}(A)$ denote the number of non-zero entries in the matrix $A$. Let
$\omega$ denote the exponent of matrix multiplication. Currently, $\omega
\approx 2.373$. Let $\epsilon$ denote the accuracy error. In this paper, we
make use of the input sparsity and propose an algorithm that uses $\log ( \|x_0
- x^*\|_2 / \epsilon)$ iterations and $\widetilde{O}(\mathrm{nnz}(A) +
d^{\omega} )$ time per iteration to solve the problem.
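For concreteness, below is a minimal sketch of the naive Newton baseline described above, for $f \in \{\exp, \cosh, \sinh\}$ and the objective $0.5 \| f(Ax) - b \|_2^2 + 0.5 \| \mathrm{diag}(w) A x \|_2^2$. It forms the exact Hessian $A^\top \mathrm{diag}(D) A$, so each step costs roughly $O(\mathrm{nnz}(A) \cdot d + d^{\omega})$ rather than the paper's $\widetilde{O}(\mathrm{nnz}(A) + d^{\omega})$; the paper's algorithm replaces this exact Hessian with a sparsity-exploiting approximation. The function name, step rule, and stopping tolerance are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Naive Newton baseline (sketch, not the paper's algorithm) for
#     min_x 0.5 * ||f(Ax) - b||_2^2 + 0.5 * ||diag(w) A x||_2^2,
# with f in {exp, cosh, sinh}. The paper's approximate Newton method
# replaces the exact Hessian A^T diag(D) A built below with a sketched
# approximation to reach O~(nnz(A) + d^omega) time per iteration.

_FUNCS = {
    "exp":  (np.exp,  np.exp,  np.exp),   # (f, f', f'')
    "cosh": (np.cosh, np.sinh, np.cosh),
    "sinh": (np.sinh, np.cosh, np.sinh),
}

def newton_reg_regression(A, b, w, f_name="exp", x0=None, tol=1e-8, max_iter=50):
    """Illustrative solver; assumes w is chosen so the objective is convex."""
    f, df, d2f = _FUNCS[f_name]
    _, d = A.shape
    x = np.zeros(d) if x0 is None else np.asarray(x0, dtype=float).copy()
    w2 = w ** 2
    for _ in range(max_iter):
        u = A @ x
        fu = f(u)
        # Gradient: A^T [ (f(u) - b) * f'(u) + w^2 * u ]
        grad = A.T @ ((fu - b) * df(u) + w2 * u)
        # Exact Hessian: A^T diag(D) A with
        #     D = f'(u)^2 + (f(u) - b) * f''(u) + w^2
        # (nonnegative under the convexity assumption on w). The paper
        # approximates this matrix instead of forming it exactly.
        D = df(u) ** 2 + (fu - b) * d2f(u) + w2
        H = A.T @ (D[:, None] * A)
        step = np.linalg.solve(H, grad)
        x -= step
        if np.linalg.norm(step) < tol:  # ~ log(||x0 - x*||_2 / eps) iterations
            break
    return x
```

A hypothetical call is `x_hat = newton_reg_regression(A, b, w, f_name="exp")`. The regularizer $\mathrm{diag}(w) A x$ is what keeps the diagonal weights $D$ nonnegative and the Newton step well defined, mirroring the convexity argument in the abstract.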
Related papers
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that set it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Solving Attention Kernel Regression Problem via Pre-conditioner [9.131385887605935]
We design algorithms for two types of regression problems: $\min_{x \in \mathbb{R}^d} \|(A^\top A)^j x - b\|$ for any positive integer $j$.
The second proxy is applying the exponential entrywise to the Gram matrix, denoted by $\exp(AA^\top)$, and solving the regression $\min_{x \in \mathbb{R}^n} \|\exp(AA^\top) x - b\|$.
arXiv Detail & Related papers (2023-08-28T04:37:38Z) - Fast $(1+\varepsilon)$-Approximation Algorithms for Binary Matrix
Factorization [54.29685789885059]
We introduce efficient $(1+\varepsilon)$-approximation algorithms for the binary matrix factorization (BMF) problem.
The goal is to approximate $\mathbf{A}$ as a product of low-rank factors.
Our techniques generalize to other common variants of the BMF problem.
arXiv Detail & Related papers (2023-06-02T18:55:27Z) - Randomized and Deterministic Attention Sparsification Algorithms for
Over-parameterized Feature Dimension [18.57735939471469]
We consider the sparsification of the attention problem.
For any super large feature dimension, we can reduce it to a size nearly linear in the length of the sentence.
arXiv Detail & Related papers (2023-04-10T05:52:38Z) - An Over-parameterized Exponential Regression [18.57735939471469]
Recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions.
We define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \to \mathbb{R}$ with an exponential activation function.
arXiv Detail & Related papers (2023-03-29T07:29:07Z) - A Nearly-Optimal Bound for Fast Regression with $\ell_\infty$ Guarantee [16.409210914237086]
Given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, we consider the regression problem with $\ell_\infty$ guarantees.
We show that in order to obtain such an $\ell_\infty$ guarantee for $\ell_2$ regression, one has to use sketching matrices that are dense.
We also develop a novel analytical framework for $\ell_\infty$-guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property.
arXiv Detail & Related papers (2023-02-01T05:22:40Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{opt} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector.
We show that this dependence on $d$ is optimal, up to logarithmic factors.
We also provide the first total sensitivity upper bound $O(d^{\max\{1, p/2\}} \log^2 n)$ for loss functions with at most degree $p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.