An Iterative Algorithm for Rescaled Hyperbolic Functions Regression
- URL: http://arxiv.org/abs/2305.00660v1
- Date: Mon, 1 May 2023 05:16:07 GMT
- Title: An Iterative Algorithm for Rescaled Hyperbolic Functions Regression
- Authors: Yeqi Gao, Zhao Song, Junze Yin
- Abstract summary: This paper studies the convergence of exponential regression and softmax regression.
We provide an input sparsity time algorithm for this problem.
Our algorithm framework is very general and can be applied to functions like $\cosh()$ and $\sinh()$ as well.
- Score: 15.090593955414137
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have numerous real-life applications across
various domains, such as natural language translation, sentiment analysis,
language modeling, chatbots and conversational agents, creative writing, text
classification, summarization, and generation. LLMs have shown great promise in
improving the accuracy and efficiency of these tasks, and have the potential to
revolutionize the field of natural language processing (NLP) in the years to
come.
The exponential-function-based attention unit is a fundamental element in LLMs.
Several previous works have studied the convergence of exponential regression
and softmax regression.
The exponential regression [Li, Song, Zhou 2023] and softmax regression
[Deng, Li, Song 2023] can be formulated as follows. Given matrix $A \in
\mathbb{R}^{n \times d}$ and vector $b \in \mathbb{R}^n$, the goal of
exponential regression is to solve \begin{align*} \min_{x} \| \exp(Ax) - b \|_2
\end{align*} and the goal of softmax regression is to solve \begin{align*}
\min_{x} \| \langle \exp(Ax) , {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2 .
\end{align*}
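As a point of reference (not taken from the paper), the two objectives above can be evaluated directly in NumPy; the matrix $A$, target $b$, and candidate $x$ below are toy placeholders rather than data from the paper.

```python
import numpy as np

def exp_regression_loss(A, b, x):
    # || exp(Ax) - b ||_2, the exponential regression objective.
    return np.linalg.norm(np.exp(A @ x) - b)

def softmax_regression_loss(A, b, x):
    # || <exp(Ax), 1_n>^{-1} exp(Ax) - b ||_2, the softmax regression objective.
    u = np.exp(A @ x)
    return np.linalg.norm(u / u.sum() - b)

# Toy instance with n = 5 rows and d = 3 columns.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)
print(exp_regression_loss(A, b, x), softmax_regression_loss(A, b, x))
```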
In this work, we define a formulation slightly different from softmax
regression: \begin{align*} \min_{x \in \mathbb{R}^d } \| u(x) - \langle u(x) ,
{\bf 1}_n \rangle \cdot b \|_2 \end{align*} where $u(x) \in \{ \exp(Ax),
\cosh(Ax) , \sinh(Ax) \}$. We provide an input sparsity time algorithm for this
problem. Our algorithm framework is very general and can be applied to
functions like $\cosh()$ and $\sinh()$ as well. Our technique is also general
enough to be applied to in-context learning for rescaled softmax regression.
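For illustration only, a minimal NumPy sketch of the rescaled objective is given below; compared with softmax regression, the normalization factor $\langle u(x), \mathbf{1}_n \rangle$ multiplies $b$ rather than dividing $u(x)$. The helper names and toy data are assumptions, and the sketch only evaluates the loss; it does not implement the paper's input sparsity time algorithm.

```python
import numpy as np

# The three choices of u(.) considered in the paper: exp, cosh, sinh.
UNITS = {"exp": np.exp, "cosh": np.cosh, "sinh": np.sinh}

def rescaled_loss(A, b, x, unit="exp"):
    # || u(x) - <u(x), 1_n> * b ||_2 with u(x) = unit(Ax).
    u = UNITS[unit](A @ x)
    return np.linalg.norm(u - u.sum() * b)

# Toy instance; the rescaling is on b, unlike the softmax objective above.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)
for name in UNITS:
    print(name, rescaled_loss(A, b, x, name))
```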
Related papers
- How to Inverting the Leverage Score Distribution? [16.744561210470632]
Although leverage scores are widely used as a tool, this paper studies a novel problem, namely inverting the leverage score.
We use iterative shrinking and the induction hypothesis to ensure global convergence rates for the Newton method.
This important study on inverting statistical leverage opens up numerous new applications in interpretation, data recovery, and security.
arXiv Detail & Related papers (2024-04-21T21:36:42Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Distribution-Independent Regression for Generalized Linear Models with
Oblivious Corruptions [49.69852011882769]
We give the first algorithms for regression for generalized linear models (GLMs) in the presence of additive oblivious noise.
We present an algorithm that tackles this problem in its most general distribution-independent setting.
This is the first algorithmic result for GLM regression with oblivious noise that can handle more than half the samples being arbitrarily corrupted.
arXiv Detail & Related papers (2023-09-20T21:41:59Z) - In-Context Learning for Attention Scheme: from Single Softmax Regression
to Multiple Softmax Regression via a Tensor Trick [15.090593955414137]
We consider in-context learning under two formulations for attention-related regression in this work.
Our regression problem shares similarities with previous studies on softmax-related regression.
arXiv Detail & Related papers (2023-07-05T16:41:01Z) - Attention Scheme Inspired Softmax Regression [20.825033982038455]
Large language models (LLMs) have brought transformative changes to human society.
One of the key computations in LLMs is the softmax unit.
In this work, inspired by the softmax unit, we define a softmax regression problem.
arXiv Detail & Related papers (2023-04-20T15:50:35Z) - An Over-parameterized Exponential Regression [18.57735939471469]
Recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions.
We define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \to \mathbb{R}$ using an exponential activation function.
arXiv Detail & Related papers (2023-03-29T07:29:07Z) - Solving Regularized Exp, Cosh and Sinh Regression Problems [40.47799094316649]
Attention computation is a fundamental task for large language models such as Transformer, GPT-4, and ChatGPT.
The straightforward method is to use the naive Newton's method.
arXiv Detail & Related papers (2023-03-28T04:26:51Z) - Private Isotonic Regression [54.32252900997422]
We consider the problem of isotonic regression over a partially ordered set (poset) $\mathcal{X}$ and for any Lipschitz loss function.
We obtain a pure-DP algorithm that has an expected excess empirical risk of roughly $\mathrm{width}(\mathcal{X}) \cdot \log|\mathcal{X}| / n$, where $\mathrm{width}(\mathcal{X})$ is the width of the poset.
We show that the bounds above are essentially the best that can be obtained.
arXiv Detail & Related papers (2022-10-27T05:08:07Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ achieving $F(\mathbf{w}) = C \cdot \mathrm{opt} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector.
We show that this dependence on $d$ is optimal, up to logarithmic factors.
We also provide the first total sensitivity upper bound of $O(d^{\max\{1, p/2\}} \log^2 n)$ for loss functions with at most degree-$p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z) - Optimal Mean Estimation without a Variance [103.26777953032537]
We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist.
We design an estimator which attains the smallest possible confidence interval as a function of $n, d, \delta$.
arXiv Detail & Related papers (2020-11-24T22:39:21Z)