In-Context Learning for Attention Scheme: from Single Softmax Regression
to Multiple Softmax Regression via a Tensor Trick
- URL: http://arxiv.org/abs/2307.02419v1
- Date: Wed, 5 Jul 2023 16:41:01 GMT
- Title: In-Context Learning for Attention Scheme: from Single Softmax Regression
to Multiple Softmax Regression via a Tensor Trick
- Authors: Yeqi Gao, Zhao Song, Shenghao Xie
- Abstract summary: We consider the in-context learning under two formulation for attention related regression in this work.
Our regression problem shares similarities with previous studies on softmax-related regression.
- Score: 15.090593955414137
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have brought significant and transformative
changes in human society. These models have demonstrated remarkable
capabilities in natural language understanding and generation, leading to
various advancements and impacts across several domains.
We consider in-context learning under two formulations of attention-related
regression in this work. Given matrices $A_1 \in \mathbb{R}^{n \times d}$,
$A_2 \in \mathbb{R}^{n \times d}$, and $B \in \mathbb{R}^{n \times n}$, the goal
is to solve the following optimization problems: the normalized version
$\min_{X} \| D(X)^{-1} \exp(A_1 X A_2^\top) - B \|_F^2$ and the rescaled version
$\min_{X} \| \exp(A_1 X A_2^\top) - D(X) \cdot B \|_F^2$. Here $D(X) := \mathrm{diag}(
\exp(A_1 X A_2^\top) {\bf 1}_n )$.
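To make the two objectives concrete, the following is a minimal NumPy sketch (an illustration, not code from the paper) that evaluates both losses for randomly drawn $A_1$, $A_2$, $B$, and $X$, with shapes as defined above.

```python
# Minimal illustrative sketch (not from the paper): evaluate both objectives.
# Shapes follow the abstract: A1, A2 are n x d, B is n x n, X is d x d.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
A1 = rng.standard_normal((n, d))
A2 = rng.standard_normal((n, d))
B = rng.standard_normal((n, n))
X = rng.standard_normal((d, d))

E = np.exp(A1 @ X @ A2.T)        # entrywise exp(A1 X A2^T), an n x n matrix
D = np.diag(E @ np.ones(n))      # D(X) = diag(exp(A1 X A2^T) 1_n)

# Normalized version: || D(X)^{-1} exp(A1 X A2^T) - B ||_F^2
normalized_loss = np.linalg.norm(np.linalg.solve(D, E) - B, "fro") ** 2
# Rescaled version: || exp(A1 X A2^T) - D(X) B ||_F^2
rescaled_loss = np.linalg.norm(E - D @ B, "fro") ** 2
```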
Our regression problems share similarities with previous studies on
softmax-related regression. Prior research has extensively investigated
regression techniques related to softmax regression: the normalized version
$\min_{x} \| \langle \exp(Ax) , {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2$ and
the rescaled version $\min_{x} \| \exp(Ax) - \langle \exp(Ax), {\bf 1}_n \rangle b \|_2^2$.
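For comparison, here is a similarly illustrative sketch of those two vector-valued softmax regression losses; the function names are ours, introduced only for this example.

```python
import numpy as np

def normalized_softmax_loss(A, x, b):
    # || <exp(Ax), 1_n>^{-1} exp(Ax) - b ||_2^2
    u = np.exp(A @ x)
    return np.linalg.norm(u / u.sum() - b) ** 2

def rescaled_softmax_loss(A, x, b):
    # || exp(Ax) - <exp(Ax), 1_n> b ||_2^2
    u = np.exp(A @ x)
    return np.linalg.norm(u - u.sum() * b) ** 2
```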
In contrast to previous approaches, we adopt a vectorization technique (a
tensor trick) to address the regression problem in its matrix formulation. This
approach expands the dimension from $d$ to $d^2$, so that the resulting problem
resembles the vector formulation of the regression problem mentioned earlier.
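The dimension expansion can be illustrated with the standard Kronecker-product identity $\mathrm{vec}(A_1 X A_2^\top) = (A_2 \otimes A_1)\,\mathrm{vec}(X)$ (column-stacking $\mathrm{vec}$). The snippet below is a sketch of this identity under that assumption, not the paper's implementation; it shows how the unknown becomes a $d^2$-dimensional vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 3
A1 = rng.standard_normal((n, d))
A2 = rng.standard_normal((n, d))
X = rng.standard_normal((d, d))

# Column-stacking vectorization: vec(A1 X A2^T) = (A2 kron A1) vec(X),
# so the unknown moves from a d x d matrix X to a d^2-dimensional vector vec(X).
lhs = (A1 @ X @ A2.T).flatten(order="F")        # length n^2
rhs = np.kron(A2, A1) @ X.flatten(order="F")    # (n^2 x d^2) @ (d^2,)
assert np.allclose(lhs, rhs)
```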
Upon completing the Lipschitz analysis of our regression function, we derive
our main result concerning in-context learning.
Related papers
- How to Inverting the Leverage Score Distribution? [16.744561210470632]
Despite leverage scores being widely used as a tool, in this paper we study a novel problem, namely inverting the leverage score distribution.
We use iterative shrinking and the induction hypothesis to ensure global convergence rates for the Newton method.
This important study on inverting statistical leverage opens up numerous new applications in interpretation, data recovery, and security.
arXiv Detail & Related papers (2024-04-21T21:36:42Z) - A Unified Scheme of ResNet and Softmax [8.556540804058203]
We provide a theoretical analysis of the regression problem $\| \langle \exp(Ax) + Ax, {\bf 1}_n \rangle^{-1} ( \exp(Ax) + Ax ) - b \|_2^2$.
This regression problem is a unified scheme that combines softmax regression and ResNet, which has never been done before.
arXiv Detail & Related papers (2023-09-23T21:41:01Z) - Solving Attention Kernel Regression Problem via Pre-conditioner [9.131385887605935]
We design algorithms for two types of regression problems: $\min_{x \in \mathbb{R}^d} \| (A^\top A)^j x - b \|$ for any positive integer $j$.
The second proxy is applying the exponential entrywise to the Gram matrix, denoted by $\exp(AA^\top)$, and solving the regression $\min_{x \in \mathbb{R}^n} \| \exp(AA^\top) x - b \|$.
arXiv Detail & Related papers (2023-08-28T04:37:38Z) - An Iterative Algorithm for Rescaled Hyperbolic Functions Regression [15.090593955414137]
This paper studies the convergence of exponential regression and softmax regression.
We provide an input sparsity time algorithm for this problem.
Our algorithm framework is very general and can be applied to functions like $\cosh(\cdot)$ and $\sinh(\cdot)$ as well.
arXiv Detail & Related papers (2023-05-01T05:16:07Z) - Optimal Sketching Bounds for Sparse Linear Regression [116.30196615349226]
We study oblivious sketching for $k$-sparse linear regression under various loss functions such as an $\ell_p$ norm, or from a broad class of hinge-like loss functions.
We show that for sparse $\ell_2$ norm regression, there is an oblivious distribution over sketches with $\Theta(k \log(d/k)/\varepsilon^2)$ rows, which is tight up to a constant factor.
We also show that sketching $O(\mu^2 k \log(\mu n d/\varepsilon)/\varepsilon^2)$ ...
arXiv Detail & Related papers (2023-04-05T07:24:19Z) - Almost Linear Constant-Factor Sketching for $\ell_1$ and Logistic
Regression [74.28017932704704]
We improve upon previous oblivious sketching and turnstile streaming results for $\ell_1$ and logistic regression.
We also give a tradeoff that yields a $(1+\varepsilon)$ approximation in input sparsity time.
Our sketch can be extended to approximate a regularized version of logistic regression where the data-dependent regularizer corresponds to the variance of the individual logistic losses.
arXiv Detail & Related papers (2023-03-31T18:12:33Z) - Solving Regularized Exp, Cosh and Sinh Regression Problems [40.47799094316649]
Attention computation is a fundamental task for large language models such as Transformer, GPT-4 and ChatGPT.
The straightforward method is to use the naive Newton's method.
arXiv Detail & Related papers (2023-03-28T04:26:51Z) - Private Isotonic Regression [54.32252900997422]
We consider the problem of isotonic regression over a partially ordered set (poset) $\mathcal{X}$ and for any Lipschitz loss function.
We obtain a pure-DP algorithm that has an expected excess empirical risk of roughly $\mathrm{width}(\mathcal{X}) \cdot \log|\mathcal{X}| / n$, where $\mathrm{width}(\mathcal{X})$ is the width of the poset.
We show that the bounds above are essentially the best that can be obtained.
arXiv Detail & Related papers (2022-10-27T05:08:07Z) - Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector.
We show that this dependence on $d$ is optimal, up to logarithmic factors.
We also provide the first total sensitivity upper bound of $O(d^{\max\{1, p/2\}} \log^2 n)$ for loss functions with at most degree $p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z) - Statistical Query Lower Bounds for List-Decodable Linear Regression [55.06171096484622]
We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples.
Our main result is a Statistical Query (SQ) lower bound of $d^{\mathrm{poly}(1/\alpha)}$ for this problem.
arXiv Detail & Related papers (2021-06-17T17:45:21Z) - Online nonparametric regression with Sobolev kernels [99.12817345416846]
We derive the regret upper bounds on the classes of Sobolev spaces $W_p^\beta(\mathcal{X})$, $p \geq 2$, $\beta > \frac{d}{p}$.
The upper bounds are supported by the minimax regret analysis, which reveals that in the cases $\beta > \frac{d}{2}$ or $p = \infty$ these rates are (essentially) optimal.
arXiv Detail & Related papers (2021-02-06T15:05:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.