An Iterative Algorithm for Rescaled Hyperbolic Functions Regression
- URL: http://arxiv.org/abs/2305.00660v1
- Date: Mon, 1 May 2023 05:16:07 GMT
- Title: An Iterative Algorithm for Rescaled Hyperbolic Functions Regression
- Authors: Yeqi Gao, Zhao Song, Junze Yin
- Abstract summary: This paper studies a rescaled variant of softmax regression that also covers hyperbolic functions, building on prior convergence results for exponential regression and softmax regression.
We provide an input sparsity time algorithm for this problem.
Our algorithm framework is very general and can be applied to functions like $\cosh()$ and $\sinh()$ as well.
- Score: 15.090593955414137
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have numerous real-life applications across
various domains, such as natural language translation, sentiment analysis,
language modeling, chatbots and conversational agents, creative writing, text
classification, summarization, and generation. LLMs have shown great promise in
improving the accuracy and efficiency of these tasks, and have the potential to
revolutionize the field of natural language processing (NLP) in the years to
come.
The exponential-function-based attention unit is a fundamental element of LLMs.
Several previous works have studied the convergence of exponential regression
and softmax regression.
The exponential regression [Li, Song, Zhou 2023] and softmax regression
[Deng, Li, Song 2023] can be formulated as follows. Given matrix $A \in
\mathbb{R}^{n \times d}$ and vector $b \in \mathbb{R}^n$, the goal of
exponential regression is to solve \begin{align*} \min_{x} \| \exp(Ax) - b \|_2
\end{align*} and the goal of softmax regression is to solve \begin{align*}
\min_{x} \| \langle \exp(Ax) , {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2 .
\end{align*}
In this work, we define a slightly different formulation than softmax
regression. \begin{align*} \min_{x \in \mathbb{R}^d } \| u(x) - \langle u(x) ,
{\bf 1}_n \rangle \cdot b \|_2 \end{align*} where $u(x) \in \{ \exp(Ax),
\cosh(Ax) , \sinh(Ax) \}$. We provide an input sparsity time algorithm for this
problem. Our algorithm framework is very general and can be applied to
functions like $\cosh()$ and $\sinh()$ as well. Our technique is also general
enough to be applied to in-context learning for rescaled softmax regression.
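To make the objectives above concrete, here is a minimal NumPy sketch that simply evaluates each loss for a candidate $x$; it illustrates the formulations only and is not the paper's input sparsity time solver. The random $A$, $b$, and $x$ are placeholder data.
```python
# Minimal sketch of the three objectives quoted above (illustrative only;
# this is NOT the paper's input-sparsity-time algorithm). A, b, x are
# randomly generated placeholders.
import numpy as np

def exp_regression_loss(A, x, b):
    """|| exp(Ax) - b ||_2  (exponential regression, [Li, Song, Zhou 2023])."""
    return np.linalg.norm(np.exp(A @ x) - b)

def softmax_regression_loss(A, x, b):
    """|| <exp(Ax), 1_n>^{-1} exp(Ax) - b ||_2  (softmax regression, [Deng, Li, Song 2023])."""
    u = np.exp(A @ x)
    return np.linalg.norm(u / u.sum() - b)

def rescaled_loss(A, x, b, u_fn=np.exp):
    """|| u(x) - <u(x), 1_n> * b ||_2 with u in {exp, cosh, sinh} (this paper's formulation)."""
    u = u_fn(A @ x)
    return np.linalg.norm(u - u.sum() * b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 3
    A = rng.standard_normal((n, d))
    b = rng.random(n)
    x = rng.standard_normal(d)
    print("exp regression:", exp_regression_loss(A, x, b))
    print("softmax regression:", softmax_regression_loss(A, x, b))
    for u_fn in (np.exp, np.cosh, np.sinh):
        print("rescaled,", u_fn.__name__, ":", rescaled_loss(A, x, b, u_fn))
```
Note that, compared with softmax regression, the rescaled formulation moves the normalization factor $\langle u(x), {\bf 1}_n \rangle$ onto $b$ rather than dividing $u(x)$ by it.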
Related papers
- Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems.
Such problems are encountered in medicine, physics, and machine learning.
We provide convergence guarantees for the proposed algorithm under both types of noise.
arXiv Detail & Related papers (2024-11-21T10:26:17Z) - A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO).
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z) - Local Convergence of Approximate Newton Method for Two Layer Nonlinear
Regression [21.849997443967705]
The two-layer regression problem has been well studied in prior works.
The first layer is activated by a ReLU unit, and the second layer is activated by a softmax unit.
We prove that the Hessian of the loss function is positive definite and Lipschitz continuous under certain assumptions.
arXiv Detail & Related papers (2023-11-26T19:19:02Z) - A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts [28.13187489224953]
We propose a novel class of modified softmax gating functions which transform the input before delivering it to the gating functions.
As a result, the previous interaction disappears and the parameter estimation rates are significantly improved.
arXiv Detail & Related papers (2023-10-22T05:32:19Z) - The Closeness of In-Context Learning and Weight Shifting for Softmax
Regression [42.95984289657388]
We study in-context learning based on a softmax regression formulation.
We show that when training self-attention-only Transformers for fundamental regression tasks, the models learned by gradient descent and Transformers show great similarity.
arXiv Detail & Related papers (2023-04-26T04:33:41Z) - Adaptive LASSO estimation for functional hidden dynamic geostatistical
model [69.10717733870575]
We propose a novel model selection algorithm based on a penalized maximum likelihood estimator (PMLE) for functional hidden dynamic geostatistical models (f-HDGM).
The algorithm is based on iterative optimisation and uses an adaptive least absolute shrinkage and selection operator (adaptive LASSO) penalty function, wherein the weights are obtained from the unpenalised f-HDGM maximum-likelihood estimators.
arXiv Detail & Related papers (2022-08-10T19:17:45Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing attempts to remove or approximate the softmax in self-attention are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU (a short sketch of this swap appears after this list).
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z) - Piecewise linear regression and classification [0.20305676256390928]
This paper proposes a method for solving multivariate regression and classification problems using piecewise linear predictors.
A Python implementation of the algorithm described in this paper is available at http://cse.lab.imtlucca.it/bemporad/parc.
arXiv Detail & Related papers (2021-03-10T17:07:57Z)
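For the Rectified Linear Attention (ReLA) entry above, the core change, replacing the softmax over attention scores with an element-wise ReLU, can be sketched as follows. This is a hedged single-head illustration: the shapes, scaling, and the omission of ReLA's additional normalization are simplifying assumptions, not the authors' exact implementation.
```python
# Hedged single-head sketch of rectified linear attention (ReLA): the softmax
# over the score matrix is replaced by an element-wise ReLU, so attention rows
# can contain exact zeros and no longer sum to one.
import numpy as np

def rela_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_query, n_key) scaled dot products
    weights = np.maximum(scores, 0.0)         # ReLU instead of softmax -> sparse weights
    return weights @ V                        # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 16)) for _ in range(3))
print(rela_attention(Q, K, V).shape)  # (5, 16)
```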
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.