Attention Scheme Inspired Softmax Regression
- URL: http://arxiv.org/abs/2304.10411v2
- Date: Wed, 26 Apr 2023 15:34:52 GMT
- Title: Attention Scheme Inspired Softmax Regression
- Authors: Yichuan Deng, Zhihang Li, Zhao Song
- Abstract summary: Large language models (LLMs) have brought transformative changes to human society.
One of the key computations in LLMs is the softmax unit.
In this work, inspired by the softmax unit, we define a softmax regression problem.
- Score: 20.825033982038455
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have brought transformative changes to human society.
One of the key computations in LLMs is the softmax unit. This operation is
important in LLMs because it allows the model to generate a distribution over
possible next words or phrases, given a sequence of input words. This
distribution is then used to select the most likely next word or phrase, based
on the probabilities assigned by the model. The softmax unit plays a crucial
role in training LLMs, as it allows the model to learn from the data by
adjusting the weights and biases of the neural network.
In the area of convex optimization, for example when using the central path method
to solve linear programming, the softmax function has been used as a crucial tool
for controlling the progress and stability of the potential function [Cohen, Lee and
Song STOC 2019, Brand SODA 2020].
In this work, inspired by the softmax unit, we define a softmax regression
problem. Formally speaking, given a matrix $A \in \mathbb{R}^{n \times d}$ and
a vector $b \in \mathbb{R}^n$, the goal is to use a greedy-type algorithm to
solve \begin{align*} \min_{x} \| \langle \exp(Ax), {\bf 1}_n \rangle^{-1}
\exp(Ax) - b \|_2^2. \end{align*} In a certain sense, our provable convergence
result provides theoretical support for why we can use greedy algorithms to
train the softmax function in practice.
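As a rough illustration of the objective above, here is a minimal NumPy sketch (not taken from the paper) that evaluates the loss $\| \langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2$ and runs plain gradient descent as a stand-in for the greedy-type algorithm; the function names, step size, iteration count, and toy data are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal sketch (assumption): plain gradient descent on the softmax
# regression objective from the abstract; not the paper's exact algorithm.
import numpy as np

def softmax_regression_loss(x, A, b):
    """L(x) = || <exp(Ax), 1_n>^{-1} exp(Ax) - b ||_2^2."""
    u = np.exp(A @ x)          # exp(Ax)
    f = u / u.sum()            # normalize by <exp(Ax), 1_n>
    return np.sum((f - b) ** 2)

def softmax_regression_grad(x, A, b):
    """Gradient of L via the softmax Jacobian J = diag(f) - f f^T."""
    u = np.exp(A @ x)
    f = u / u.sum()
    J = np.diag(f) - np.outer(f, f)   # Jacobian of the normalized exp w.r.t. Ax
    return 2.0 * A.T @ (J @ (f - b))

def greedy_descent(A, b, lr=0.1, iters=500):
    """Greedy-type (descent) loop: repeatedly step against the gradient."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x -= lr * softmax_regression_grad(x, A, b)
    return x

# Toy usage: b lies on the probability simplex, matching the normalized
# exp(Ax) target; the dimensions n = 20, d = 5 are arbitrary.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.dirichlet(np.ones(20))
x_hat = greedy_descent(A, b)
print(softmax_regression_loss(x_hat, A, b))
```

This is only meant to make the notation concrete; it does not reproduce the paper's analysis or the assumptions under which its convergence result holds.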
Related papers
- MultiMax: Sparse and Multi-Modal Attention Learning [60.49318008131978]
SoftMax is a ubiquitous ingredient of modern machine learning algorithms.
We show that sparsity can be achieved by a family of SoftMax variants, but they often require an alternative loss function and do not preserve multi-modality.
We propose MultiMax, which adaptively modulates the output distribution according to input entry range.
arXiv Detail & Related papers (2024-06-03T10:51:43Z) - Binary Hypothesis Testing for Softmax Models and Leverage Score Models [8.06972158448711]
We consider the problem of binary hypothesis testing in the setting of softmax models.
We draw analogy between the softmax model and the leverage score model.
arXiv Detail & Related papers (2024-05-09T15:56:29Z) - Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression [21.849997443967705]
The two-layer regression problem has been well studied in prior works.
The first layer is activated by a ReLU unit, and the second layer is activated by a softmax unit.
We prove that the Hessian of the loss function is positive definite and Lipschitz continuous under certain assumptions.
arXiv Detail & Related papers (2023-11-26T19:19:02Z) - A Unified Scheme of ResNet and Softmax [8.556540804058203]
We provide a theoretical analysis of the regression problem: $\| \langle \exp(Ax) + Ax, {\bf 1}_n \rangle^{-1} ( \exp(Ax) + Ax ) - b \|_2^2$.
This regression problem is a unified scheme that combines softmax regression and ResNet, which has never been done before.
arXiv Detail & Related papers (2023-09-23T21:41:01Z) - Zero-th Order Algorithm for Softmax Attention Optimization [21.631643446337737]
We present a Zero-th Order algorithm specifically tailored for Softmax optimization.
We demonstrate the convergence of our algorithm, highlighting its effectiveness in efficiently computing gradients for large-scale language models.
arXiv Detail & Related papers (2023-07-17T09:43:50Z) - An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting.
Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function.
We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z) - An Iterative Algorithm for Rescaled Hyperbolic Functions Regression [15.090593955414137]
This paper studies the convergence of exponential regression and softmax regression.
We provide an input sparsity time algorithm for this problem.
Our algorithm framework is very general and can be applied to functions like $\cosh()$ and $\sinh()$ as well.
arXiv Detail & Related papers (2023-05-01T05:16:07Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z)