An Iterative Algorithm for Rescaled Hyperbolic Functions Regression
- URL: http://arxiv.org/abs/2305.00660v2
- Date: Mon, 17 Feb 2025 05:12:27 GMT
- Title: An Iterative Algorithm for Rescaled Hyperbolic Functions Regression
- Authors: Yeqi Gao, Zhao Song, Junze Yin
- Abstract summary: Large language models (LLMs) have numerous real-life applications across various domains.
LLMs have the potential to revolutionize the field of natural language processing (NLP).
- Score: 7.578147116161996
- Abstract: Large language models (LLMs) have numerous real-life applications across various domains, such as natural language translation, sentiment analysis, language modeling, chatbots and conversational agents, creative writing, text classification, summarization, and generation. LLMs have shown great promise in improving the accuracy and efficiency of these tasks, and have the potential to revolutionize the field of natural language processing (NLP) in the years to come. The exponential-function-based attention unit is a fundamental element in LLMs. Several previous works have studied the convergence of exponential regression and softmax regression. In this paper, we propose an iterative algorithm to solve a rescaled version of the slightly different formulation of the softmax regression problem that arises in the attention mechanism of large language models. Specifically, we consider minimizing the squared loss between $u(Ax)$, where $u$ is applied entrywise and can be the exponential, hyperbolic sine, or hyperbolic cosine function, and a target $n$-dimensional vector $b$ scaled by the normalization term $\langle u(Ax), \mathbf{1}_n \rangle$. This ``rescaled softmax regression'' differs from classical softmax regression in the location of the normalization factor. The efficiency and generalizability of this framework to multiple hyperbolic functions make it relevant for optimizing attention mechanisms. The analysis also leads to a corollary bounding solution changes under small perturbations, which is relevant for in-context learning. Limitations and societal impact are discussed.
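The rescaled objective described in the abstract can be written as minimizing $\frac{1}{2}\,\| u(Ax) - \langle u(Ax), \mathbf{1}_n\rangle\, b \|_2^2$ over $x$. Below is a minimal numerical sketch of this loss and its gradient, intended only to illustrate the formulation; the paper's actual method is an iterative Newton-type algorithm with further regularization and input-sparsity considerations that this sketch does not reproduce, and all function and variable names here are illustrative.

```python
import numpy as np

def rescaled_loss_and_grad(x, A, b, u=np.exp, du=np.exp):
    """Loss 0.5 * ||u(Ax) - <u(Ax), 1_n> * b||_2^2 and its gradient.

    (u, du) is the chosen function and its entrywise derivative:
    (np.exp, np.exp), (np.sinh, np.cosh), or (np.cosh, np.sinh).
    """
    z = A @ x                      # pre-activation, an n-dimensional vector
    v, dv = u(z), du(z)            # u(Ax) and u'(Ax), applied entrywise
    c = v - v.sum() * b            # residual: u(Ax) - <u(Ax), 1_n> * b
    loss = 0.5 * c @ c
    # Jacobian of the residual is (I_n - b 1_n^T) diag(u'(Ax)) A, hence:
    grad = A.T @ (dv * (c - b @ c))
    return loss, grad

# Tiny usage example: plain gradient descent on the rescaled loss
# (the paper analyzes a Newton-type iteration instead).
rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((8, 3))   # small entries keep exp(Ax) well behaved
b = np.abs(rng.standard_normal(8)); b /= b.sum()
x = np.zeros(3)
for _ in range(300):
    loss, g = rescaled_loss_and_grad(x, A, b)
    x -= 0.1 * g
```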
Related papers
- Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems.
Such problems are encountered in medicine, physics, and machine learning.
We provide convergence guarantees for the proposed algorithm under both types of noise.
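This entry concerns a gradient-free (zero-order) method, where only function values are available. As a hedged illustration of the general mechanism such methods rely on, and not the accelerated algorithm from the cited paper, here is a standard two-point zero-order gradient estimator inside an SGD-style loop:

```python
import numpy as np

def two_point_gradient_estimate(f, x, rng, tau=1e-4):
    """Two-point zero-order gradient estimator along a random unit direction e:
    g = d * (f(x + tau*e) - f(x - tau*e)) / (2*tau) * e.
    Only function evaluations are used, matching the gradient-free setting."""
    d = x.shape[0]
    e = rng.standard_normal(d)
    e /= np.linalg.norm(e)
    return d * (f(x + tau * e) - f(x - tau * e)) / (2.0 * tau) * e

# Usage: zero-order SGD on a simple convex quadratic.
rng = np.random.default_rng(0)
f = lambda x: 0.5 * np.sum((x - 1.0) ** 2)
x = np.zeros(5)
for _ in range(2000):
    x -= 0.01 * two_point_gradient_estimate(f, x, rng)
```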
arXiv Detail & Related papers (2024-11-21T10:26:17Z) - A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO).
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution, and the outer loss becomes an expected loss over the inner distribution.
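As a hedged sketch of the idea summarized in this entry, consider hyperparameter optimization as the bi-level problem: an inner ridge-regression fit and an outer validation loss. Following the stated reformulation only loosely, the inner solution below is replaced by samples from a smooth (Gaussian) distribution around it, and the outer loss is taken in expectation. All concrete choices (Gaussian smoothing, sample count, the grid search) are illustrative assumptions, not the cited paper's construction.

```python
import numpy as np

def inner_solution(lam, X_tr, y_tr):
    """Closed-form ridge regression solution of the inner problem."""
    d = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

def expected_outer_loss(lam, X_tr, y_tr, X_val, y_val, rng, sigma=0.05, n_samples=64):
    """Outer (validation) loss as an expectation over a smoothed inner solution."""
    w_star = inner_solution(lam, X_tr, y_tr)
    samples = w_star + sigma * rng.standard_normal((n_samples, w_star.shape[0]))
    return np.mean([np.mean((X_val @ w - y_val) ** 2) for w in samples])

# Usage: crude outer search over the ridge hyperparameter.
rng = np.random.default_rng(0)
X_tr, y_tr = rng.standard_normal((40, 5)), rng.standard_normal(40)
X_val, y_val = rng.standard_normal((20, 5)), rng.standard_normal(20)
best_lam = min([0.01, 0.1, 1.0, 10.0],
               key=lambda lam: expected_outer_loss(lam, X_tr, y_tr, X_val, y_val, rng))
```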
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z) - Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression [21.849997443967705]
The two-layer regression problem has been well studied in prior works.
The first layer is activated by a ReLU unit, and the second layer is activated by a softmax unit.
We prove that the Hessian of the loss function is positive definite and Lipschitz continuous under certain assumptions.
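A minimal sketch of the two-layer model this entry refers to, with a ReLU-activated first layer feeding a softmax-activated second layer; the weight shapes and the squared loss are illustrative assumptions, and the cited paper's contribution is the Hessian analysis and local convergence proof rather than this forward pass.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def two_layer_forward(x, W1, W2):
    """Two-layer model: ReLU first layer, softmax second layer."""
    h = np.maximum(W1 @ x, 0.0)    # first layer: ReLU activation
    return softmax(W2 @ h)         # second layer: softmax activation

def squared_loss(x, y, W1, W2):
    """Squared loss between the model output and a target vector y."""
    return 0.5 * np.sum((two_layer_forward(x, W1, W2) - y) ** 2)
```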
arXiv Detail & Related papers (2023-11-26T19:19:02Z) - A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts [28.13187489224953]
We propose a novel class of modified softmax gating functions which transform the input before delivering it to the gating functions.
As a result, the previous interaction disappears and the parameter estimation rates are significantly improved.
arXiv Detail & Related papers (2023-10-22T05:32:19Z) - The Closeness of In-Context Learning and Weight Shifting for Softmax Regression [42.95984289657388]
We study in-context learning based on a softmax regression formulation.
We show that when training self-attention-only Transformers for fundamental regression tasks, the models learned by gradient-descent and Transformers show great similarity.
arXiv Detail & Related papers (2023-04-26T04:33:41Z) - Adaptive LASSO estimation for functional hidden dynamic geostatistical model [69.10717733870575]
We propose a novel model selection algorithm based on a penalized maximum likelihood estimator (PMLE) for functional hidden dynamic geostatistical models (f-HDGM).
The algorithm is based on iterative optimisation and uses an adaptive least absolute shrinkage and selection operator (GMSOLAS) penalty function, wherein the weights are obtained by the unpenalised f-HDGM maximum-likelihood estimators.
arXiv Detail & Related papers (2022-08-10T19:17:45Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
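A hedged sketch contrasting standard softmax attention with the ReLU replacement described in this entry; the cited paper additionally applies a normalization step (e.g. RMSNorm-style) to stabilize training, which this sketch omits.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention with a row-wise softmax."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def rectified_linear_attention(Q, K, V):
    """Attention with the softmax replaced by a ReLU: negative scores are
    zeroed exactly, which is what yields sparse attention weights."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.maximum(scores, 0.0)                    # ReLU in place of softmax
    return w @ V

# Both take queries Q (m x d), keys K (n x d), and values V (n x d_v).
```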
arXiv Detail & Related papers (2021-04-14T17:52:38Z) - Piecewise linear regression and classification [0.20305676256390928]
This paper proposes a method for solving multivariate regression and classification problems using piecewise linear predictors.
A Python implementation of the algorithm described in this paper is available at http://cse.lab.imtlucca.it/bemporad/parc.
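As a generic illustration of piecewise linear prediction (a sketch of the idea only, not the PARC algorithm implemented at the URL above, which jointly optimizes the partition and the predictors), one can partition the inputs and fit a separate linear model per region:

```python
import numpy as np

def fit_piecewise_linear(X, y, k=3, iters=20, seed=0):
    """Partition inputs with a plain k-means step, then fit a least-squares
    linear model (with intercept) in each region."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):                               # k-means partition
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    X1 = np.hstack([X, np.ones((len(X), 1))])            # intercept column
    models = []
    for j in range(k):                                   # per-region linear fit
        idx = labels == j
        if np.any(idx):
            w, *_ = np.linalg.lstsq(X1[idx], y[idx], rcond=None)
        else:
            w = np.zeros(X1.shape[1])
        models.append(w)
    return centers, models

def predict_piecewise_linear(X, centers, models):
    """Route each input to its nearest region and apply that region's model."""
    labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return np.array([X1[i] @ models[labels[i]] for i in range(len(X))])
```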
arXiv Detail & Related papers (2021-03-10T17:07:57Z)