Related papers: An Iterative Algorithm for Rescaled Hyperbolic Functions Regression

An Iterative Algorithm for Rescaled Hyperbolic Functions Regression

URL: http://arxiv.org/abs/2305.00660v2
Date: Mon, 17 Feb 2025 05:12:27 GMT
Title: An Iterative Algorithm for Rescaled Hyperbolic Functions Regression
Authors: Yeqi Gao, Zhao Song, Junze Yin,
Abstract summary: Large language models (LLMs) have numerous real-life applications across various domains.<n>LLMs have the potential to revolutionize the field of natural language processing (NLP)
Score: 7.578147116161996
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large language models (LLMs) have numerous real-life applications across various domains, such as natural language translation, sentiment analysis, language modeling, chatbots and conversational agents, creative writing, text classification, summarization, and generation. LLMs have shown great promise in improving the accuracy and efficiency of these tasks, and have the potential to revolutionize the field of natural language processing (NLP) in the years to come. Exponential function based attention unit is a fundamental element in LLMs. Several previous works have studied the convergence of exponential regression and softmax regression. In this paper, we propose an iterative algorithm to solve a rescaled version of the slightly different formulation of the softmax regression problem that arises in attention mechanisms of large language models. Specifically, we consider minimizing the squared loss between a certain function, which can be either the exponential function, hyperbolic sine function, or hyperbolic cosine function, and its inner product with a target $n$-dimensional vector $b$, scaled by the normalization term. This ``rescaled softmax regression'' differs from classical softmax regression in the location of the normalization factor. The efficiency and generalizability of this framework to multiple hyperbolic functions make it relevant for optimizing attention mechanisms. The analysis also leads to a corollary bounding solution changes under small perturbations for in-context learning. Limitations and societal impact are discussed.

Related papers

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training [67.45211108321203]
We introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer.<n>We show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs.
arXiv Detail & Related papers (2025-06-05T16:50:23Z)
Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems. Such problems are encountered in medicine, physics, and machine learning. We provide convergence guarantees for the proposed algorithm under both types of noise.
arXiv Detail & Related papers (2024-11-21T10:26:17Z)
A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning. These problems are often formalized as Bi-Level optimizations (BLO) We introduce a novel perspective by turning a given BLO problem into a ii optimization, where the inner loss function becomes a smooth distribution, and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z)
Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models. We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z)
How to Inverting the Leverage Score Distribution? [16.744561210470632]
Despite leverage scores being widely used as a tool, in this paper, we study a novel problem, namely the inverting leverage score. We use iterative shrinking and the induction hypothesis to ensure global convergence rates for the Newton method. This important study on inverting statistical leverage opens up numerous new applications in interpretation, data recovery, and security.
arXiv Detail & Related papers (2024-04-21T21:36:42Z)
Provably learning a multi-head attention layer [55.2904547651831]
Multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models. In this work, we initiate the study of provably learning a multi-head attention layer from random examples. We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z)
Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression [21.849997443967705]
Two-layer regression problem has been well-studied in prior works. First layer is activated by a ReLU unit, and the second layer is activated by a softmax unit. We prove that the loss function for the Hessian matrix is positive definite and Lipschitz continuous under certain assumptions.
arXiv Detail & Related papers (2023-11-26T19:19:02Z)
A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts [28.13187489224953]
We propose a novel class of modified softmax gating functions which transform the input before delivering them to the gating functions. As a result, the previous interaction disappears and the parameter estimation rates are significantly improved.
arXiv Detail & Related papers (2023-10-22T05:32:19Z)
Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions [49.69852011882769]
We show the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We present an algorithm that tackles newthis problem in its most general distribution-independent setting. This is the first newalgorithmic result for GLM regression newwith oblivious noise which can handle more than half the samples being arbitrarily corrupted.
arXiv Detail & Related papers (2023-09-20T21:41:59Z)
In-Context Learning for Attention Scheme: from Single Softmax Regression to Multiple Softmax Regression via a Tensor Trick [15.090593955414137]
We consider the in-context learning under two formulation for attention related regression in this work. Our regression problem shares similarities with previous studies on softmax-related regression.
arXiv Detail & Related papers (2023-07-05T16:41:01Z)
The Closeness of In-Context Learning and Weight Shifting for Softmax Regression [42.95984289657388]
We study the in-context learning based on a softmax regression formulation. We show that when training self-attention-only Transformers for fundamental regression tasks, the models learned by gradient-descent and Transformers show great similarity.
arXiv Detail & Related papers (2023-04-26T04:33:41Z)
Attention Scheme Inspired Softmax Regression [20.825033982038455]
Large language models (LLMs) have made transformed changes for human society. One of the key computation in LLMs is the softmax unit. In this work, inspired the softmax unit, we define a softmax regression problem.
arXiv Detail & Related papers (2023-04-20T15:50:35Z)
An Over-parameterized Exponential Regression [18.57735939471469]
Recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions. We define the neural function $F: mathbbRd times m times mathbbRd times mathbbRd times mathbbRd times mathbbRd times mathbbRd times mathbbRd times mathbbRd
arXiv Detail & Related papers (2023-03-29T07:29:07Z)
Solving Regularized Exp, Cosh and Sinh Regression Problems [40.47799094316649]
attention computation is a fundamental task for large language models such as Transformer, GPT-4 and ChatGPT. The straightforward method is to use the naive Newton's method.
arXiv Detail & Related papers (2023-03-28T04:26:51Z)
Private Isotonic Regression [54.32252900997422]
We consider the problem of isotonic regression over a partially ordered set (poset) $mathcalX$ and for any Lipschitz loss function. We obtain a pure-DP algorithm that has an expected excess empirical risk of roughly $mathrmwidth(mathcalX) cdot log|mathcalX| / n$, where $mathrmwidth(mathcalX)$ is the width of the poset. We show that the bounds above are essentially the best that can be
arXiv Detail & Related papers (2022-10-27T05:08:07Z)
Adaptive LASSO estimation for functional hidden dynamic geostatistical model [69.10717733870575]
We propose a novel model selection algorithm based on a penalized maximum likelihood estimator (PMLE) for functional hiddenstatistical models (f-HD) The algorithm is based on iterative optimisation and uses an adaptive least absolute shrinkage and selector operator (GMSOLAS) penalty function, wherein the weights are obtained by the unpenalised f-HD maximum-likelihood estimators.
arXiv Detail & Related papers (2022-08-10T19:17:45Z)
Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. Existing methods are either theoretically flawed or empirically ineffective for visual recognition. We propose a family of Softmax-Free Transformers (SOFT)
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
We study a function of the form $mathbfxmapstosigma(mathbfwcdotmathbfx)$ for monotone activations. The goal of the learner is to output a hypothesis vector $mathbfw$ that $F(mathbbw)=C, epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z)
Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector. We show that this dependence on $d$ is optimal, up to logarithmic factors. We also provide the first total sensitivity upper bound $O(dmax1,p/2log2 n)$ for loss functions with at most degree $p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z)
Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
Piecewise linear regression and classification [0.20305676256390928]
This paper proposes a method for solving multivariate regression and classification problems using piecewise linear predictors. A Python implementation of the algorithm described in this paper is available at http://cse.lab.imtlucca.it/bemporad/parc.
arXiv Detail & Related papers (2021-03-10T17:07:57Z)
Optimal Mean Estimation without a Variance [103.26777953032537]
We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist. We design an estimator which attains the smallest possible confidence interval as a function of $n,d,delta$.
arXiv Detail & Related papers (2020-11-24T22:39:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.