An Over-parameterized Exponential Regression
- URL: http://arxiv.org/abs/2303.16504v1
- Date: Wed, 29 Mar 2023 07:29:07 GMT
- Title: An Over-parameterized Exponential Regression
- Authors: Yeqi Gao, Sridhar Mahadevan, Zhao Song
- Abstract summary: Recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions.
We define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \rightarrow \mathbb{R}$ using an exponential activation function.
- Score: 18.57735939471469
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Over the past few years, there has been a significant amount of research
focused on studying the ReLU activation function, with the aim of achieving
neural network convergence through over-parametrization. However, recent
developments in the field of Large Language Models (LLMs) have sparked interest
in the use of exponential activation functions, specifically in the attention
mechanism.
Mathematically, we define the neural function $F: \mathbb{R}^{d \times m}
\times \mathbb{R}^d \rightarrow \mathbb{R}$ using an exponential activation
function. Given a set of labeled data points $\{(x_1, y_1), (x_2, y_2),
\dots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$, where $n$ denotes
the number of data points, $F(W(t),x)$ can be expressed as $F(W(t),x) :=
\sum_{r=1}^m a_r \exp(\langle w_r(t), x \rangle)$, where $m$ is the number of
neurons and $w_r(t)$ are the weights at time $t$. It is standard in the
literature that the $a_r$ are fixed weights that never change during training.
We initialize the weights $W(0) \in \mathbb{R}^{d \times m}$ from random
Gaussian distributions, such that $w_r(0) \sim \mathcal{N}(0, I_d)$, and
initialize each $a_r$ from a random sign distribution, for $r \in [m]$.
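For concreteness, here is a minimal NumPy sketch of this model and its initialization (a sketch under our own naming assumptions; `init_network` and `forward` are hypothetical helpers, not code from the paper):

```python
import numpy as np

def init_network(d, m, rng):
    """Initialize the over-parameterized exponential network.

    Each w_r(0) ~ N(0, I_d), and each a_r is a random sign in
    {-1, +1} that stays fixed throughout training.
    """
    W = rng.standard_normal((d, m))      # W(0) in R^{d x m}; column r is w_r(0)
    a = rng.choice([-1.0, 1.0], size=m)  # fixed random-sign output weights
    return W, a

def forward(W, a, x):
    """F(W, x) = sum_{r=1}^m a_r * exp(<w_r, x>)."""
    return float(np.sum(a * np.exp(W.T @ x)))

# Example usage (d and m are arbitrary illustrative sizes).
rng = np.random.default_rng(0)
W, a = init_network(d=4, m=64, rng=rng)
x = rng.standard_normal(4)
print(forward(W, a, x))
```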
Using the gradient descent algorithm, we can find a weight matrix $W(T)$ such
that $\| F(W(T), X) - y \|_2 \leq \epsilon$ holds with probability $1-\delta$, where
$\epsilon \in (0,0.1)$ and $m = \Omega(n^{2+o(1)}\log(n/\delta))$. To optimize
the over-parameterization bound $m$, we employ several tight analysis
techniques from previous studies [Song and Yang arXiv 2019, Munteanu, Omlor,
Song and Woodruff ICML 2022].
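The training procedure behind this guarantee is plain gradient descent on the squared loss, updating only the $w_r$ while the signs $a_r$ stay fixed; the per-neuron gradient is $\nabla_{w_r} L(W) = \sum_{i=1}^n (F(W, x_i) - y_i)\, a_r \exp(\langle w_r, x_i \rangle)\, x_i$. A hedged sketch of that loop (the step size `eta`, iteration cap `T`, and stopping rule are illustrative choices, not the paper's parameters):

```python
import numpy as np

def gradient_descent(W, a, X, y, eta=1e-3, T=10_000, eps=0.1):
    """Gradient descent on L(W) = 0.5 * ||F(W, X) - y||_2^2.

    X has shape (n, d) with rows x_i; only W is updated, while the
    random signs a stay fixed, as in the setting above. eta, T, and
    the stopping rule are illustrative, not the paper's parameters.
    """
    for _ in range(T):
        E = np.exp(X @ W)          # (n, m): E[i, r] = exp(<w_r(t), x_i>)
        residual = E @ a - y       # (n,):  F(W(t), x_i) - y_i
        if np.linalg.norm(residual) <= eps:
            break
        # grad[:, r] = sum_i residual_i * a_r * exp(<w_r, x_i>) * x_i
        grad = X.T @ (E * residual[:, None] * a[None, :])
        W = W - eta * grad
    return W
```

In the over-parameterized regime $m = \Omega(n^{2+o(1)}\log(n/\delta))$, the abstract's claim is that this loop reaches $\|F(W(T), X) - y\|_2 \leq \epsilon$ with probability $1-\delta$.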
Related papers
- Iterative thresholding for non-linear learning in the strong $\varepsilon$-contamination model [3.309767076331365]
We derive approximation bounds for learning single neuron models using thresholded descent.
We also study the linear regression problem, where $\sigma(\mathbf{x}) = \mathbf{x}$.
arXiv Detail & Related papers (2024-09-05T16:59:56Z) - Sample-Efficient Linear Regression with Self-Selection Bias [7.605563562103568]
We consider the problem of linear regression with self-selection bias in the unknown-index setting.
We provide a novel and near-optimally sample-efficient (in terms of $k$) algorithm to recover $\mathbf{w}_1, \ldots, \mathbf{w}_k$.
Our algorithm succeeds under significantly relaxed noise assumptions, and therefore also succeeds in the related setting of max-linear regression.
arXiv Detail & Related papers (2024-02-22T02:20:24Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Optimal Estimator for Linear Regression with Shuffled Labels [17.99906229036223]
This paper considers the task of linear regression with shuffled labels.
$\mathbf{Y} \in \mathbb{R}^{n \times m}$, $\mathbf{\Pi} \in \mathbb{R}^{n \times p}$, $\mathbf{B} \in \mathbb{R}^{p \times m}$, and $\mathbf{W} \in \mathbb{R}^{n \times m}$, respectively.
arXiv Detail & Related papers (2023-10-02T16:44:47Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Fast $(1+\varepsilon)$-Approximation Algorithms for Binary Matrix
Factorization [54.29685789885059]
We introduce efficient $(1+\varepsilon)$-approximation algorithms for the binary matrix factorization (BMF) problem.
The goal is to approximate $\mathbf{A}$ as a product of low-rank factors.
Our techniques generalize to other common variants of the BMF problem.
arXiv Detail & Related papers (2023-06-02T18:55:27Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{OPT} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Algorithms and Hardness for Linear Algebra on Geometric Graphs [14.822517769254352]
We show that the exponential dependence on the dimension $d$ in the celebrated fast multipole method of Greengard and Rokhlin cannot be improved.
This is the first formal limitation proven about fast multipole methods.
arXiv Detail & Related papers (2020-11-04T18:35:02Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.