Local Convergence of Approximate Newton Method for Two Layer Nonlinear
Regression
- URL: http://arxiv.org/abs/2311.15390v1
- Date: Sun, 26 Nov 2023 19:19:02 GMT
- Title: Local Convergence of Approximate Newton Method for Two Layer Nonlinear
Regression
- Authors: Zhihang Li, Zhao Song, Zifan Wang, Junze Yin
- Abstract summary: The two-layer regression problem has been well studied in prior works.
The first layer is activated by a ReLU unit, and the second layer is activated by a softmax unit.
We prove that the Hessian of the loss function is positive definite and Lipschitz continuous under certain assumptions.
- Score: 21.849997443967705
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: There have been significant advancements made by large language models (LLMs)
in various aspects of our daily lives. LLMs serve as a transformative force in
natural language processing, finding applications in text generation,
translation, sentiment analysis, and question-answering. The accomplishments of
LLMs have led to a substantial increase in research efforts in this domain. One
specific two-layer regression problem has been well-studied in prior works,
where the first layer is activated by a ReLU unit, and the second layer is
activated by a softmax unit. While previous works provide a solid analysis of
building a two-layer regression, there is still a gap in the analysis of
constructing regression problems with more than two layers.
In this paper, we take a crucial step toward addressing this problem: we
provide an analysis of a two-layer regression problem. In contrast to previous
works, our first layer is activated by a softmax unit. This sets the stage for
future analyses of creating more activation functions based on the softmax
function. Rearranging the softmax function leads to significantly different
analyses. Our main results involve analyzing the convergence properties of an
approximate Newton method used to minimize the regularized training loss. We
prove that the Hessian of the loss function is positive definite and Lipschitz
continuous under certain assumptions. This enables us to establish
local convergence guarantees for the proposed training algorithm. Specifically,
with an appropriate initialization and after $O(\log(1/\epsilon))$ iterations,
our algorithm can find an $\epsilon$-approximate minimizer of the training loss
with high probability. Each iteration requires approximately $O(\mathrm{nnz}(C)
+ d^\omega)$ time, where $d$ is the model size, $C$ is the input matrix, and
$\omega < 2.374$ is the matrix multiplication exponent.
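To make the training step concrete, below is a minimal sketch of a locally convergent approximate Newton loop of the kind analyzed in the paper. The function names (`grad`, `hess`) and the toy regularized least-squares loss in the usage example are illustrative placeholders of our own, not the paper's construction; the $O(\mathrm{nnz}(C) + d^\omega)$ per-iteration cost quoted above would come from replacing the exact Hessian solve with a sketched or sampled approximation, which is omitted here.

```python
import numpy as np


def approximate_newton(grad, hess, x0, num_iters=20, ridge=1e-8):
    """Generic approximate Newton loop (illustrative sketch only).

    grad(x) -> gradient of the regularized training loss at x
    hess(x) -> exact or approximate Hessian at x; local convergence relies on
               it being positive definite and Lipschitz near the optimum.
    """
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        g = grad(x)
        H = hess(x)
        # Solve H p = g; the small ridge keeps the solve stable when H is
        # only an approximation of the true Hessian.
        p = np.linalg.solve(H + ridge * np.eye(H.shape[0]), g)
        x = x - p
    return x


if __name__ == "__main__":
    # Toy usage on 0.5*||C x - b||^2 + 0.5*lam*||x||^2, whose Hessian
    # C^T C + lam*I is positive definite (a stand-in for the paper's
    # softmax-based two-layer loss, not the actual objective).
    rng = np.random.default_rng(0)
    C = rng.normal(size=(50, 5))
    b = rng.normal(size=50)
    lam = 0.1
    grad = lambda x: C.T @ (C @ x - b) + lam * x
    hess = lambda x: C.T @ C + lam * np.eye(5)
    print(approximate_newton(grad, hess, np.zeros(5)))
```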
Related papers
- How to Inverting the Leverage Score Distribution? [16.744561210470632]
Despite leverage scores being widely used as a tool, in this paper, we study a novel problem, namely inverting the leverage score distribution.
We use iterative shrinking and the induction hypothesis to ensure global convergence rates for the Newton method.
This important study on inverting statistical leverage opens up numerous new applications in interpretation, data recovery, and security.
arXiv Detail & Related papers (2024-04-21T21:36:42Z) - A Unified Scheme of ResNet and Softmax [8.556540804058203]
We provide a theoretical analysis of the regression problem: $\| \langle \exp(Ax) + Ax, \mathbf{1}_n \rangle^{-1} ( \exp(Ax) + Ax ) - b \|_2^2$.
This regression problem is a unified scheme that combines softmax regression and ResNet, which has never been done before.
arXiv Detail & Related papers (2023-09-23T21:41:01Z) - Convergence of Two-Layer Regression with Nonlinear Units [10.295897511849034]
We introduce a greedy algorithm based on an approximate Newton method, which converges in the sense of the distance to the optimal solution.
We relax the Lipschitz condition and prove the convergence in the sense of loss value.
arXiv Detail & Related papers (2023-08-16T13:30:45Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Nonparametric regression with modified ReLU networks [77.34726150561087]
We consider regression estimation with modified ReLU neural networks in which network weight matrices are first modified by a function $\alpha$ before being multiplied by input vectors.
arXiv Detail & Related papers (2022-07-17T21:46:06Z) - ReLU Regression with Massart Noise [52.10842036932169]
We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data.
We focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model.
We develop an efficient algorithm that achieves exact parameter recovery in this model.
arXiv Detail & Related papers (2021-09-10T02:13:22Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z) - Learning to extrapolate using continued fractions: Predicting the
critical temperature of superconductor materials [5.905364646955811]
In the field of Artificial Intelligence (AI) and Machine Learning (ML), the approximation of unknown target functions $y = f(\mathbf{x})$ is a common objective.
We refer to $S$ as the training set and aim to identify a low-complexity mathematical model that can effectively approximate this target function for new instances $\mathbf{x}$.
arXiv Detail & Related papers (2020-11-27T04:57:40Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian.
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Piecewise Linear Regression via a Difference of Convex Functions [50.89452535187813]
We present a new piecewise linear regression methodology that utilizes fitting a difference of convex functions (DC functions) to the data.
We empirically validate the method, showing it to be practically implementable, and to have comparable performance to existing regression/classification methods on real-world datasets.
arXiv Detail & Related papers (2020-07-05T18:58:47Z) - Statistical-Query Lower Bounds via Functional Gradients [19.5924910463796]
We show that any statistical-query algorithm with tolerance $n^{-(1/\epsilon)^b}$ must use at least $2^{n^c \epsilon}$ queries for some constants $b, c > 0$.
Our results rule out general (as opposed to correlational) SQ learning algorithms, which is unusual for real-valued learning problems.
arXiv Detail & Related papers (2020-06-29T05:15:32Z)