Local Convergence of Approximate Newton Method for Two Layer Nonlinear
Regression
- URL: http://arxiv.org/abs/2311.15390v1
- Date: Sun, 26 Nov 2023 19:19:02 GMT
- Title: Local Convergence of Approximate Newton Method for Two Layer Nonlinear
Regression
- Authors: Zhihang Li, Zhao Song, Zifan Wang, Junze Yin
- Abstract summary: The two-layer regression problem has been well studied in prior works.
The first layer is activated by a ReLU unit, and the second layer is activated by a softmax unit.
We prove that the Hessian of the loss function is positive definite and Lipschitz continuous under certain assumptions.
- Score: 21.849997443967705
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: There have been significant advancements made by large language models (LLMs)
in various aspects of our daily lives. LLMs serve as a transformative force in
natural language processing, finding applications in text generation,
translation, sentiment analysis, and question-answering. The accomplishments of
LLMs have led to a substantial increase in research efforts in this domain. One
specific two-layer regression problem has been well-studied in prior works,
where the first layer is activated by a ReLU unit, and the second layer is
activated by a softmax unit. While previous works provide a solid analysis of
building a two-layer regression, there is still a gap in the analysis of
constructing regression problems with more than two layers.
In this paper, we take a crucial step toward addressing this problem: we
provide an analysis of a two-layer regression problem. In contrast to previous
works, our first layer is activated by a softmax unit. This sets the stage for
future analyses of creating more activation functions based on the softmax
function. Rearranging the softmax function leads to significantly different
analyses. Our main results involve analyzing the convergence properties of an
approximate Newton method used to minimize the regularized training loss. We
prove that the Hessian of the loss function is positive definite and Lipschitz
continuous under certain assumptions. This enables us to establish
local convergence guarantees for the proposed training algorithm. Specifically,
with an appropriate initialization and after $O(\log(1/\epsilon))$ iterations,
our algorithm can find an $\epsilon$-approximate minimizer of the training loss
with high probability. Each iteration requires approximately $O(\mathrm{nnz}(C)
+ d^\omega)$ time, where $d$ is the model size, $C$ is the input matrix, and
$\omega < 2.374$ is the matrix multiplication exponent.
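To make the training step concrete, below is a minimal sketch of a locally convergent approximate Newton loop of the kind analyzed in the paper. The function names (`grad`, `hess`) and the toy regularized least-squares loss in the usage example are illustrative placeholders of our own, not the paper's construction; the $O(\mathrm{nnz}(C) + d^\omega)$ per-iteration cost quoted above would come from replacing the exact Hessian solve with a sketched or sampled approximation, which is omitted here.

```python
import numpy as np


def approximate_newton(grad, hess, x0, num_iters=20, ridge=1e-8):
    """Generic approximate Newton loop (illustrative sketch only).

    grad(x) -> gradient of the regularized training loss at x
    hess(x) -> exact or approximate Hessian at x; local convergence relies on
               it being positive definite and Lipschitz near the optimum.
    """
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        g = grad(x)
        H = hess(x)
        # Solve H p = g; the small ridge keeps the solve stable when H is
        # only an approximation of the true Hessian.
        p = np.linalg.solve(H + ridge * np.eye(H.shape[0]), g)
        x = x - p
    return x


if __name__ == "__main__":
    # Toy usage on 0.5*||C x - b||^2 + 0.5*lam*||x||^2, whose Hessian
    # C^T C + lam*I is positive definite (a stand-in for the paper's
    # softmax-based two-layer loss, not the actual objective).
    rng = np.random.default_rng(0)
    C = rng.normal(size=(50, 5))
    b = rng.normal(size=50)
    lam = 0.1
    grad = lambda x: C.T @ (C @ x - b) + lam * x
    hess = lambda x: C.T @ C + lam * np.eye(5)
    print(approximate_newton(grad, hess, np.zeros(5)))
```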
Related papers
- How to Inverting the Leverage Score Distribution? [16.744561210470632]
Despite leverage scores being widely used as a tool, in this paper, we study a novel problem, namely inverting the leverage score distribution.
We use iterative shrinking and the induction hypothesis to ensure global convergence rates for the Newton method.
This important study on inverting statistical leverage opens up numerous new applications in interpretation, data recovery, and security.
arXiv Detail & Related papers (2024-04-21T21:36:42Z) - A Unified Scheme of ResNet and Softmax [8.556540804058203]
We provide a theoretical analysis of the regression problem: $\| \langle \exp(Ax) + Ax, \mathbf{1}_n \rangle^{-1} ( \exp(Ax) + Ax ) - b \|_2^2$.
This regression problem is a unified scheme that combines softmax regression and ResNet, which has never been done before.
arXiv Detail & Related papers (2023-09-23T21:41:01Z) - Convergence of Two-Layer Regression with Nonlinear Units [10.295897511849034]
We introduce a greedy algorithm based on an approximate Newton method, which converges in the sense of the distance to the optimal solution.
We relax the Lipschitz condition and prove the convergence in the sense of loss value.
arXiv Detail & Related papers (2023-08-16T13:30:45Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Nonparametric regression with modified ReLU networks [77.34726150561087]
We consider regression estimation with modified ReLU neural networks in which network weight matrices are first modified by a function $\alpha$ before being multiplied by input vectors.
arXiv Detail & Related papers (2022-07-17T21:46:06Z) - ReLU Regression with Massart Noise [52.10842036932169]
We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data.
We focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model.
We develop an efficient algorithm that achieves exact parameter recovery in this model.
arXiv Detail & Related papers (2021-09-10T02:13:22Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z) - Learning to extrapolate using continued fractions: Predicting the
critical temperature of superconductor materials [5.905364646955811]
In the field of Artificial Intelligence (AI) and Machine Learning (ML), the approximation of unknown target functions $y = f(\mathbf{x})$ is a common objective.
We refer to $S$ as the training set and aim to identify a low-complexity mathematical model that can effectively approximate this target function for new instances $\mathbf{x}$.
arXiv Detail & Related papers (2020-11-27T04:57:40Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian.
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Piecewise Linear Regression via a Difference of Convex Functions [50.89452535187813]
We present a new piecewise linear regression methodology that utilizes fitting a difference of convex functions (DC functions) to the data.
We empirically validate the method, showing it to be practically implementable, and to have comparable performance to existing regression/classification methods on real-world datasets.
arXiv Detail & Related papers (2020-07-05T18:58:47Z) - Statistical-Query Lower Bounds via Functional Gradients [19.5924910463796]
We show that any statistical-query algorithm with tolerance $n^{-(1/\epsilon)^b}$ must use at least $2^{n^c \epsilon}$ queries for some constants $b, c > 0$.
Our results rule out general (as opposed to correlational) SQ learning algorithms, which is unusual for real-valued learning problems.
arXiv Detail & Related papers (2020-06-29T05:15:32Z)