Depth Dependence of $\mu$P Learning Rates in ReLU MLPs
- URL: http://arxiv.org/abs/2305.07810v1
- Date: Sat, 13 May 2023 01:10:49 GMT
- Title: Depth Dependence of $\mu$P Learning Rates in ReLU MLPs
- Authors: Samy Jelassi, Boris Hanin, Ziwei Ji, Sashank J. Reddi, Srinadh
Bhojanapalli, Sanjiv Kumar
- Abstract summary: We study the dependence on $n$ and $L$ of the maximal update ($\mu$P) learning rate.
We find that it has a non-trivial dependence on $L$, scaling like $L^{-3/2}$.
- Score: 72.14317069090407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this short note we consider random fully connected ReLU networks of width
$n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose
is to study the dependence on $n$ and $L$ of the maximal update ($\mu$P)
learning rate, the largest learning rate for which the mean squared change in
pre-activations after one step of gradient descent remains uniformly bounded at
large $n,L$. As in prior work on $\mu$P by Yang et al., we find that this
maximal update learning rate is independent of $n$ for all but the first and
last layer weights. However, we find that it has a non-trivial dependence on
$L$, scaling like $L^{-3/2}.$
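As a concrete illustration of the result above, here is a minimal sketch (assuming PyTorch; the names `make_mlp`, `hidden_layer_lr`, and `eta_base` are hypothetical and not taken from the paper) of a width-$n$, depth-$L$ ReLU MLP with a mean-field-style initialization whose learning rate is scaled by $L^{-3/2}$. A full $\mu$P setup would additionally give the first- and last-layer weights their own $n$-dependent factors, which this sketch omits.

```python
# Illustrative sketch only: a depth-L, width-n ReLU MLP with a mean-field-style
# initialization, trained with a learning rate scaled as L^{-3/2}.
import torch
import torch.nn as nn

def make_mlp(d_in: int, d_out: int, n: int, L: int) -> nn.Sequential:
    """Fully connected ReLU network of width n and depth L (hypothetical helper)."""
    layers, width_in = [], d_in
    for _ in range(L):
        linear = nn.Linear(width_in, n, bias=False)
        # Mean-field-style scaling (one common convention): entries of order 1/width.
        nn.init.normal_(linear.weight, std=1.0 / width_in)
        layers += [linear, nn.ReLU()]
        width_in = n
    layers.append(nn.Linear(width_in, d_out, bias=False))
    return nn.Sequential(*layers)

def hidden_layer_lr(eta_base: float, L: int) -> float:
    """Depth-corrected learning rate: eta ~ eta_base * L^{-3/2}."""
    return eta_base * L ** (-1.5)

if __name__ == "__main__":
    n, L = 256, 16
    model = make_mlp(d_in=32, d_out=1, n=n, L=L)
    eta = hidden_layer_lr(eta_base=1.0, L=L)
    # For simplicity one rate is applied to all parameters here; the paper's
    # analysis singles out the first and last layers as the only n-dependent ones.
    optimizer = torch.optim.SGD(model.parameters(), lr=eta)
    print(f"depth {L}: learning rate ~ {eta:.4g}")
```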
Related papers
- Linear $Q$-Learning Does Not Diverge: Convergence Rates to a Bounded Set [34.129520133741124]
This paper establishes the first $L^2$ convergence rate of linear $Q$-learning to a bounded set.
All we need is an $\epsilon$-softmax behavior policy with an adaptive temperature.
arXiv Detail & Related papers (2025-01-31T16:10:50Z) - Learning Networks from Wide-Sense Stationary Stochastic Processes [7.59499154221528]
A key inference problem here is to learn edge connectivity from node outputs (potentials).
We use Whittle's maximum likelihood estimator (MLE) to learn the support of $L^{\ast}$ from temporally correlated samples.
We show that the MLE problem is strictly convex, admitting a unique solution.
arXiv Detail & Related papers (2024-12-04T23:14:00Z) - Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z) - Maximal Initial Learning Rates in Deep ReLU Networks [32.157430904535126]
We introduce the maximal initial learning rate $\eta^{\ast}$.
We observe that in constant-width fully-connected ReLU networks, $\eta^{\ast}$ behaves differently from the maximum learning rate later in training.
arXiv Detail & Related papers (2022-12-14T15:58:37Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$
Geometry [69.24618367447101]
Up to logarithmic factors, the optimal excess population loss of any $(\varepsilon,\delta)$-differentially private algorithm is $\sqrt{\log(d)/n} + \sqrt{d}/(\varepsilon n).$
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $\sqrt{\log(d)/n} + (\log(d)/(\varepsilon n))^{2/3}.$
arXiv Detail & Related papers (2021-03-02T06:53:44Z) - $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic variant of $Q$-learning enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}(H)}{\Delta_{\min}}\log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.