Implicit bias of SGD in $L_{2}$-regularized linear DNNs: One-way jumps
from high to low rank
- URL: http://arxiv.org/abs/2305.16038v2
- Date: Fri, 29 Sep 2023 13:18:59 GMT
- Title: Implicit bias of SGD in $L_{2}$-regularized linear DNNs: One-way jumps
from high to low rank
- Authors: Zihan Wang, Arthur Jacot
- Abstract summary: In tasks such as matrix completion, the goal is to converge to the local minimum with the smallest rank that still fits the training data.
We show that with SGD, there is always a non-zero probability of jumping from a higher-rank minimum to a lower-rank one, but the probability of jumping back is zero.
- Score: 22.850359181621023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The $L_{2}$-regularized loss of Deep Linear Networks (DLNs) with more than
one hidden layer has multiple local minima, corresponding to matrices with
different ranks. In tasks such as matrix completion, the goal is to converge to
the local minimum with the smallest rank that still fits the training data.
While rank-underestimating minima can be avoided since they do not fit the
data, GD might get stuck at rank-overestimating minima. We show that with SGD,
there is always a non-zero probability of jumping from a higher-rank minimum to
a lower-rank one, but the probability of jumping back is zero. More precisely,
we define a sequence of sets $B_{1}\subset B_{2}\subset\cdots\subset B_{R}$ so
that $B_{r}$ contains all minima of rank $r$ or less (and not more) and is
absorbing for small enough ridge parameters $\lambda$ and learning rates
$\eta$: SGD has probability 0 of leaving $B_{r}$, and from any starting point
there is a non-zero probability that SGD enters $B_{r}$.
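To make the setting concrete, here is a minimal illustrative sketch (not the authors' code): a three-layer deep linear network $W_{3}W_{2}W_{1}$ trained by minibatch SGD on an $L_{2}$-regularized matrix-completion loss, while tracking the effective rank of the end-to-end matrix. All sizes, the step size, the initialization scale, and the rank threshold are arbitrary choices for illustration.

```python
import numpy as np

# Minimal sketch (not the authors' code): matrix completion with a three-layer
# deep linear network P = W3 @ W2 @ W1, trained by minibatch SGD on
#   L = 1/(2B) * sum over a minibatch of observed (i,j) of (P_ij - M_ij)^2
#       + (lam/2) * (||W1||_F^2 + ||W2||_F^2 + ||W3||_F^2),
# while tracking the effective rank of the end-to-end matrix P.
rng = np.random.default_rng(0)
n, true_rank, width = 20, 2, 20
lam, eta, steps, batch = 1e-3, 0.02, 20000, 8

# Rank-2 ground truth and a random set of observed entries.
M = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, n))
obs = np.argwhere(rng.random((n, n)) < 0.5)

# Moderately large initialization, so training can start near a high-rank point.
W1 = 0.3 * rng.standard_normal((width, n))
W2 = 0.3 * rng.standard_normal((width, width))
W3 = 0.3 * rng.standard_normal((n, width))

def effective_rank(A, tol=1e-2):
    """Number of singular values above an (arbitrary) threshold."""
    return int((np.linalg.svd(A, compute_uv=False) > tol).sum())

for t in range(steps):
    idx = obs[rng.choice(len(obs), size=batch, replace=False)]
    P = W3 @ W2 @ W1
    E = np.zeros((n, n))                      # dL/dP, supported on the minibatch
    E[idx[:, 0], idx[:, 1]] = (P - M)[idx[:, 0], idx[:, 1]] / batch
    g3 = E @ (W2 @ W1).T + lam * W3           # dL/dW3
    g2 = W3.T @ E @ W1.T + lam * W2           # dL/dW2
    g1 = (W3 @ W2).T @ E + lam * W1           # dL/dW1
    W1 -= eta * g1
    W2 -= eta * g2
    W3 -= eta * g3
    if t % 2000 == 0:
        print(f"step {t:6d}  effective rank = {effective_rank(W3 @ W2 @ W1)}")
```

According to the abstract's claim, such a run may jump to a lower effective rank during training but, once inside an absorbing set $B_{r}$, should not jump back up; the sketch only reports the rank and does not verify this.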
Related papers
- Low-rank Matrix Bandits with Heavy-tailed Rewards [55.03293214439741]
We study the problem of low-rank matrix bandits with heavy-tailed rewards (LowHTR).
By combining truncation of the observed payoffs with dynamic exploration, we propose a novel algorithm called LOTUS.
arXiv Detail & Related papers (2024-04-26T21:54:31Z) - Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery [45.601316850669406]
We present efficient algorithms for policy evaluation, best policy identification and regret minimization.
For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal.
All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix.
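As a rough, hypothetical illustration of that first spectral phase (not the paper's algorithm), the sketch below estimates the left and right singular subspaces of a low-rank mean-reward matrix by taking a truncated SVD of averaged noisy observations; the dimensions and noise model are made up for the example.

```python
import numpy as np

# Hypothetical sketch of a spectral first phase: form an entrywise estimate of
# the low-rank mean-reward matrix from averaged noisy pulls, then take its
# top-r left and right singular vectors as subspace estimates.
rng = np.random.default_rng(1)
n_left, n_right, r = 30, 40, 3

U_true = np.linalg.qr(rng.standard_normal((n_left, r)))[0]
V_true = np.linalg.qr(rng.standard_normal((n_right, r)))[0]
R = U_true @ np.diag([3.0, 2.0, 1.5]) @ V_true.T          # rank-3 mean rewards

# Simulate the entrywise average of `pulls` unit-variance noisy observations.
pulls = 50
R_hat = R + rng.standard_normal((n_left, n_right)) / np.sqrt(pulls)

U_hat, _, Vt_hat = np.linalg.svd(R_hat)
U_r, V_r = U_hat[:, :r], Vt_hat[:r, :].T                   # estimated subspaces

# Subspace error: spectral norm of the difference of orthogonal projectors.
err = np.linalg.norm(U_r @ U_r.T - U_true @ U_true.T, 2)
print(f"left singular-subspace error: {err:.3f}")
```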
arXiv Detail & Related papers (2024-02-24T06:36:08Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the iteration cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Convergence of Gradient Descent with Small Initialization for
Unregularized Matrix Completion [21.846732043706318]
We show that vanilla gradient descent provably converges to the ground truth $\mathrm{X}^{\star}$ without requiring any explicit regularization.
Surprisingly, neither the convergence rate nor the final accuracy depends on the over-parameterized search rank $r'$; they are governed only by the true rank $r$.
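A rough illustration of that setting (not the paper's analysis): vanilla gradient descent on the unregularized factored matrix-completion objective with an over-parameterized search rank $r' > r$ and a small random initialization, reporting the relative recovery error. The problem sizes, sampling rate, step size, and initialization scale are arbitrary.

```python
import numpy as np

# Illustrative sketch: vanilla gradient descent on the unregularized factored
# objective  f(U, V) = 0.5 * || mask * (U V^T - X_star) ||_F^2,
# with over-parameterized search rank r' > true rank r and small random init.
rng = np.random.default_rng(2)
n, r_true, r_search = 30, 2, 8

X_star = rng.standard_normal((n, r_true)) @ rng.standard_normal((r_true, n))
mask = (rng.random((n, n)) < 0.4).astype(float)            # observed entries

alpha = 1e-3                                               # small init scale
U = alpha * rng.standard_normal((n, r_search))
V = alpha * rng.standard_normal((n, r_search))

eta = 0.05
for t in range(20000):
    E = mask * (U @ V.T - X_star)                          # residual on observed entries
    U, V = U - eta * (E @ V), V - eta * (E.T @ U)          # gradient step on f
    if t % 4000 == 0:
        rel = np.linalg.norm(U @ V.T - X_star) / np.linalg.norm(X_star)
        print(f"step {t:6d}  relative error = {rel:.3e}")
```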
arXiv Detail & Related papers (2024-02-09T19:39:23Z) - Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement
Learning with Latent Low-Rank Structure [9.759209713196718]
We consider a class of MDPs for which the associated optimal $Q^{*}$ function is low rank, where the latent features are unknown.
We show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\,\mathrm{poly}(d,H)/\epsilon^{2}\right)$ for a rank
arXiv Detail & Related papers (2022-06-07T20:39:51Z) - Sharp Global Guarantees for Nonconvex Low-Rank Matrix Recovery in the
Overparameterized Regime [16.422215672356167]
We prove that it is possible for nonconvex low-rank matrix recovery to contain no spurious local minima.
Without explicit control over $r^{\star}\le r$, an RIP constant $\delta<1/2$ is both necessary and sufficient for the exact recovery of a rank-$r$ ground truth.
arXiv Detail & Related papers (2021-04-21T23:07:18Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov
Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z) - $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning algorithm enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z) - Average Case Column Subset Selection for Entrywise $\ell_1$-Norm Loss [76.02734481158458]
It is known that in the worst case, to obtain a good rank-$k$ approximation to a matrix, one needs an arbitrarily large $n^{\Omega(1)}$ number of columns.
We show that under certain minimal and realistic distributional settings, it is possible to obtain a $(k/\epsilon)$-approximation with a nearly linear running time and $\mathrm{poly}(k/\epsilon)+O(k\log n)$ columns.
This is the first algorithm of any kind for achieving a $(k/\epsilon)$-approximation for entrywise
arXiv Detail & Related papers (2020-04-16T22:57:06Z) - Learning Near Optimal Policies with Low Inherent Bellman Error [115.16037976819331]
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning.
We show that exploration is possible using only batch assumptions with an algorithm that achieves the optimal statistical rate for the setting we consider.
arXiv Detail & Related papers (2020-02-29T02:02:40Z)