When Does Preconditioning Help or Hurt Generalization?
- URL: http://arxiv.org/abs/2006.10732v4
- Date: Tue, 8 Dec 2020 19:12:44 GMT
- Title: When Does Preconditioning Help or Hurt Generalization?
- Authors: Shun-ichi Amari, Jimmy Ba, Roger Grosse, Xuechen Li, Atsushi Nitanda,
Taiji Suzuki, Denny Wu, Ji Xu
- Abstract summary: We show how the implicit bias of first- and second-order methods affects the comparison of generalization properties.
We discuss several approaches to manage the bias-variance tradeoff, and the potential benefit of interpolating between GD and NGD.
- Score: 74.25170084614098
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While second order optimizers such as natural gradient descent (NGD) often
speed up optimization, their effect on generalization has been called into
question. This work presents a more nuanced view on how the \textit{implicit
bias} of first- and second-order methods affects the comparison of
generalization properties. We provide an exact asymptotic bias-variance
decomposition of the generalization error of overparameterized ridgeless
regression under a general class of preconditioner $\boldsymbol{P}$, and
consider the inverse population Fisher information matrix (used in NGD) as a
particular example. We determine the optimal $\boldsymbol{P}$ for both the bias
and variance, and find that the relative generalization performance of
different optimizers depends on the label noise and the "shape" of the signal
(true parameters): when the labels are noisy, the model is misspecified, or the
signal is misaligned with the features, NGD can achieve lower risk; conversely,
GD generalizes better than NGD under clean labels, a well-specified model, or
aligned signal. Based on this analysis, we discuss several approaches to manage
the bias-variance tradeoff, and the potential benefit of interpolating between
GD and NGD. We then extend our analysis to regression in the reproducing kernel
Hilbert space and demonstrate that preconditioned GD can decrease the
population risk faster than GD. Lastly, we empirically compare the
generalization error of first- and second-order optimizers in neural network
experiments, and observe robust trends matching our theoretical analysis.
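As a rough illustration of the setting above (not the authors' code), the sketch below runs preconditioned gradient descent on an overparameterized ridgeless least-squares problem. It uses the preconditioner $\boldsymbol{P} = \Sigma^{-\alpha}$, where $\Sigma$ is the population feature covariance (which plays the role of the population Fisher information in a linear-Gaussian model), so that $\alpha = 0$ recovers GD ($\boldsymbol{P} = I$) and $\alpha = 1$ gives an NGD-like update ($\boldsymbol{P} = \Sigma^{-1}$). The interpolation scheme, step size, and problem sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def preconditioned_gd(X, y, Sigma, alpha=0.0, lr=0.1, n_steps=1000):
    """Run theta <- theta - lr * P @ grad with P = Sigma^{-alpha}.

    alpha = 0 corresponds to plain GD (P = I); alpha = 1 to an NGD-like
    update (P = Sigma^{-1}); intermediate alpha interpolates between them.
    """
    n, d = X.shape
    evals, evecs = np.linalg.eigh(Sigma)
    P = (evecs * evals ** (-alpha)) @ evecs.T  # Sigma^{-alpha}
    theta = np.zeros(d)
    for _ in range(n_steps):
        grad = X.T @ (X @ theta - y) / n       # least-squares gradient
        theta -= lr * P @ grad
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 200, 50                             # overparameterized: d > n
    evals = 1.0 / np.arange(1, d + 1)          # anisotropic feature covariance
    Sigma = np.diag(evals)
    X = rng.normal(size=(n, d)) * np.sqrt(evals)
    theta_star = rng.normal(size=d)
    y = X @ theta_star + 0.5 * rng.normal(size=n)   # noisy labels
    for a in (0.0, 0.5, 1.0):
        theta_hat = preconditioned_gd(X, y, Sigma, alpha=a)
        risk = (theta_hat - theta_star) @ Sigma @ (theta_hat - theta_star)
        print(f"alpha={a:.1f}  population excess risk={risk:.3f}")
```

Sweeping $\alpha$ under noisy versus clean labels is a quick way to probe the tradeoff the abstract describes: in this toy setup the NGD-like end tends to help when label noise dominates, while plain GD tends to do better with clean, well-aligned signal.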
Related papers
- Estimating Generalization Performance Along the Trajectory of Proximal SGD in Robust Regression [4.150180443030652]
We introduce estimators that precisely track the generalization error of the iterates along the trajectory of the iterative algorithm.
The results are illustrated through several examples, including Huber regression, pseudo-Huber regression, and their penalized variants with non-smooth regularizers.
arXiv Detail & Related papers (2024-10-03T16:13:42Z)
- Risk Bounds of Accelerated SGD for Overparameterized Linear Regression [75.27846230182885]
Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning.
Existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization.
arXiv Detail & Related papers (2023-11-23T23:02:10Z)
- Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central to preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
- CASTLE: Regularization via Auxiliary Causal Graph Discovery [89.74800176981842]
We introduce Causal Structure Learning (CASTLE) regularization and propose to regularize a neural network by jointly learning the causal relationships between variables.
CASTLE efficiently reconstructs only the features in the causal DAG that have a causal neighbor, whereas reconstruction-based regularizers suboptimally reconstruct all input features.
arXiv Detail & Related papers (2020-09-28T09:49:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.