Deep learning: a statistical viewpoint
- URL: http://arxiv.org/abs/2103.09177v1
- Date: Tue, 16 Mar 2021 16:26:36 GMT
- Title: Deep learning: a statistical viewpoint
- Authors: Peter L. Bartlett and Andrea Montanari and Alexander Rakhlin
- Abstract summary: Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems.
We conjecture that specific principles underlie these phenomena.
- Score: 120.94133818355645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The remarkable practical success of deep learning has revealed some major
surprises from a theoretical perspective. In particular, simple gradient
methods easily find near-optimal solutions to non-convex optimization problems,
and despite giving a near-perfect fit to training data without any explicit
effort to control model complexity, these methods exhibit excellent predictive
accuracy. We conjecture that specific principles underlie these phenomena: that
overparametrization allows gradient methods to find interpolating solutions,
that these methods implicitly impose regularization, and that
overparametrization leads to benign overfitting. We survey recent theoretical
progress that provides examples illustrating these principles in simpler
settings. We first review classical uniform convergence results and why they
fall short of explaining aspects of the behavior of deep learning methods. We
give examples of implicit regularization in simple settings, where gradient
methods lead to minimal norm functions that perfectly fit the training data.
Then we review prediction methods that exhibit benign overfitting, focusing on
regression problems with quadratic loss. For these methods, we can decompose
the prediction rule into a simple component that is useful for prediction and a
spiky component that is useful for overfitting but, in a favorable setting,
does not harm prediction accuracy. We focus specifically on the linear regime
for neural networks, where the network can be approximated by a linear model.
In this regime, we demonstrate the success of gradient flow, and we consider
benign overfitting with two-layer networks, giving an exact asymptotic analysis
that precisely demonstrates the impact of overparametrization. We conclude by
highlighting the key challenges that arise in extending these insights to
realistic deep learning settings.
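The abstract's claim about implicit regularization, that gradient methods on overparametrized linear models converge to the minimum-norm function that perfectly fits the training data, can be checked numerically. The sketch below is an illustrative toy (the data, dimensions, and step size are assumptions, not from the paper): gradient descent from zero initialization on an overparametrized least-squares problem interpolates the training data and coincides with the minimum-norm interpolant given by the pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100  # overparametrized: many more features than samples
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[0] = 1.0  # simple (sparse) ground truth
y = X @ w_star + 0.1 * rng.standard_normal(n)

# Gradient descent on the squared loss, starting from zero
w = np.zeros(d)
lr = 1e-2
for _ in range(50_000):
    w -= lr * X.T @ (X @ w - y) / n

# Minimum-norm interpolating solution via the pseudo-inverse
w_min_norm = np.linalg.pinv(X) @ y

# Gradient descent fits the training data exactly ...
print(np.max(np.abs(X @ w - y)))          # ~ 0
# ... and lands on the minimum-norm interpolant
print(np.max(np.abs(w - w_min_norm)))     # ~ 0
```

Since the iterates stay in the row space of `X` when initialized at zero, the limit is the minimum-norm solution; this is the simplest instance of the implicit regularization the survey describes.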
Related papers
- Embedding generalization within the learning dynamics: An approach based-on sample path large deviation theory [0.0]
We consider an empirical risk perturbation based learning problem that exploits methods from a continuous-time perspective.
We provide an estimate in the small noise limit based on the Freidlin-Wentzell theory of large deviations.
We also present a computational algorithm that solves the corresponding variational problem, leading to optimal point estimates.
arXiv Detail & Related papers (2024-08-04T23:31:35Z) - A Rate-Distortion View of Uncertainty Quantification [36.85921945174863]
In supervised learning, understanding an input's proximity to the training data can help a model decide whether it has sufficient evidence for reaching a reliable prediction.
We introduce Distance Aware Bottleneck (DAB), a new method for enriching deep neural networks with this property.
arXiv Detail & Related papers (2024-06-16T01:33:22Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Scalable Bayesian Meta-Learning through Generalized Implicit Gradients [64.21628447579772]
The implicit Bayesian meta-learning (iBaML) method not only broadens the scope of learnable priors but also quantifies the associated uncertainty.
Analytical error bounds are established to demonstrate the precision and efficiency of the generalized implicit gradient over the explicit one.
arXiv Detail & Related papers (2023-03-31T02:10:30Z) - Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient
for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research.
We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift.
Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z) - Calibrated and Sharp Uncertainties in Deep Learning via Simple Density
Estimation [7.184701179854522]
This paper argues for reasoning about uncertainty in terms of these properties and proposes simple algorithms for enforcing them in deep learning.
Our methods focus on the strongest notion of calibration--distribution calibration--and enforce it by fitting a low-dimensional density or quantile function with a neural estimator.
Empirically, we find that our methods improve predictive uncertainties on several tasks with minimal computational and implementation overhead.
arXiv Detail & Related papers (2021-12-14T06:19:05Z) - From inexact optimization to learning via gradient concentration [22.152317081922437]
In this paper, we investigate the phenomenon in the context of linear models with smooth loss functions.
We propose a proof technique combining ideas from inexact optimization and probability theory, specifically gradient concentration.
arXiv Detail & Related papers (2021-06-09T21:23:29Z) - Gradient Descent for Deep Matrix Factorization: Dynamics and Implicit
Bias towards Low Rank [1.9350867959464846]
In deep learning, gradient descent tends to prefer solutions which generalize well.
In this paper we analyze the dynamics of gradient descent in the simplified setting of linear networks and of an estimation problem.
arXiv Detail & Related papers (2020-11-27T15:08:34Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z) - Regularizing Meta-Learning via Gradient Dropout [102.29924160341572]
Meta-learning models are prone to overfitting when there are insufficient training tasks for the meta-learners to generalize.
We introduce a simple yet effective method to alleviate the risk of overfitting for gradient-based meta-learning.
arXiv Detail & Related papers (2020-04-13T10:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.