On Optimal Early Stopping: Over-informative versus Under-informative
Parametrization
- URL: http://arxiv.org/abs/2202.09885v2
- Date: Wed, 23 Feb 2022 22:54:02 GMT
- Title: On Optimal Early Stopping: Over-informative versus Under-informative
Parametrization
- Authors: Ruoqi Shen, Liyao Gao, Yi-An Ma
- Abstract summary: We develop theoretical results to reveal the relationship between the optimal early stopping time and model dimension.
We demonstrate experimentally that our theoretical results on the optimal early stopping time correspond to the training process of deep neural networks.
- Score: 13.159777131162961
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Early stopping is a simple and widely used method to prevent over-training
neural networks. We develop theoretical results to reveal the relationship
between the optimal early stopping time and model dimension as well as sample
size of the dataset for certain linear models. Our results demonstrate two very
different behaviors when the model dimension exceeds the number of features
versus the opposite scenario. While most previous works on linear models focus
on the latter setting, we observe that the dimension of the model often exceeds
the number of features arising from data in common deep learning tasks and
propose a model to study this setting. We demonstrate experimentally that our
theoretical results on the optimal early stopping time correspond to the training
process of deep neural networks.
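To make the quantity under study concrete, here is a minimal sketch of optimal early stopping in an overparametrized linear regression (model dimension d exceeding sample size n); the dimensions, step size, and noise level are illustrative choices, not the paper's exact model:

```python
import numpy as np

# Illustrative setup (not the paper's exact model): overparametrized
# linear regression, d > n, trained by gradient descent on noisy labels.
rng = np.random.default_rng(0)
n, d, noise = 50, 200, 0.5          # samples, model dimension, label noise
X = rng.normal(size=(n, d)) / np.sqrt(d)
w_star = rng.normal(size=d)
y = X @ w_star + noise * rng.normal(size=n)

X_test = rng.normal(size=(1000, d)) / np.sqrt(d)
y_test = X_test @ w_star            # noiseless test targets

w, lr = np.zeros(d), 1.0
test_err = []
for t in range(2000):
    w -= lr * X.T @ (X @ w - y) / n          # gradient step on train MSE
    test_err.append(np.mean((X_test @ w - y_test) ** 2))

t_opt = int(np.argmin(test_err))             # optimal early stopping time
print(f"optimal stopping step: {t_opt}, "
      f"test MSE there: {test_err[t_opt]:.3f}, at end: {test_err[-1]:.3f}")
```

In this noisy overparametrized regime the test curve is typically U-shaped, so the argmin is an interior stopping time rather than the end of training.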
Related papers
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
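As a rough illustration of the self-consuming loop above (not the paper's framework), the following sketch repeatedly refits a kernel density estimate to samples drawn from the previous generation's estimate and tracks how the learned distribution drifts; the sample sizes and Gaussian ground truth are arbitrary:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Illustrative self-consuming loop: each generation fits a KDE to samples
# drawn from the previous generation's KDE. Sizes/bandwidths are arbitrary.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=500)         # "real" data from N(0, 1)

samples = real
for gen in range(5):
    kde = gaussian_kde(samples)
    samples = kde.resample(500, seed=gen)[0]  # train next gen on own output
    # Track drift of the learned distribution's spread vs. the truth.
    print(f"gen {gen}: sample std = {samples.std():.3f} (true 1.000)")
```

The spread inflates generation after generation because kernel smoothing adds variance each round, a simple instance of the error propagation the summary mentions.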
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
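A minimal random-feature sketch of this effect, assuming full-batch gradient descent with repeated passes over a fixed training set (the sizes and target function are illustrative, not the paper's scaling setup):

```python
import numpy as np

# Illustrative random feature model: fixed random first layer, trainable
# readout, gradient descent with repeated passes over a finite training set.
rng = np.random.default_rng(1)
n, d, p = 100, 20, 400                       # samples, input dim, features
W = rng.normal(size=(p, d)) / np.sqrt(d)     # frozen random features
target = rng.normal(size=d)

def features(X):
    return np.tanh(X @ W.T) / np.sqrt(p)

X, Xt = rng.normal(size=(n, d)), rng.normal(size=(2000, d))
y, yt = np.tanh(X @ target), np.tanh(Xt @ target)

a, lr = np.zeros(p), 2.0
F, Ft = features(X), features(Xt)
for step in range(1, 3001):
    a -= lr * F.T @ (F @ a - y) / n          # reuse the same n samples
    if step % 1000 == 0:
        tr = np.mean((F @ a - y) ** 2)
        te = np.mean((Ft @ a - yt) ** 2)
        print(f"step {step}: train {tr:.4f}  test {te:.4f}  gap {te - tr:.4f}")
```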
- Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation [53.27596811146316]
Diffusion models operate over a sequence of timesteps rather than the instantaneous input-output relationships of earlier settings.
We present Diffusion-TracIn, which incorporates these temporal dynamics, and observe that samples' loss gradient norms are highly dependent on the timestep.
We introduce Diffusion-ReTrac as a re-normalized adaptation that enables the retrieval of training samples more targeted to the test sample of interest.
arXiv Detail & Related papers (2024-01-17T07:58:18Z)
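For orientation, a generic TracIn-style influence computation on a linear model, with a norm-renormalized variant loosely analogous in spirit to the re-normalization described above; this is an assumption-laden sketch, not the paper's Diffusion-TracIn/ReTrac:

```python
import numpy as np

# Generic TracIn-style influence on a linear model (illustrative only; the
# paper's methods operate over diffusion timesteps, not this toy setting).
rng = np.random.default_rng(2)
n, d, lr = 20, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
x_test, y_test = rng.normal(size=d), 0.0

w = np.zeros(d)
influence = np.zeros(n)            # plain TracIn accumulator
influence_renorm = np.zeros(n)     # renormalized variant (an assumption)
for step in range(200):
    i = step % n                              # cycle through samples
    g_train = (X[i] @ w - y[i]) * X[i]        # per-sample loss gradient
    g_test = (x_test @ w - y_test) * x_test
    score = lr * g_train @ g_test             # TracIn: lr * grad dot grad
    influence[i] += score
    influence_renorm[i] += score / (np.linalg.norm(g_train) + 1e-12)
    w -= lr * g_train                         # SGD step on that sample

print("top influence (plain):      ", int(np.argmax(np.abs(influence))))
print("top influence (renormalized):", int(np.argmax(np.abs(influence_renorm))))
```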
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can achieve final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
- Uncertainty-Aware Time-to-Event Prediction using Deep Kernel Accelerated Failure Time Models [11.171712535005357]
We propose Deep Kernel Accelerated Failure Time models for the time-to-event prediction task.
Our model shows better point-estimate performance than recurrent neural network baselines in experiments on two real-world datasets.
arXiv Detail & Related papers (2021-07-26T14:55:02Z)
- Model-free prediction of emergence of extreme events in a parametrically driven nonlinear dynamical system by Deep Learning [0.0]
We predict the emergence of extreme events in a parametrically driven nonlinear dynamical system.
We use three Deep Learning models, namely Multi-Layer Perceptron, Convolutional Neural Network and Long Short-Term Memory.
We find that the Long Short-Term Memory model serves as the best of the three for forecasting the chaotic time series.
arXiv Detail & Related papers (2021-07-14T14:48:57Z)
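A minimal sketch of the forecasting setup, assuming a PyTorch LSTM trained for one-step prediction on a chaotic logistic map rather than the paper's parametrically driven system:

```python
import torch
import torch.nn as nn

# Illustrative LSTM one-step forecaster on a chaotic logistic map.
torch.manual_seed(0)
x = torch.empty(2000)
x[0] = 0.3
for t in range(1999):
    x[t + 1] = 4.0 * x[t] * (1.0 - x[t])      # chaotic logistic map

win = 10
seqs = x.unfold(0, win + 1, 1)                # sliding windows
inp = seqs[:, :win].unsqueeze(-1).contiguous()
tgt = seqs[:, -1:].contiguous()

class Forecaster(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(1, 32, batch_first=True)
        self.head = nn.Linear(32, 1)
    def forward(self, s):
        out, _ = self.lstm(s)
        return self.head(out[:, -1])          # predict the next value

model = Forecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(inp), tgt)
    loss.backward()
    opt.step()
print(f"final one-step MSE: {loss.item():.5f}")
```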
- Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks [38.153825455980645]
Recent empirical evidence indicates that the practice of overparameterization not only benefits training large models, but also assists, perhaps counterintuitively, in building lightweight models.
This paper sheds light on these empirical findings by theoretically characterizing the high-dimensional asymptotics of model pruning.
We analytically identify regimes in which, even if the location of the most informative features is known, we are better off fitting a large model and then pruning.
arXiv Detail & Related papers (2020-12-16T05:13:30Z)
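The comparison above can be sketched as follows, assuming a noisy sparse linear model; which strategy wins depends on the sample size, dimension, and noise level, which is precisely what the paper characterizes:

```python
import numpy as np

# Illustrative "fit large, then prune" comparison (a sketch of the regime
# the summary describes, not the paper's exact asymptotic setup).
rng = np.random.default_rng(3)
n, d, k = 80, 300, 20                     # samples, total / informative dims
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:k] = rng.normal(size=k)
y = X @ w_true + 0.5 * rng.normal(size=n)
Xt = rng.normal(size=(5000, d))
yt = Xt @ w_true

# (a) small model on the known informative coordinates
w_small = np.linalg.lstsq(X[:, :k], y, rcond=None)[0]
err_small = np.mean((Xt[:, :k] @ w_small - yt) ** 2)

# (b) min-norm fit on all d coordinates, then magnitude-prune to k weights
w_big = np.linalg.lstsq(X, y, rcond=None)[0]
keep = np.argsort(-np.abs(w_big))[:k]
w_pruned = np.zeros(d)
w_pruned[keep] = w_big[keep]
err_pruned = np.mean((Xt @ w_pruned - yt) ** 2)

print(f"small-model test MSE: {err_small:.3f}")
print(f"pruned-big test MSE:  {err_pruned:.3f}")
```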
- A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z)
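The connection can be made exact for conjugate models: in Bayesian linear regression the log marginal likelihood equals the sum of one-step-ahead log predictive scores, so a model whose sequential predictions improve quickly scores a higher marginal likelihood. A minimal sketch of that identity (prior, noise level, and data are illustrative):

```python
import numpy as np
from scipy.stats import norm

# Prequential identity behind the "training speed" view: for Bayesian
# linear regression, log p(y_1..n) = sum_i log p(y_i | y_<i).
rng = np.random.default_rng(4)
n, d, sigma2 = 50, 5, 0.25
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + np.sqrt(sigma2) * rng.normal(size=n)

m, S = np.zeros(d), np.eye(d)           # prior N(0, I) on the weights
log_ml = 0.0
for xi, yi in zip(X, y):
    pred_var = sigma2 + xi @ S @ xi     # posterior predictive variance
    log_ml += norm.logpdf(yi, loc=xi @ m, scale=np.sqrt(pred_var))
    # conjugate (recursive least squares) update with this one observation
    Sxi = S @ xi
    gain = 1.0 / (sigma2 + xi @ Sxi)
    m = m + Sxi * gain * (yi - xi @ m)
    S = S - np.outer(Sxi, Sxi) * gain

print(f"log marginal likelihood (prequential sum): {log_ml:.2f}")
```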
- A Multi-Channel Neural Graphical Event Model with Negative Evidence [76.51278722190607]
Event datasets are sequences of events of various types occurring irregularly over the timeline.
We propose a non-parametric deep neural network approach to estimate the underlying intensity functions.
arXiv Detail & Related papers (2020-02-21T23:10:50Z)
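As a generic illustration of intensity estimation for irregular event times (a plain inhomogeneous-Poisson sketch, not the paper's multi-channel model with negative evidence):

```python
import torch
import torch.nn as nn

# Minimal neural intensity estimator for event times on [0, T].
# Log-likelihood: sum_i log lam(t_i) - integral_0^T lam(t) dt.
torch.manual_seed(0)
T = 10.0
events = torch.tensor([0.5, 0.7, 2.1, 2.3, 2.4, 6.0, 8.5, 8.6, 8.9])

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

def intensity(t):
    # softplus keeps the intensity strictly positive
    return nn.functional.softplus(net(t[:, None]))[:, 0]

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
grid = torch.linspace(0.0, T, 200)          # for the compensator integral
for step in range(500):
    opt.zero_grad()
    log_lik = intensity(events).log().sum() - torch.trapz(intensity(grid), grid)
    (-log_lik).backward()                   # maximize the likelihood
    opt.step()
print(f"final log-likelihood: {log_lik.item():.3f}")
```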