Analytic Study of Double Descent in Binary Classification: The Impact of Loss
- URL: http://arxiv.org/abs/2001.11572v1
- Date: Thu, 30 Jan 2020 21:29:03 GMT
- Title: Analytic Study of Double Descent in Binary Classification: The Impact of Loss
- Authors: Ganesh Kini and Christos Thrampoulidis
- Abstract summary: We show that the DD phenomenon persists, but we also identify several differences compared to logistic loss.
We further study the dependence of DD curves on the size of the training set.
- Score: 34.100845063076534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extensive empirical evidence reveals that, for a wide range of different
learning methods and datasets, the risk curve exhibits a double-descent (DD)
trend as a function of the model size. In a recent paper
[Zeyu,Kammoun,Thrampoulidis,2019] the authors studied binary linear
classification models and showed that the test error of gradient descent (GD)
with logistic loss undergoes a DD. In this paper, we complement these results
by extending them to GD with square loss. We show that the DD phenomenon
persists, but we also identify several differences compared to logistic loss.
This emphasizes that crucial features of DD curves (such as their transition
threshold and global minima) depend both on the training data and on the
learning algorithm. We further study the dependence of DD curves on the size of
the training set. Similar to our earlier work, our results are analytic: we
plot the DD curves by first deriving sharp asymptotics for the test error under
Gaussian features. Albeit simple, the models permit a principled study of DD
features, the outcomes of which theoretically corroborate related empirical
findings occurring in more complex learning tasks.
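As a rough, hypothetical illustration of the setting described in the abstract (not a reproduction of the paper's analytic asymptotics), the sketch below traces a model-wise double-descent curve for binary linear classification with Gaussian features. It uses the minimum-norm least-squares solution as a stand-in for gradient descent with square loss started from zero, and fits on the first d features of the data for increasing d; the feature-subset construction and all sizes are illustrative assumptions, not the paper's exact model.

```python
# Minimal double-descent simulation sketch (illustrative assumptions throughout).
# Model size is varied by fitting on the first d of D available Gaussian features;
# the min-norm least-squares fit stands in for GD on square loss from zero init.
import numpy as np

rng = np.random.default_rng(0)

D = 400          # total number of available Gaussian features
n_train = 100    # training-set size (interpolation threshold is near d = n_train)
n_test = 20000   # test-set size for estimating the 0/1 error

# Ground-truth linear model generating the binary labels.
w_star = rng.normal(size=D) / np.sqrt(D)
X_train = rng.normal(size=(n_train, D))
X_test = rng.normal(size=(n_test, D))
y_train = np.sign(X_train @ w_star)
y_test = np.sign(X_test @ w_star)

print(" d   test 0/1 error")
for d in [10, 25, 50, 75, 90, 100, 110, 150, 200, 300, 400]:
    # Min-norm least-squares fit on the first d features; for d > n_train this
    # is the interpolating solution that GD with square loss converges to.
    beta, *_ = np.linalg.lstsq(X_train[:, :d], y_train, rcond=None)
    test_err = np.mean(np.sign(X_test[:, :d] @ beta) != y_test)
    print(f"{d:3d}   {test_err:.3f}")
```

With these settings, the test 0/1 error typically rises toward a peak near the interpolation threshold d ≈ n_train and then descends again in the overparameterized regime, mirroring the qualitative DD trend the paper characterizes analytically.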
Related papers
- Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration [74.09687562334682]
We introduce a novel training data attribution method called Debias and Denoise Attribution (DDA).
Our method significantly outperforms existing approaches, achieving an average AUC of 91.64%.
DDA exhibits strong generality and scalability across various sources and different-scale models like LLaMA2, QWEN2, and Mistral.
arXiv Detail & Related papers (2024-10-02T07:14:26Z)
- Dataset Distillation from First Principles: Integrating Core Information Extraction and Purposeful Learning [10.116674195405126]
We argue that a precise characterization of the underlying optimization problem must specify the inference task associated with the application of interest.
Our formalization reveals novel applications of DD across different modeling environments.
We present numerical results for two case studies important in contemporary settings.
arXiv Detail & Related papers (2024-09-02T18:11:15Z)
- Unveiling Multiple Descents in Unsupervised Autoencoders [13.180761892449736]
We show for the first time that double and triple descent can be observed with nonlinear unsupervised autoencoders.
Through extensive experiments on both synthetic and real datasets, we uncover model-wise, epoch-wise, and sample-wise double descent.
arXiv Detail & Related papers (2024-06-17T16:24:23Z)
- PairCFR: Enhancing Model Training on Paired Counterfactually Augmented Data through Contrastive Learning [49.60634126342945]
Counterfactually Augmented Data (CAD) involves creating new data samples by applying minimal yet sufficient modifications to flip the label of existing data samples to other classes.
Recent research reveals that training with CAD may lead models to overly focus on modified features while ignoring other important contextual information.
We employ contrastive learning to promote global feature alignment in addition to learning counterfactual clues.
arXiv Detail & Related papers (2024-06-09T07:29:55Z)
- Simple Ingredients for Offline Reinforcement Learning [86.1988266277766]
Offline reinforcement learning algorithms have proven effective on datasets highly connected to the target downstream task.
We show that existing methods struggle with diverse data: their performance considerably deteriorates as data collected for related but different tasks is simply added to the offline buffer.
We show that scale, more than algorithmic considerations, is the key factor influencing performance.
arXiv Detail & Related papers (2024-03-19T18:57:53Z)
- Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
- BCD Nets: Scalable Variational Approaches for Bayesian Causal Discovery [97.79015388276483]
A structural equation model (SEM) is an effective framework to reason over causal relationships represented via a directed acyclic graph (DAG).
Recent advances enabled effective maximum-likelihood point estimation of DAGs from observational data.
We propose BCD Nets, a variational framework for estimating a distribution over DAGs characterizing a linear-Gaussian SEM.
arXiv Detail & Related papers (2021-12-06T03:35:21Z)
- Learning Curves for SGD on Structured Features [23.40229188549055]
We show that modeling the geometry of the data in the induced feature space is crucial to accurately predicting the test error throughout learning.
arXiv Detail & Related papers (2021-06-04T20:48:20Z)
- Optimization Variance: Exploring Generalization Properties of DNNs [83.78477167211315]
The test error of a deep neural network (DNN) often demonstrates double descent.
We propose a novel metric, optimization variance (OV), to measure the diversity of model updates.
arXiv Detail & Related papers (2021-06-03T09:34:17Z)
- The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent [28.148743710421932]
The mini-batch stochastic gradient descent (SGD) algorithm is widely used for training machine learning models.
We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks (see the sketch after this list).
arXiv Detail & Related papers (2020-04-27T20:06:11Z)
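To make the last entry above more concrete, here is a small, self-contained sketch of the question it studies; the setup, data sizes, and evaluation point are hypothetical rather than taken from that paper. It estimates how far the mini-batch gradient of a linear-regression loss deviates from the full-batch gradient as the mini-batch size grows.

```python
# Hedged, hypothetical sketch (not the cited paper's code): empirical spread of
# the mini-batch gradient around the full-batch gradient for a linear-regression
# loss, evaluated at a fixed parameter vector, as a function of batch size.
import numpy as np

rng = np.random.default_rng(1)

n, d = 5000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

w = np.zeros(d)  # fixed point at which gradient noise is measured

def minibatch_grad(batch_idx):
    """Gradient of 0.5 * mean squared error over one mini-batch."""
    Xb, yb = X[batch_idx], y[batch_idx]
    return Xb.T @ (Xb @ w - yb) / len(batch_idx)

full_grad = minibatch_grad(np.arange(n))

print("batch size   mean squared deviation from full gradient")
for b in [1, 4, 16, 64, 256]:
    devs = []
    for _ in range(500):
        idx = rng.choice(n, size=b, replace=False)
        g = minibatch_grad(idx)
        devs.append(np.sum((g - full_grad) ** 2))
    print(f"{b:10d}   {np.mean(devs):.4f}")
```

Under these assumptions the mean squared deviation shrinks roughly in proportion to 1/(batch size), the basic variance effect such analyses build on.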
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.