Wide flat minima and optimal generalization in classifying
high-dimensional Gaussian mixtures
- URL: http://arxiv.org/abs/2010.14761v2
- Date: Tue, 17 Nov 2020 16:06:55 GMT
- Title: Wide flat minima and optimal generalization in classifying
high-dimensional Gaussian mixtures
- Authors: Carlo Baldassi, Enrico M. Malatesta, Matteo Negri, Riccardo Zecchina
- Abstract summary: We show that there exist configurations that achieve the Bayes-optimal generalization error, even in the case of unbalanced clusters.
We also consider the algorithmically relevant case of targeting wide flat minima of the mean squared error loss.
- Score: 8.556763944288116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyze the connection between minimizers with good generalizing
properties and high local entropy regions of a threshold-linear classifier in
Gaussian mixtures with the mean squared error loss function. We show that there
exist configurations that achieve the Bayes-optimal generalization error, even
in the case of unbalanced clusters. We explore analytically the error-counting
loss landscape in the vicinity of a Bayes-optimal solution, and show that the
closer we get to such configurations, the higher the local entropy, implying
that the Bayes-optimal solution lies within a wide flat region. We also
consider the algorithmically relevant case of targeting wide flat minima of the
(differentiable) mean squared error loss. Our analytical and numerical results
show not only that the dependence on the norm of the weights is mild in the
balanced case, but also that the performance can be improved in the unbalanced
case.
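To make the setting concrete, the following is a minimal numerical sketch (not the authors' replica-theory analysis; the dimension, signal strength, and learning-rate choices are illustrative assumptions): a linear classifier is trained by gradient descent on the MSE loss over a balanced two-cluster Gaussian mixture x = y*mu + z with z ~ N(0, I), and its test error is compared with the Bayes-optimal error Phi(-||mu||) for this symmetric case.

import math
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 200, 2000, 20000
mu = rng.standard_normal(d)
mu *= 1.5 / np.linalg.norm(mu)            # set the signal strength ||mu|| = 1.5

def sample(n):
    # Balanced, symmetric mixture: x = y*mu + z, z ~ N(0, I), y uniform in {-1, +1}.
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu + rng.standard_normal((n, d))
    return x, y

x_tr, y_tr = sample(n_train)
x_te, y_te = sample(n_test)

# Plain gradient descent on the (differentiable) mean squared error loss.
w, lr = np.zeros(d), 0.05
for _ in range(500):
    grad = x_tr.T @ (x_tr @ w - y_tr) / n_train
    w -= lr * grad

test_err = np.mean(np.sign(x_te @ w) != y_te)
bayes_err = 0.5 * math.erfc(np.linalg.norm(mu) / math.sqrt(2))  # Phi(-||mu||)
print(f"MSE-trained error: {test_err:.4f}   Bayes-optimal error: {bayes_err:.4f}")

The sketch omits the threshold (bias) term; fitting such a term becomes relevant in the unbalanced case discussed above.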
Related papers
- Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum [56.37522020675243]
We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems.
We show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks.
arXiv Detail & Related papers (2024-10-22T10:19:27Z) - Learning minimal volume uncertainty ellipsoids [1.6795461001108096]
We consider the problem of learning uncertainty regions for parameter estimation problems.
Under the assumption of jointly Gaussian data, we prove that the optimal ellipsoid is centered around the conditional mean.
In more practical cases, we propose a differentiable optimization approach for approximately computing the optimal ellipsoids.
arXiv Detail & Related papers (2024-05-03T19:11:35Z) - Smoothing the Edges: Smooth Optimization for Sparse Regularization using Hadamard Overparametrization [10.009748368458409]
We present a framework for smooth optimization of explicitly regularized objectives for (structured) sparsity.
Our method enables fully differentiable approximation-free optimization and is thus compatible with the ubiquitous gradient descent paradigm in deep learning.
arXiv Detail & Related papers (2023-07-07T13:06:12Z) - Wasserstein Distributionally Robust Estimation in High Dimensions:
Performance Analysis and Optimal Hyperparameter Tuning [0.0]
We propose a Wasserstein distributionally robust estimation framework to estimate an unknown parameter from noisy linear measurements.
We focus on the task of analyzing the squared error performance of such estimators.
We show that the squared error can be recovered as the solution of a convex-concave optimization problem.
arXiv Detail & Related papers (2022-06-27T13:02:59Z) - Distributed Sketching for Randomized Optimization: Exact
Characterization, Concentration and Lower Bounds [54.51566432934556]
We consider distributed optimization methods for problems where forming the Hessian is computationally challenging.
We leverage randomized sketches for reducing the problem dimensions as well as preserving privacy and improving straggler resilience in asynchronous distributed systems.
arXiv Detail & Related papers (2022-03-18T05:49:13Z) - Hyperspectral Image Denoising Using Non-convex Local Low-rank and Sparse
Separation with Spatial-Spectral Total Variation Regularization [49.55649406434796]
We propose a novel non-convex approach to robust principal component analysis for HSI denoising.
We develop accurate approximations to both rank and sparse components.
Experiments on both simulated and real HSIs demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-01-08T11:48:46Z) - Minimax Supervised Clustering in the Anisotropic Gaussian Mixture Model:
A new take on Robust Interpolation [5.98367009147573]
We study the supervised clustering problem under the two-component anisotropic Gaussian mixture model.
We show that in the high-dimensional regime, the linear discriminant analysis (LDA) classifier turns out to be sub-optimal in the minimax sense.
arXiv Detail & Related papers (2021-11-13T05:19:37Z) - Optimal oracle inequalities for solving projected fixed-point equations [53.31620399640334]
We study methods that use a collection of random observations to compute approximate solutions by searching over a known low-dimensional subspace of the Hilbert space.
We show how our results precisely characterize the error of a class of temporal difference learning methods for the policy evaluation problem with linear function approximation.
arXiv Detail & Related papers (2020-12-09T20:19:32Z) - Entropic gradient descent algorithms and wide flat minima [6.485776570966397]
We show analytically that there exist Bayes-optimal pointwise estimators which correspond to minimizers belonging to wide flat regions.
We extend the analysis to the deep learning scenario through extensive numerical validations.
An easy-to-compute flatness measure shows a clear correlation with test accuracy (a sketch of one such measure appears after this list).
arXiv Detail & Related papers (2020-06-14T13:22:19Z) - Non-Convex SGD Learns Halfspaces with Adversarial Label Noise [50.659479930171585]
We study the problem of learning homogeneous halfspaces with adversarial label noise in the distribution-specific model.
For a broad family of structured distributions, we show that non-convex SGD efficiently converges to a halfspace whose misclassification error is competitive with that of the best halfspace.
arXiv Detail & Related papers (2020-06-11T18:55:59Z) - Revisiting SGD with Increasingly Weighted Averaging: Optimization and
Generalization Perspectives [50.12802772165797]
The averaging technique combines all iterative solutions into a single solution.
Experiments demonstrate the trade-offs and the effectiveness of increasingly weighted averaging compared with other averaging schemes.
arXiv Detail & Related papers (2020-03-09T18:14:00Z)
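As a companion to the flatness-measure remark in the entry above, here is a sketch of one simple flatness proxy (an illustrative assumption, not necessarily the measure used in the cited papers): the local energy of a solution w, i.e. its average training error after Gaussian perturbations of the weights. Wide flat (high local entropy) minima keep this quantity close to the unperturbed error over a range of perturbation strengths.

import numpy as np

def local_energy(w, x, y, sigma, n_samples=50, rng=None):
    # Mean 0/1 training error of sign(x @ (w + noise)) over random perturbations;
    # the per-coordinate noise scale sigma * ||w|| / sqrt(d) is an illustrative choice.
    rng = rng or np.random.default_rng(0)
    scale = sigma * np.linalg.norm(w) / np.sqrt(w.size)
    errs = [np.mean(np.sign(x @ (w + scale * rng.standard_normal(w.size))) != y)
            for _ in range(n_samples)]
    return float(np.mean(errs))

# Usage with w, x_tr, y_tr from the sketch following the abstract:
# for sigma in (0.1, 0.3, 1.0):
#     print(sigma, local_energy(w, x_tr, y_tr, sigma))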
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.