What killed the Convex Booster ?
- URL: http://arxiv.org/abs/2205.09628v1
- Date: Thu, 19 May 2022 15:42:20 GMT
- Title: What killed the Convex Booster ?
- Authors: Yishay Mansour and Richard Nock and Robert C. Williamson
- Abstract summary: A landmark negative result of Long and Servedio established a worst-case spectacular failure of a supervised learning trio.
We argue that the source of the negative result lies in the dark side of a pervasive -- and otherwise prized -- aspect of ML: parameterisation.
- Score: 70.04715330065275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A landmark negative result of Long and Servedio established a worst-case
spectacular failure of a supervised learning trio (loss, algorithm, model)
otherwise praised for its high precision machinery. Hundreds of papers followed
up on the two suspected culprits: the loss (for being convex) and/or the
algorithm (for fitting a classical boosting blueprint). Here, we call to the
half-century+ founding theory of losses for class probability estimation
(properness), an extension of Long and Servedio's results and a new general
boosting algorithm to demonstrate that the real culprit in their specific
context was in fact the (linear) model class. We advocate for a more general
standpoint on the problem as we argue that the source of the negative result
lies in the dark side of a pervasive -- and otherwise prized -- aspect of ML:
\textit{parameterisation}.
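For readers unfamiliar with the properness notion the abstract leans on, here is the standard definition for binary class probability estimation, stated in our own notation rather than the paper's (a sketch for context, not material from the paper):

```latex
% Properness for binary class probability estimation (standard definition;
% notation is ours, not the paper's).
% A loss $\ell(\hat{p}, y)$ on probability estimates $\hat{p} \in [0,1]$ and
% labels $y \in \{0,1\}$ is proper when the true class probability $\eta$
% minimises the pointwise conditional risk:
\[
  \eta \;\in\; \mathop{\mathrm{argmin}}_{\hat{p} \in [0,1]}
  \Big\{ \eta \, \ell(\hat{p}, 1) + (1 - \eta) \, \ell(\hat{p}, 0) \Big\}
  \qquad \mbox{for every } \eta \in [0,1].
\]
% It is strictly proper when $\eta$ is the unique minimiser; the log loss and
% the squared (Brier) loss are classical strictly proper examples.
```

With this classical theory in hand, the abstract's claim is that the blame in Long and Servedio's construction shifts from the convex loss and the boosting blueprint to the restricted (linear) model class.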
Related papers
- Sparse Double Descent: Where Network Pruning Aggravates Overfitting [8.425040193238777]
We report an unexpected sparse double descent phenomenon: as model sparsity is increased via network pruning, test performance first gets worse.
We propose a novel learning-distance interpretation: the curve of the $\ell_2$ learning distance of sparse models may correlate well with the sparse double descent curve.
arXiv Detail & Related papers (2022-06-17T11:02:15Z)
- Lassoed Tree Boosting [53.56229983630983]
We prove that a gradient boosted tree algorithm with early stopping achieves a faster-than-$n^{-1/4}$ $L^2$ convergence rate in the large nonparametric space of càdlàg functions of bounded sectional variation.
Our convergence proofs are based on a novel, general theorem on early stopping with empirical loss minimizers of nested Donsker classes.
arXiv Detail & Related papers (2022-05-22T00:34:41Z)
- On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that the generalization benefit of large learning rates, first observed in deep learning, can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
arXiv Detail & Related papers (2022-02-28T13:01:04Z)
- On the Role of Optimization in Double Descent: A Least Squares Study [30.44215064390409]
We show an excess risk bound for the gradient descent solution of the least squares objective.
We find that in case of noiseless regression, double descent is explained solely by optimization-related quantities.
We empirically explore if our predictions hold for neural networks.
arXiv Detail & Related papers (2021-07-27T09:13:11Z)
- Efficient First-Order Contextual Bandits: Prediction, Allocation, and Triangular Discrimination [82.52105963476703]
A recurring theme in statistical learning, online learning, and beyond is that faster convergence rates are possible for problems with low noise.
First-order guarantees are relatively well understood in statistical and online learning.
We show that the logarithmic loss and an information-theoretic quantity called the triangular discrimination play a fundamental role in obtaining first-order guarantees.
arXiv Detail & Related papers (2021-07-05T19:20:34Z)
- Being Properly Improper [36.52509571098292]
We analyse class probability-based losses when they are stripped of their mandatory properness (a short numerical sketch of properness follows this list).
We show that a natural extension of a half-century-old loss introduced by S. Arimoto is twist proper.
We then turn to a theory that has provided some of the best off-the-shelf algorithms for proper losses: boosting.
arXiv Detail & Related papers (2021-06-18T05:00:15Z)
- Provable tradeoffs in adversarially robust classification [96.48180210364893]
We develop and leverage new tools, including recent breakthroughs from probability theory on robust isoperimetry.
Our results reveal fundamental tradeoffs between standard and robust accuracy that grow when data is imbalanced.
arXiv Detail & Related papers (2020-06-09T09:58:19Z)
- All your loss are belong to Bayes [28.393499629583786]
Loss functions are a cornerstone of machine learning and the starting point of most algorithms.
We introduce a trick on squared Gaussian Processes to obtain a random process whose paths are compliant source functions.
Experimental results demonstrate substantial improvements over the state of the art.
arXiv Detail & Related papers (2020-06-08T14:31:21Z)
- Supervised Learning: No Loss No Cry [51.07683542418145]
Supervised learning requires the specification of a loss function to minimise.
This paper revisits the SLIsotron algorithm of Kakade et al. (2011) through a novel lens.
We show how it provides a principled procedure for learning the loss.
arXiv Detail & Related papers (2020-02-10T05:30:52Z)
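As a complement to the properness definition given after the abstract above, the following minimal numerical sketch (our own illustration using NumPy, not code from any paper listed here) checks that the log loss is proper while the absolute loss on probabilities is not:

```python
# Minimal numerical sketch of properness (illustrative only; not code from
# any of the papers above). A loss for class probability estimation is proper
# when predicting the true class probability eta minimises the expected loss
# E_{Y ~ Bernoulli(eta)}[loss(p, Y)].
import numpy as np

def expected_loss(loss, eta, p):
    """Expected loss when Y ~ Bernoulli(eta) and we predict probability p."""
    return eta * loss(p, 1) + (1.0 - eta) * loss(p, 0)

def log_loss(p, y):
    """Logarithmic (cross-entropy) loss, a canonical strictly proper loss."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def abs_loss(p, y):
    """Absolute loss on probabilities; not proper for class probability estimation."""
    return np.abs(y - p)

eta = 0.3                           # true positive-class probability
grid = np.linspace(0.01, 0.99, 99)  # candidate probability predictions

log_risk = expected_loss(log_loss, eta, grid)
abs_risk = expected_loss(abs_loss, eta, grid)

print("log loss minimised at p =", grid[np.argmin(log_risk)])  # ~0.30, recovers eta
print("abs loss minimised at p =", grid[np.argmin(abs_risk)])  # 0.01, pushed to the boundary
```

For $\eta = 0.3$ the expected log loss is minimised at $\hat{p} \approx 0.3$, whereas the expected absolute loss is linear in $\hat{p}$ and is minimised at the boundary, so it does not recover the true class probability.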
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.