Cross-Entropy Loss Functions: Theoretical Analysis and Applications
- URL: http://arxiv.org/abs/2304.07288v2
- Date: Tue, 20 Jun 2023 00:48:23 GMT
- Title: Cross-Entropy Loss Functions: Theoretical Analysis and Applications
- Authors: Anqi Mao, Mehryar Mohri, Yutao Zhong
- Abstract summary: We present a theoretical analysis of a broad family of loss functions, comp-sum losses, that includes cross-entropy (or logistic loss), generalized cross-entropy, the mean absolute error, and other cross-entropy-like loss functions.
We show that these loss functions are beneficial in the adversarial setting by proving that they admit $H$-consistency bounds.
This leads to new adversarial robustness algorithms that consist of minimizing a regularized smooth adversarial comp-sum loss.
- Score: 27.3569897539488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-entropy is a widely used loss function in applications. It coincides
with the logistic loss applied to the outputs of a neural network, when the
softmax is used. But, what guarantees can we rely on when using cross-entropy
as a surrogate loss? We present a theoretical analysis of a broad family of
loss functions, comp-sum losses, that includes cross-entropy (or logistic
loss), generalized cross-entropy, the mean absolute error and other
cross-entropy-like loss functions. We give the first $H$-consistency bounds for
these loss functions. These are non-asymptotic guarantees that upper bound the
zero-one loss estimation error in terms of the estimation error of a surrogate
loss, for the specific hypothesis set $H$ used. We further show that our bounds
are tight. These bounds depend on quantities called minimizability gaps. To
make them more explicit, we give a specific analysis of these gaps for comp-sum
losses. We also introduce a new family of loss functions, smooth adversarial
comp-sum losses, that are derived from their comp-sum counterparts by adding in
a related smooth term. We show that these loss functions are beneficial in the
adversarial setting by proving that they admit $H$-consistency bounds. This
leads to new adversarial robustness algorithms that consist of minimizing a
regularized smooth adversarial comp-sum loss. While our main purpose is a
theoretical analysis, we also present an extensive empirical analysis comparing
comp-sum losses. We further report the results of a series of experiments
demonstrating that our adversarial robustness algorithms outperform the current
state-of-the-art, while also achieving a superior non-adversarial accuracy.
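To make the abstract's central guarantee more concrete, an $H$-consistency bound has, schematically, the shape below. This is a simplified sketch in our own notation, not a statement from the paper; the exact form of the non-decreasing function $\Gamma$ and of the minimizability gaps $\mathcal{M}$ is given there.

```latex
% Schematic H-consistency bound (simplified, our notation): for every h in H,
\[
  \mathcal{R}_{\ell_{0\text{-}1}}(h) - \mathcal{R}^{*}_{\ell_{0\text{-}1},H}
  + \mathcal{M}_{\ell_{0\text{-}1},H}
  \;\le\;
  \Gamma\!\left( \mathcal{R}_{\ell}(h) - \mathcal{R}^{*}_{\ell,H}
  + \mathcal{M}_{\ell,H} \right)
\]
% where \ell is the surrogate (comp-sum) loss, \mathcal{R}^{*}_{\cdot,H} the
% best-in-class expected loss within H, \mathcal{M}_{\cdot,H} the associated
% minimizability gap, and \Gamma a non-decreasing function depending on \ell.
```

The comp-sum family itself is easy to illustrate. The snippet below is not the authors' code; it is a minimal sketch that evaluates three representative members of the family on softmax outputs, with the loss definitions taken as common-usage assumptions (cross-entropy, generalized cross-entropy with parameter q, and an MAE-type loss).

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(scores, y):
    """Logistic / cross-entropy loss: -log p_y(x)."""
    p = softmax(scores)
    return -np.log(p[np.arange(len(y)), y])

def generalized_cross_entropy(scores, y, q=0.7):
    """Generalized cross-entropy: (1 - p_y(x)^q) / q.
    Recovers cross-entropy as q -> 0 and an MAE-type loss at q = 1."""
    p = softmax(scores)
    p_y = p[np.arange(len(y)), y]
    return (1.0 - p_y ** q) / q

def mae_loss(scores, y):
    """MAE-type loss on the softmax output: 1 - p_y(x),
    i.e. half the L1 distance to the one-hot target."""
    p = softmax(scores)
    return 1.0 - p[np.arange(len(y)), y]

# Example: two samples, three classes.
scores = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
labels = np.array([0, 2])
print(cross_entropy(scores, labels))
print(generalized_cross_entropy(scores, labels, q=0.7))
print(mae_loss(scores, labels))
```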
Related papers
- LEARN: An Invex Loss for Outlier Oblivious Robust Online Optimization [56.67706781191521]
We study a robust online optimization framework in which an adversary can introduce outliers by corrupting the loss functions in an arbitrary number of rounds k, unknown to the learner.
arXiv Detail & Related papers (2024-08-12T17:08:31Z) - Byzantine-resilient Federated Learning With Adaptivity to Data Heterogeneity [54.145730036889496]
This paper deals with federated learning (FL) in the presence of malicious Byzantine attacks and data heterogeneity.
A novel Robust Average Gradient Algorithm (RAGA) is proposed, which leverages robust aggregation and allows flexible selection of the number of local-update rounds.
arXiv Detail & Related papers (2024-03-20T08:15:08Z) - Expressive Losses for Verified Robustness via Convex Combinations [67.54357965665676]
We study the relationship between the over-approximation coefficient and performance profiles across different expressive losses.
We show that, while expressivity is essential, better approximations of the worst-case loss are not necessarily linked to superior robustness-accuracy trade-offs.
arXiv Detail & Related papers (2023-05-23T12:20:29Z) - An Analysis of Loss Functions for Binary Classification and Regression [0.0]
This paper explores connections between margin-based loss functions and consistency in binary classification and regression applications.
A simple characterization for conformable (consistent) loss functions is given, which allows for straightforward comparison of different losses.
A relation between the margin and standardized logistic regression residuals is derived, demonstrating that all margin-based losses can be viewed as loss functions of squared standardized logistic regression residuals.
arXiv Detail & Related papers (2023-01-18T16:26:57Z) - Loss Minimization through the Lens of Outcome Indistinguishability [11.709566373491619]
We present a new perspective on convex loss minimization and the recent notion of Omniprediction.
By design, Loss OI implies omniprediction in a direct and intuitive manner.
We show that Loss OI can be achieved for the important set of losses arising from Generalized Linear Models, without requiring full multicalibration.
arXiv Detail & Related papers (2022-10-16T22:25:27Z) - $\mathscr{H}$-Consistency Estimation Error of Surrogate Loss Minimizers [38.56401704010528]
We present a detailed study of estimation errors in terms of surrogate loss estimation errors.
We refer to such guarantees as $\mathscr{H}$-consistency estimation error bounds.
arXiv Detail & Related papers (2022-05-16T23:13:36Z) - On Convergence of Training Loss Without Reaching Stationary Points [62.41370821014218]
We show that neural network weight variables do not converge to stationary points where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
arXiv Detail & Related papers (2021-10-12T18:12:23Z) - Rethinking and Reweighting the Univariate Losses for Multi-Label Ranking: Consistency and Generalization [44.73295800450414]
(Partial) ranking loss is a commonly used evaluation measure for multi-label classification.
There is a gap between existing theory and practice -- some pairwise losses can lead to promising performance but lack consistency.
arXiv Detail & Related papers (2021-05-10T09:23:27Z) - Calibration and Consistency of Adversarial Surrogate Losses [46.04004505351902]
Adversarial robustness is an increasingly critical property of classifiers in applications.
But which surrogate losses should be used and when do they benefit from theoretical guarantees?
We present an extensive study of this question, including a detailed analysis of the H-calibration and H-consistency of adversarial surrogate losses.
arXiv Detail & Related papers (2021-04-19T21:58:52Z) - Approximation Schemes for ReLU Regression [80.33702497406632]
We consider the fundamental problem of ReLU regression.
The goal is to output the best-fitting ReLU with respect to the square loss, given draws from some unknown distribution.
arXiv Detail & Related papers (2020-05-26T16:26:17Z) - Supervised Learning: No Loss No Cry [51.07683542418145]
Supervised learning requires the specification of a loss function to minimise.
This paper revisits the SLIsotron algorithm of Kakade et al. (2011) through a novel lens.
We show how it provides a principled procedure for learning the loss.
arXiv Detail & Related papers (2020-02-10T05:30:52Z)