Evaluation of Neural Architectures Trained with Square Loss vs
Cross-Entropy in Classification Tasks
- URL: http://arxiv.org/abs/2006.07322v5
- Date: Sat, 23 Oct 2021 00:36:12 GMT
- Title: Evaluation of Neural Architectures Trained with Square Loss vs
Cross-Entropy in Classification Tasks
- Authors: Like Hui and Mikhail Belkin
- Abstract summary: Cross-entropy loss is widely believed to be empirically superior to the square loss for classification tasks.
We show that major neural architectures perform comparably or better when trained with the square loss.
Cross-entropy appears to have a slight edge on computer vision tasks.
- Score: 23.538629997497747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern neural architectures for classification tasks are trained using the
cross-entropy loss, which is widely believed to be empirically superior to the
square loss. In this work we provide evidence indicating that this belief may
not be well-founded. We explore several major neural architectures and a range
of standard benchmark datasets for NLP, automatic speech recognition (ASR) and
computer vision tasks to show that these architectures, with the same
hyper-parameter settings as reported in the literature, perform comparably or
better when trained with the square loss, even after equalizing computational
resources. Indeed, we observe that the square loss produces better results in
the vast majority of NLP and ASR experiments. Cross-entropy appears to have
a slight edge on computer vision tasks.
We argue that there is little compelling empirical or theoretical evidence
indicating a clear-cut advantage to the cross-entropy loss. Indeed, in our
experiments, performance on nearly all non-vision tasks can be improved,
sometimes significantly, by switching to the square loss. Furthermore, training
with square loss appears to be less sensitive to the randomness in
initialization. We posit that training using the square loss for classification
needs to be a part of best practices of modern deep learning on equal footing
with cross-entropy.
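To make the comparison concrete, here is a minimal PyTorch sketch of the swap the abstract describes: replace cross-entropy with a square loss on one-hot targets and leave the rest of the training loop untouched. The rescaling parameters k and M follow the form the paper describes for tasks with many classes; their values here, and the surrounding training details, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def square_loss(logits, targets, k=1.0, M=1.0):
    """Square loss against one-hot targets with the rescaling parameters
    k and M described in the paper; k = M = 1 recovers plain mean
    squared error on the one-hot encoding."""
    num_classes = logits.shape[1]
    one_hot = F.one_hot(targets, num_classes).float()
    # The true-class target is scaled to M and its error weighted by k;
    # incorrect-class logits are regressed toward zero.
    sq = (logits - M * one_hot) ** 2
    weights = 1.0 + (k - 1.0) * one_hot
    return (weights * sq).sum(dim=1).mean() / num_classes

# Drop-in replacement inside a standard training step:
# loss = F.cross_entropy(logits, targets)   # baseline
# loss = square_loss(logits, targets)       # square-loss variant
```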
Related papers
- Deep Loss Convexification for Learning Iterative Models [11.36644967267829]
Iterative methods such as iterative closest point (ICP) for point cloud registration often get trapped in poor local optima.
We propose learning to shape a convex loss landscape around each ground-truth solution.
arXiv Detail & Related papers (2024-11-16T01:13:04Z)
- Towards Robust Out-of-Distribution Generalization: Data Augmentation and Neural Architecture Search Approaches [4.577842191730992]
We study approaches to robust out-of-distribution (OoD) generalization for deep learning.
We first propose a novel and effective approach to disentangle spurious correlations involving features that are not essential for recognition.
We then study the problem of strengthening neural architecture search in OoD scenarios.
arXiv Detail & Related papers (2024-10-25T20:50:32Z)
- Cut your Losses with Squentropy [19.924900110707284]
We propose the "squentropy" loss, which is the sum of two terms: the cross-entropy loss and the average square loss over the incorrect classes.
We show that the squentropy loss outperforms both the pure cross-entropy loss and the rescaled square loss in classification accuracy.
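From the two-term description above, a minimal PyTorch sketch of squentropy; the zero target for the incorrect-class logits and the averaging over the C - 1 incorrect classes follow that description, while everything else is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def squentropy(logits, targets):
    """Squentropy: cross-entropy plus the average squared logit over
    the incorrect classes (pushing those logits toward zero)."""
    ce = F.cross_entropy(logits, targets)
    num_classes = logits.shape[1]
    wrong = 1.0 - F.one_hot(targets, num_classes).float()  # 1 on incorrect classes
    sq_wrong = (wrong * logits ** 2).sum(dim=1) / (num_classes - 1)
    return ce + sq_wrong.mean()
```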
arXiv Detail & Related papers (2023-02-08T09:21:13Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling, perceptually relevant representations can be obtained with self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- Do We Really Need a Learnable Classifier at the End of Deep Neural Network? [118.18554882199676]
We study the potential of learning a neural network for classification with the classifier randomly initialized as an equiangular tight frame (ETF) and fixed during training.
Our experimental results show that our method achieves similar performance on image classification for balanced datasets.
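For concreteness, a hedged sketch of the fixed-classifier idea, using the standard simplex equiangular tight frame construction from the neural-collapse literature; the dimensions and freezing mechanics below are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def random_simplex_etf(feat_dim, num_classes):
    """num_classes unit vectors in feat_dim dimensions with equal pairwise
    angles (inner product -1/(C-1)); requires feat_dim >= num_classes."""
    # Random orthonormal columns via a QR decomposition.
    q, _ = torch.linalg.qr(torch.randn(feat_dim, num_classes))
    center = torch.eye(num_classes) - torch.ones(num_classes, num_classes) / num_classes
    return q @ center * (num_classes / (num_classes - 1)) ** 0.5

# Use as a frozen linear head in place of a learnable classifier:
classifier = torch.nn.Linear(512, 10, bias=False)
classifier.weight.data = random_simplex_etf(512, 10).T
classifier.weight.requires_grad_(False)
```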
arXiv Detail & Related papers (2022-03-17T04:34:28Z)
- Understanding Square Loss in Training Overparametrized Neural Network Classifiers [31.319145959402462]
We contribute to the theoretical understanding of square loss in classification by systematically investigating how it performs for overparametrized neural networks.
We consider two cases, according to whether the classes are separable or not. In the general non-separable case, a fast convergence rate is established for both the misclassification rate and the calibration error.
The resulting margin is proven to be lower bounded away from zero, providing theoretical guarantees for robustness.
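Calibration error is a natural metric here because, with one-hot labels, the population minimizer of the square loss at an input x is the vector of conditional class probabilities, so the raw outputs can be read as probability estimates. Below is a minimal sketch of expected calibration error under that reading; the clamping and renormalization steps are our assumptions, not details from the paper.

```python
import torch

def ece_from_square_loss_outputs(outputs, labels, n_bins=10):
    """Expected calibration error, treating square-loss outputs as
    approximate conditional class probabilities."""
    probs = outputs.clamp(0.0, 1.0)
    probs = probs / probs.sum(dim=1, keepdim=True).clamp_min(1e-8)
    conf, pred = probs.max(dim=1)
    correct = (pred == labels).float()
    ece = torch.zeros(())
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # Gap between accuracy and mean confidence, weighted by bin mass.
            ece += in_bin.float().mean() * (correct[in_bin].mean() - conf[in_bin].mean()).abs()
    return ece
```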
arXiv Detail & Related papers (2021-12-07T12:12:30Z)
- Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks [79.74580058178594]
We analyze the performance of training a pruned neural network by studying the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
arXiv Detail & Related papers (2021-10-12T01:11:07Z)
- Mixing between the Cross Entropy and the Expectation Loss Terms [89.30385901335323]
Cross-entropy loss tends to focus on hard-to-classify samples during training.
We show that adding the expectation loss to the optimization objective helps the network achieve better accuracy.
Our experiments show that the new training protocol improves performance across a diverse set of classification domains.
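A minimal sketch of the mixed objective as summarized above, taking the expectation loss to be the expected 0-1 error under the softmax distribution (1 minus the probability assigned to the correct class); the mixing weights, and any schedule over them during training, are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def mixed_ce_expectation_loss(logits, targets, alpha=1.0, beta=1.0):
    """Cross-entropy mixed with the expectation loss 1 - p_correct."""
    ce = F.cross_entropy(logits, targets)
    p_correct = F.softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    expectation = (1.0 - p_correct).mean()
    return alpha * ce + beta * expectation
```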
arXiv Detail & Related papers (2021-09-12T23:14:06Z)
- Improving Music Performance Assessment with Contrastive Learning [78.8942067357231]
This study investigates contrastive learning as a potential method to improve existing music performance assessment (MPA) systems.
We introduce a weighted contrastive loss suitable for regression tasks, applied to a convolutional neural network.
Our results show that contrastive methods can match or exceed state-of-the-art performance on MPA regression tasks.
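As one plausible reading of a weighted contrastive loss for regression targets, the sketch below weights each pair by how close their performance scores are: similar pairs are pulled together, dissimilar pairs pushed apart beyond a margin. The exponential weighting, margin form, and hyper-parameters are our own assumptions, not the paper's exact formulation.

```python
import torch

def weighted_contrastive_loss(embeddings, scores, margin=1.0, tau=1.0):
    """Pairwise contrastive loss whose weights decay with label distance."""
    d = torch.cdist(embeddings, embeddings)          # pairwise embedding distances
    label_gap = (scores.unsqueeze(0) - scores.unsqueeze(1)).abs()
    w = torch.exp(-label_gap / tau)                  # ~1 when scores are similar
    pull = w * d ** 2                                # attract similar pairs
    push = (1.0 - w) * torch.relu(margin - d) ** 2   # repel dissimilar pairs
    return (pull + push).mean()
```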
arXiv Detail & Related papers (2021-08-03T19:24:25Z)
- Disp R-CNN: Stereo 3D Object Detection via Shape Prior Guided Instance Disparity Estimation [51.17232267143098]
We propose a novel system named Disp R-CNN for 3D object detection from stereo images.
We use a statistical shape model to generate dense disparity pseudo-ground-truth without the need for LiDAR point clouds.
Experiments on the KITTI dataset show that, even when LiDAR ground-truth is not available at training time, Disp R-CNN achieves competitive performance and outperforms previous state-of-the-art methods by 20% in terms of average precision.
arXiv Detail & Related papers (2020-04-07T17:48:45Z)
- Avoiding Spurious Local Minima in Deep Quadratic Networks [0.0]
We characterize the landscape of the mean squared error for networks with nonlinear activation functions.
We prove that deep over-parameterized neural networks with quadratic activations benefit from similar landscape properties.
arXiv Detail & Related papers (2019-12-31T22:31:11Z)