Adaptive Sparse Softmax: An Effective and Efficient Softmax Variant
- URL: http://arxiv.org/abs/2508.03175v1
- Date: Tue, 05 Aug 2025 07:36:32 GMT
- Title: Adaptive Sparse Softmax: An Effective and Efficient Softmax Variant
- Authors: Qi Lv, Lei Geng, Ziqiang Cao, Min Cao, Sujian Li, Wenjie Li, Guohong Fu
- Abstract summary: Softmax is the standard configuration for current neural classification models. We propose the Adaptive Sparse softmax (AS-Softmax), which designs a reasonable and test-matching transformation on top of softmax. We verify the proposed AS-Softmax on a variety of text multi-class, text multi-label, text token classification, image classification and audio classification tasks with class sizes ranging from 5 to 5000+. The results show that AS-Softmax consistently outperforms softmax and its variants, and the loss of AS-Softmax is remarkably correlated with classification performance in validation.
- Score: 27.488444797784563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Softmax with the cross entropy loss is the standard configuration for current neural classification models. The gold score for a target class is supposed to be 1, but it is never reachable under the softmax scheme. This problem lets training continue indefinitely and leads to overfitting. Moreover, the "target-approach-1" training goal forces the model to keep learning from all samples, wasting time on samples that have already been classified correctly with high confidence, while the test goal simply requires the target class of each sample to hold the maximum score. To address these weaknesses, we propose the Adaptive Sparse softmax (AS-Softmax), which designs a reasonable and test-matching transformation on top of softmax. For more purposeful learning, we discard the classes whose scores are far smaller than the target class's score during training. The model can then focus on learning to distinguish the target class from its strong opponents, which is also the key challenge at test time. In addition, since the training losses of easy samples gradually drop to 0 in AS-Softmax, we develop an adaptive gradient accumulation strategy based on the masked sample ratio to speed up training. We verify the proposed AS-Softmax on a variety of text multi-class, text multi-label, text token classification, image classification and audio classification tasks with class sizes ranging from 5 to 5000+. The results show that AS-Softmax consistently outperforms softmax and its variants, and that the AS-Softmax loss is strongly correlated with classification performance in validation. Furthermore, the adaptive gradient accumulation strategy can bring about a 1.2x training speedup compared with standard softmax while maintaining classification effectiveness (an illustrative masking rule is sketched below).
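To make the idea concrete, here is a minimal PyTorch sketch of an AS-Softmax-style loss. The specific masking rule (drop classes whose softmax probability trails the target's by more than a margin `delta`) and the default `delta` are assumptions for illustration; the paper's exact transformation may differ.

```python
import torch
import torch.nn.functional as F

def as_softmax_loss(logits, targets, delta=0.25):
    """Illustrative AS-Softmax-style loss: classes whose softmax
    probability trails the target class by more than `delta` are
    masked out, so the model only learns to separate the target
    from its strong opponents. The rule and `delta` are assumptions."""
    probs = F.softmax(logits, dim=-1)                     # (B, C)
    target_probs = probs.gather(1, targets.unsqueeze(1))  # (B, 1)
    keep = (target_probs - probs) <= delta                # strong opponents
    keep |= F.one_hot(targets, logits.size(-1)).bool()    # always keep target
    masked = logits.masked_fill(~keep, float("-inf"))
    return F.cross_entropy(masked, targets)
```

Note that once every non-target class is masked, the loss is exactly 0 and the sample contributes no gradient; the paper's adaptive gradient accumulation strategy keys off the ratio of such masked samples and is not sketched here.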
Related papers
- SeWA: Selective Weight Average via Probabilistic Masking [51.015724517293236]
We show that only a few checkpoints are needed to achieve better and faster convergence. We transform the discrete selection problem into a continuous subset optimization framework. We derive SeWA's stability bounds, which are sharper than existing bounds under both convex and non-convex assumptions.
arXiv Detail & Related papers (2025-02-14T12:35:21Z)
- Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications [79.53938312089308]
The MIDX-Sampler is a novel adaptive sampling strategy based on an inverted multi-index approach. The method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds (a generic sampled-softmax correction is sketched below).
arXiv Detail & Related papers (2025-01-15T04:09:21Z)
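The inverted multi-index machinery is beyond a short excerpt, but the sampling-bias concern the MIDX-Sampler addresses can be illustrated with a generic sampled softmax: negatives are drawn from a proposal distribution q and the logits are corrected by subtracting log q. Everything below (the `logits_fn` callable, the explicit proposal tensor `q`) is an illustrative assumption, not the MIDX-Sampler itself.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(logits_fn, targets, q, num_samples=64):
    """Generic sampled softmax with a logQ correction. `logits_fn(idx)`
    returns logits for the class indices in `idx`; `q` is a (C,) proposal
    distribution over classes. MIDX's contribution is making q adaptive
    and cheap to sample from; that machinery is not shown."""
    batch = targets.size(0)
    neg = torch.multinomial(q.expand(batch, -1), num_samples, replacement=True)
    idx = torch.cat([targets.unsqueeze(1), neg], dim=1)   # target in column 0
    logits = logits_fn(idx)                               # (B, 1 + num_samples)
    corrected = logits - torch.log(q[idx])                # remove proposal bias
    labels = torch.zeros(batch, dtype=torch.long, device=targets.device)
    return F.cross_entropy(corrected, labels)
```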
- A two-head loss function for deep Average-K classification [8.189630642296416]
We propose a new loss function that adds a multi-label classification head to the classical softmax.
We show that this approach allows the model to better capture ambiguities between classes and, as a result, to return more consistent sets of possible classes (a toy two-head loss is sketched below).
arXiv Detail & Related papers (2023-03-31T15:04:53Z)
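One plausible reading of the two-head design is a shared encoder with a softmax head trained for top-1 and a sigmoid multi-label head trained over candidate sets. The sketch below uses a simple convex mix and the one-hot target as the multi-label target; both are illustrative placeholders, not the paper's exact recipe.

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadLoss(nn.Module):
    """Cross entropy on the softmax head plus binary cross entropy on a
    multi-label head; `alpha` balances the two. Placeholder design."""
    def __init__(self, alpha=0.5):
        super().__init__()
        self.alpha = alpha

    def forward(self, softmax_logits, multilabel_logits, targets):
        ce = F.cross_entropy(softmax_logits, targets)
        multi_targets = F.one_hot(targets, multilabel_logits.size(-1)).float()
        bce = F.binary_cross_entropy_with_logits(multilabel_logits, multi_targets)
        return self.alpha * ce + (1.0 - self.alpha) * bce
```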
- Spectral Aware Softmax for Visible-Infrared Person Re-Identification [123.69049942659285]
Visible-infrared person re-identification (VI-ReID) aims to match specific pedestrian images from different modalities.
Existing methods still follow the softmax loss training paradigm, which is widely used in single-modality classification tasks.
We propose the spectral-aware softmax (SA-Softmax) loss, which can fully explore the embedding space with the modality information.
arXiv Detail & Related papers (2023-02-03T02:57:18Z)
- NBC-Softmax: Darkweb Author Fingerprinting and Migration Tracking [1.1470070927586016]
Metric learning aims to learn distances from the data, which enhances the performance of similarity-based algorithms.
We propose NBC-Softmax, a contrastive-loss-based clustering technique built on the softmax loss.
Our technique meets the criterion for a larger number of samples, thus achieving block contrastiveness.
arXiv Detail & Related papers (2022-12-15T23:00:33Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions (the vanilla SMOTE interpolation it builds on is sketched below).
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
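AutoSMOTE's reinforcement-learning search over sampling decisions is out of scope for a snippet, but the primitive it builds on is the classic SMOTE interpolation: a synthetic minority sample is placed on the segment between a minority point and one of its k nearest minority neighbors. The NumPy sketch below is vanilla SMOTE under that textbook definition, not AutoSMOTE.

```python
import numpy as np

def smote(minority, n_synthetic, k=5, rng=None):
    """Vanilla SMOTE over a (n, d) minority-class matrix: interpolate
    between random minority points and their k nearest minority neighbors."""
    rng = rng or np.random.default_rng()
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                      # exclude self-distance
    neighbors = np.argsort(dists, axis=1)[:, :k]         # (n, k) nearest indices
    base = rng.integers(0, len(minority), n_synthetic)   # random seed points
    picked = neighbors[base, rng.integers(0, k, n_synthetic)]
    lam = rng.random((n_synthetic, 1))                   # interpolation weights
    return minority[base] + lam * (minority[picked] - minority[base])
```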
- Distinction Maximization Loss: Efficiently Improving Classification Accuracy, Uncertainty Estimation, and Out-of-Distribution Detection Simply Replacing the Loss and Calibrating [2.262407399039118]
We propose training deterministic deep neural networks using our DisMax loss.
DisMax usually outperforms all current approaches simultaneously in classification accuracy, uncertainty estimation, inference efficiency, and out-of-distribution detection.
arXiv Detail & Related papers (2022-05-12T04:37:35Z)
- Real Additive Margin Softmax for Speaker Verification [14.226089039985151]
We show that the AM-Softmax loss does not implement real max-margin training.
We present a Real AM-Softmax loss that involves a true margin function in the softmax training (the standard AM-Softmax it revisits is sketched below).
arXiv Detail & Related papers (2021-10-18T09:11:14Z)
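For reference, the standard additive-margin softmax that the paper critiques subtracts a fixed margin m from the target-class cosine similarity before scaling by s. The sketch below is that standard AM-Softmax; the paper's "real" margin function is not reproduced here since the excerpt does not specify it.

```python
import torch
import torch.nn.functional as F

def am_softmax_loss(embeddings, weight, targets, s=30.0, m=0.35):
    """Standard AM-Softmax: cosine logits, margin m subtracted from the
    target class only, then scaled by s. The Real AM-Softmax paper argues
    this additive shift is not true max-margin training."""
    cos = F.normalize(embeddings, dim=1) @ F.normalize(weight, dim=1).t()
    onehot = F.one_hot(targets, cos.size(-1)).float()
    logits = s * (cos - m * onehot)      # margin on the target logit only
    return F.cross_entropy(logits, targets)
```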
- Balanced Meta-Softmax for Long-Tailed Visual Recognition [46.215759445665434]
We show that the Softmax function, though used in most classification tasks, gives a biased gradient estimation under the long-tailed setup.
This paper presents Balanced Softmax, an elegant unbiased extension of Softmax, to accommodate the label distribution shift between training and testing.
In our experiments, we demonstrate that Balanced Meta-Softmax outperforms state-of-the-art long-tailed classification solutions on both visual recognition and instance segmentation tasks (the core logit adjustment is sketched below).
arXiv Detail & Related papers (2020-07-21T12:05:00Z)
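Balanced Softmax itself has a one-line form: shift the logits by the log class frequencies before the usual cross entropy, compensating for the label-distribution shift between long-tailed training and balanced testing. A minimal sketch (the meta-learning sample re-weighting of full Balanced Meta-Softmax is not shown):

```python
import torch
import torch.nn.functional as F

def balanced_softmax_loss(logits, targets, class_counts):
    """Balanced Softmax: adding log class priors to the logits removes the
    gradient bias induced by a long-tailed label distribution.
    `class_counts` is a (C,) tensor of per-class training counts."""
    adjusted = logits + torch.log(class_counts.float().clamp(min=1))
    return F.cross_entropy(adjusted, targets)
```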
- Taming GANs with Lookahead-Minmax [63.90038365274479]
Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient.
Using 30-fold fewer parameters and 16-fold smaller minibatches, we outperform the reported performance of the class-dependent BigGAN on CIFAR-10, obtaining an FID of 12.19 without using class labels (the Lookahead synchronization step is sketched below).
arXiv Detail & Related papers (2020-06-25T17:13:23Z)
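Lookahead keeps a slow copy of the weights and, every k fast optimizer steps, pulls it a fraction alpha toward the fast weights, then restarts the fast weights from the slow copy; Lookahead-minmax applies this jointly to both players. The wrapper below is a generic sketch of that synchronization step for a PyTorch module, not the authors' exact training loop.

```python
import torch

@torch.no_grad()
def lookahead_step(fast_model, slow_params, alpha=0.5):
    """One Lookahead synchronization: slow <- slow + alpha * (fast - slow),
    then fast is reset to the new slow point. Call every k inner steps,
    for the generator and the discriminator at the same time."""
    for p_fast, p_slow in zip(fast_model.parameters(), slow_params):
        p_slow.add_(p_fast - p_slow, alpha=alpha)   # move the slow weights
        p_fast.copy_(p_slow)                        # backtrack fast weights
```

Here `slow_params` would be initialized once as detached clones of the model's parameters, e.g. `[p.detach().clone() for p in model.parameters()]`.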