Extreme Classification via Adversarial Softmax Approximation
- URL: http://arxiv.org/abs/2002.06298v1
- Date: Sat, 15 Feb 2020 01:42:52 GMT
- Title: Extreme Classification via Adversarial Softmax Approximation
- Authors: Robert Bamler and Stephan Mandt
- Abstract summary: We propose a simple training method for drastically enhancing the gradient signal by drawing negative samples from an adversarial model.
Our contributions are three-fold: (i) an adversarial sampling mechanism that produces negative samples at a cost only logarithmic in $C$, thus still resulting in cheap gradient updates; (ii) a mathematical proof that this adversarial sampling minimizes the gradient variance while any bias due to non-uniform sampling can be removed; (iii) experimental results on large scale data sets that show a reduction of the training time by an order of magnitude relative to several competitive baselines.
- Score: 23.943134990807756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training a classifier over a large number of classes, known as 'extreme
classification', has become a topic of major interest with applications in
technology, science, and e-commerce. Traditional softmax regression induces a
gradient cost proportional to the number of classes $C$, which often is
prohibitively expensive. A popular scalable softmax approximation relies on
uniform negative sampling, which suffers from slow convergence due to a poor
signal-to-noise ratio. In this paper, we propose a simple training method for
drastically enhancing the gradient signal by drawing negative samples from an
adversarial model that mimics the data distribution. Our contributions are
three-fold: (i) an adversarial sampling mechanism that produces negative
samples at a cost only logarithmic in $C$, thus still resulting in cheap
gradient updates; (ii) a mathematical proof that this adversarial sampling
minimizes the gradient variance while any bias due to non-uniform sampling can
be removed; (iii) experimental results on large scale data sets that show a
reduction of the training time by an order of magnitude relative to several
competitive baselines.
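The recipe the abstract describes can be made concrete with a small amount of code: replace the full softmax over $C$ classes with a handful of negatives drawn from a non-uniform proposal, and cancel the resulting sampling bias with a log-proposal correction term, as in noise-contrastive estimation. The sketch below is a minimal illustration of that recipe, not the authors' implementation: the class name, the frequency-based proposal (a stand-in for the paper's learned adversarial model and its $O(\log C)$ sampler), and the hyperparameters are all assumptions.
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampledSoftmaxNCE(nn.Module):
    """NCE-style sampled softmax with a non-uniform negative proposal.

    Negatives are drawn from a proposal distribution q over the C classes
    (here a fixed, frequency-based stand-in for the paper's adversarial
    model), and every logit is corrected by log(K * q) so that the
    non-uniform sampling does not bias the classifier.
    """

    def __init__(self, num_classes, dim, class_freq, num_negatives=16):
        super().__init__()
        self.out_embed = nn.Embedding(num_classes, dim)
        self.num_negatives = num_negatives
        q = class_freq.float().clamp_min(1e-12)
        q = q / q.sum()
        self.register_buffer("q", q)
        self.register_buffer("log_q", q.log())

    def forward(self, hidden, target):
        """hidden: (B, D) input features; target: (B,) true class ids."""
        B, K = hidden.size(0), self.num_negatives
        # Draw K negative classes per example from the non-uniform proposal.
        neg = torch.multinomial(self.q, B * K, replacement=True).view(B, K)
        pos_score = (hidden * self.out_embed(target)).sum(-1)            # (B,)
        neg_score = torch.einsum("bd,bkd->bk", hidden, self.out_embed(neg))
        # Correction term: removes the bias introduced by sampling from q.
        pos_logit = pos_score - (self.log_q[target] + math.log(K))
        neg_logit = neg_score - (self.log_q[neg] + math.log(K))
        # Binary logistic loss: true class vs. K sampled negatives.
        return F.softplus(-pos_logit).mean() + F.softplus(neg_logit).mean()
```
In practice, class_freq would come from training-set counts and hidden would be the feature vector from whatever encoder precedes the classifier; the paper's contribution is to replace such a static proposal with an adversarial model that mimics the data distribution while keeping the per-sample cost logarithmic in $C$.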
Related papers
- Data Pruning via Moving-one-Sample-out [61.45441981346064]
We propose a novel data-pruning approach called moving-one-sample-out (MoSo).
MoSo aims to identify and remove the least informative samples from the training set.
Experimental results demonstrate that MoSo effectively mitigates severe performance degradation at high pruning ratios.
arXiv Detail & Related papers (2023-10-23T08:00:03Z)
- Boosting Adversarial Transferability by Achieving Flat Local Maxima [23.91315978193527]
Recently, various adversarial attacks have emerged to boost adversarial transferability from different perspectives.
In this work, we assume and empirically validate that adversarial examples at a flat local region tend to have good transferability.
We propose an approximation optimization method to simplify the gradient update of the objective function.
arXiv Detail & Related papers (2023-06-08T14:21:02Z)
- Compound Batch Normalization for Long-tailed Image Classification [77.42829178064807]
We propose a compound batch normalization method based on a Gaussian mixture.
It can model the feature space more comprehensively and reduce the dominance of head classes.
The proposed method outperforms existing methods on long-tailed image classification.
arXiv Detail & Related papers (2022-12-02T07:31:39Z)
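To make the idea of the entry above concrete, here is a generic "normalize against a Gaussian mixture" sketch: each sample is standardized by the responsibility-weighted statistics of K diagonal Gaussian components rather than by a single batch mean and variance. The function name, the shapes, and the way components would be estimated are assumptions; the paper's compound batch normalization layer differs in those details.
```python
import torch

def mixture_normalize(x, means, variances, eps=1e-5):
    """Normalize features against a Gaussian mixture instead of a single
    mean/variance. Generic illustration only, not the paper's layer.

    x: (B, D) features; means, variances: (K, D) diagonal Gaussian components.
    """
    diff = x[:, None, :] - means[None]                              # (B, K, D)
    # Log-likelihood of each sample under each diagonal Gaussian component.
    log_p = -0.5 * ((diff ** 2) / (variances[None] + eps)
                    + torch.log(variances[None] + eps)).sum(-1)     # (B, K)
    resp = torch.softmax(log_p, dim=1)                              # responsibilities
    # Standardize by the responsibility-weighted component statistics.
    mu = resp @ means                                               # (B, D)
    var = resp @ variances                                          # (B, D)
    return (x - mu) / torch.sqrt(var + eps)
```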
- Intra-class Adaptive Augmentation with Neighbor Correction for Deep Metric Learning [99.14132861655223]
We propose a novel intra-class adaptive augmentation (IAA) framework for deep metric learning.
We estimate intra-class variations for every class and generate adaptive synthetic samples to support hard-sample mining.
Our method outperforms state-of-the-art methods, improving retrieval performance by 3%-6%.
arXiv Detail & Related papers (2022-11-29T14:52:38Z)
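A minimal sketch of the adaptive-augmentation idea described above, assuming the simplest possible scheme: estimate a per-class standard deviation and jitter each example with noise scaled to it. The function and the fixed scale factor are illustrative; IAA's actual estimation and neighbor correction are more involved.
```python
import numpy as np

def intra_class_augment(features, labels, scale=0.5, rng=None):
    """Generate synthetic samples by jittering each example with noise scaled
    to its class's estimated intra-class standard deviation. Generic sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    synthetic = np.empty_like(features)
    for c in np.unique(labels):
        mask = labels == c
        # Per-dimension spread of this class, used to size the perturbation.
        std = features[mask].std(axis=0) + 1e-8
        noise = rng.normal(0.0, 1.0, size=features[mask].shape) * std * scale
        synthetic[mask] = features[mask] + noise
    return synthetic
```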
- SIMPLE: A Gradient Estimator for $k$-Subset Sampling [42.38652558807518]
In this work, we fall back to discrete $k$-subset sampling on the forward pass.
We show that our gradient estimator, SIMPLE, exhibits lower bias and variance compared to state-of-the-art estimators.
Empirical results show improved performance on learning to explain and sparse linear regression.
arXiv Detail & Related papers (2022-10-04T22:33:16Z)
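For orientation, the sketch below shows the common straight-through baseline for the problem this entry addresses: draw an exact $k$-hot sample on the forward pass (top-$k$ of Gumbel-perturbed logits) and route gradients through a softmax relaxation. This is explicitly not the SIMPLE estimator, whose lower-bias, lower-variance construction is given in the paper.
```python
import torch

def straight_through_topk(logits, k, tau=1.0):
    """Draw an exact k-hot subset on the forward pass, keep gradients backward.

    Forward: keep the top-k entries of Gumbel-perturbed logits (a standard way
    to sample a k-subset without replacement). Backward: gradients come from a
    softmax relaxation of the same perturbed logits. Generic straight-through
    baseline, not the SIMPLE estimator.
    """
    u = torch.rand_like(logits).clamp_min(1e-20)
    perturbed = logits - torch.log(-torch.log(u))             # add Gumbel noise
    topk = perturbed.topk(k, dim=-1).indices
    hard = torch.zeros_like(logits).scatter_(-1, topk, 1.0)   # exact k-hot sample
    soft = torch.softmax(perturbed / tau, dim=-1)             # relaxed surrogate
    # Forward value is `hard`; the gradient is that of `soft`.
    return hard + (soft - soft.detach())
```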
- An adjoint-free algorithm for conditional nonlinear optimal perturbations (CNOPs) via sampling [5.758073912084367]
We propose a sampling algorithm based on state-of-the-art statistical machine learning techniques to obtain conditional nonlinear optimal perturbations (CNOPs).
The sampling approach replaces gradient (adjoint) information with objective function values alone (zeroth-order information).
We demonstrate the obtained CNOPs in terms of their spatial patterns, objective values, computation times, and nonlinear error growth.
arXiv Detail & Related papers (2022-08-01T16:07:22Z)
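A generic illustration of the zeroth-order idea in the entry above, assuming a standard Gaussian-smoothing estimator rather than the paper's specific sampling algorithm: an approximate gradient is recovered from objective-function values alone, with no adjoint model.
```python
import numpy as np

def zeroth_order_gradient(f, x, num_samples=64, sigma=1e-2, rng=None):
    """Estimate grad f(x) from objective values only (zeroth-order information).

    Averages (f(x + sigma*u) - f(x)) / sigma * u over Gaussian perturbations u,
    the standard Gaussian-smoothing gradient estimator. Generic illustration,
    not the paper's algorithm.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x)
    f0 = f(x)
    for _ in range(num_samples):
        u = rng.normal(size=x.shape)
        grad += (f(x + sigma * u) - f0) / sigma * u
    return grad / num_samples
```
In the CNOP setting, such an estimate would then drive a projected gradient-ascent step on the nonlinear error-growth objective.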
- Multi-label Contrastive Predictive Coding [125.03510235962095]
Variational mutual information (MI) estimators are widely used in unsupervised representation learning methods such as contrastive predictive coding (CPC).
We introduce a novel estimator based on a multi-label classification problem, where the critic needs to jointly identify multiple positive samples at the same time.
We show that using the same amount of negative samples, multi-label CPC is able to exceed the $\log m$ bound, while still being a valid lower bound of mutual information.
arXiv Detail & Related papers (2020-07-20T02:46:21Z)
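To make the multi-positive setup concrete, here is a generic multi-positive contrastive loss in which the critic scores every candidate and all positives share the same pool of negatives. It is an illustrative stand-in, not the paper's estimator, and it says nothing about the $\log m$ bound discussed above.
```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(scores, pos_mask):
    """Contrastive loss in which each anchor has several positives.

    scores:   (B, M) critic scores for M candidate samples per anchor.
    pos_mask: (B, M) boolean mask marking the positive candidates.
    Every positive is contrasted against all M candidates via a log-softmax,
    and the per-anchor loss averages over its positives. Illustrative only.
    """
    pos_mask = pos_mask.float()
    log_prob = F.log_softmax(scores, dim=-1)
    pos_count = pos_mask.sum(dim=-1).clamp_min(1.0)
    return -((log_prob * pos_mask).sum(dim=-1) / pos_count).mean()
```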
- Minimal Variance Sampling with Provable Guarantees for Fast Training of Graph Neural Networks [22.618779809748435]
Existing sampling methods are mostly based on graph structural information and ignore the dynamics of optimization.
We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance.
We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and better generalization.
arXiv Detail & Related papers (2020-06-24T16:49:29Z)
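A generic sketch of gradient-based importance sampling, assuming the simplest unbiased scheme: sample nodes in proportion to an approximate per-node gradient norm and reweight by the inverse probability. The paper's decoupled variance-reduction strategy and its guarantees go beyond this.
```python
import numpy as np

def sample_minibatch(grad_norm_est, batch_size, rng=None):
    """Importance-sample a mini-batch with probability proportional to an
    approximate per-node gradient norm, returning the weights that keep the
    stochastic gradient unbiased. Generic illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = grad_norm_est / grad_norm_est.sum()
    idx = rng.choice(len(p), size=batch_size, replace=True, p=p)
    # With w_i = 1 / (N * p_i), E[w_i * g_i] equals the full average gradient.
    weights = 1.0 / (len(p) * p[idx])
    return idx, weights
```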
- Accelerated Convergence for Counterfactual Learning to Rank [65.63997193915257]
We show that the convergence rate of SGD approaches with IPS-weighted gradients suffers from the large variance introduced by the IPS weights.
We propose a novel learning algorithm, called CounterSample, that has provably better convergence than standard IPS-weighted gradient descent methods.
We prove that CounterSample converges faster and complement our theoretical findings with empirical results.
arXiv Detail & Related papers (2020-05-21T12:53:36Z)
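The summary above does not spell out the algorithm, so the sketch below shows one standard way to remove the variance that importance weights add: sample training examples in proportion to their IPS weights instead of multiplying gradients by them. This is an assumption about the flavor of the approach, not necessarily CounterSample itself.
```python
import numpy as np

def ips_proportional_batches(ips_weights, batch_size, rng=None):
    """Yield mini-batches whose examples are drawn in proportion to their IPS
    weights, so the weight no longer multiplies (and inflates the variance of)
    the per-example gradient. In expectation the update direction matches
    IPS-weighted SGD up to a constant scale. Illustrative sketch only.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = ips_weights / ips_weights.sum()
    while True:
        yield rng.choice(len(p), size=batch_size, replace=True, p=p)
```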
- Addressing Class-Imbalance Problem in Personalized Ranking [47.11372043636176]
We propose an efficient Vital Negative Sampler (VINS) to alleviate the class-imbalance issue for pairwise ranking models.
VINS is a biased sampler with a rejection probability that tends to accept negative candidates whose degree weight is larger than that of the given positive item.
arXiv Detail & Related papers (2020-05-19T08:11:26Z)
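A minimal rejection-sampling sketch of the acceptance rule described above: candidates at least as popular as the positive item are always accepted, others with probability proportional to their relative degree. The acceptance probability and names are illustrative, not the paper's exact choices.
```python
import random

def sample_vital_negative(degree, positive_item, candidate_items, max_tries=100):
    """Rejection-sample a negative item, favoring popular (high-degree) ones.

    A candidate whose degree is at least that of the positive item is always
    accepted; otherwise it is accepted with probability proportional to its
    relative degree. Illustrative acceptance rule only.
    """
    d_pos = max(degree[positive_item], 1)
    cand = random.choice(candidate_items)
    for _ in range(max_tries):
        if random.random() < min(1.0, degree[cand] / d_pos):
            return cand
        cand = random.choice(candidate_items)
    return cand  # give up after max_tries and return the last candidate
```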
- Compressing Large Sample Data for Discriminant Analysis [78.12073412066698]
We consider the computational issues due to large sample size within the discriminant analysis framework.
We propose a new compression approach for reducing the number of training samples for linear and quadratic discriminant analysis.
arXiv Detail & Related papers (2020-05-08T05:09:08Z)