Doubly-stochastic mining for heterogeneous retrieval
- URL: http://arxiv.org/abs/2004.10915v1
- Date: Thu, 23 Apr 2020 00:43:13 GMT
- Title: Doubly-stochastic mining for heterogeneous retrieval
- Authors: Ankit Singh Rawat, Aditya Krishna Menon, Andreas Veit, Felix Yu,
Sashank J. Reddi, Sanjiv Kumar
- Abstract summary: Modern retrieval problems are characterised by training sets with potentially billions of labels.
With a large number of labels, standard losses are difficult to optimise even on a single example.
We propose doubly-stochastic mining (S2M) to address both challenges.
- Score: 74.43785301907276
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Modern retrieval problems are characterised by training sets with potentially
billions of labels, and heterogeneous data distributions across subpopulations
(e.g., users of a retrieval system may be from different countries), each of
which poses a challenge. The first challenge concerns scalability: with a large
number of labels, standard losses are difficult to optimise even on a single
example. The second challenge concerns uniformity: one ideally wants good
performance on each subpopulation. While several solutions have been proposed
to address the first challenge, the second challenge has received relatively
less attention. In this paper, we propose doubly-stochastic mining (S2M), a
stochastic optimization technique that addresses both challenges. In each
iteration of S2M, we compute a per-example loss based on a subset of hardest
labels, and then compute the minibatch loss based on the hardest examples. We
show theoretically and empirically that by focusing on the hardest examples,
S2M ensures that all data subpopulations are modelled well.
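The two mining steps described in the abstract — a per-example loss over the hardest labels, then a minibatch loss over the hardest examples — can be sketched as follows. This is a hedged illustration only: the function name, top-k thresholds, and softmax cross-entropy formulation are assumptions, not taken from the paper's actual implementation.

```python
import numpy as np

def s2m_minibatch_loss(scores, labels, k_labels=5, k_examples=4):
    """Illustrative sketch of doubly-stochastic mining (S2M).

    scores: (batch, num_labels) raw logits; labels: (batch,) true label ids.
    Step 1: per-example loss over the hardest (highest-scoring) negative
            labels plus the true label, instead of the full label set.
    Step 2: average only over the hardest (highest-loss) examples.
    """
    batch, _ = scores.shape
    per_example = np.empty(batch)
    for i in range(batch):
        y = labels[i]
        negatives = np.delete(scores[i], y)            # drop the true label
        hardest = np.sort(negatives)[-k_labels:]       # top-k negative logits
        logits = np.concatenate(([scores[i, y]], hardest))
        logits = logits - logits.max()                 # numerical stability
        # cross-entropy with the true label at index 0
        per_example[i] = -logits[0] + np.log(np.exp(logits).sum())
    # keep only the hardest examples in the minibatch
    return np.sort(per_example)[-k_examples:].mean()
```

Restricting both sums (over labels, then over examples) is what makes the procedure "doubly" stochastic: each iteration touches only a small hard subset of an otherwise enormous label space.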
Related papers
- A Survey on Small Sample Imbalance Problem: Metrics, Feature Analysis, and Solutions [41.77642958758829]
The small sample imbalance (S&I) problem is a major challenge in machine learning and data analysis.
Existing methods often rely on algorithmic fixes without sufficiently analyzing the underlying data characteristics.
We argue that a detailed analysis from the data perspective is essential before developing an appropriate solution.
arXiv Detail & Related papers (2025-04-21T01:58:29Z) - SP$^2$OT: Semantic-Regularized Progressive Partial Optimal Transport for Imbalanced Clustering [14.880015659013681]
We introduce a novel optimal transport-based pseudo-label learning framework.
Our framework formulates pseudo-label generation as a Semantic-regularized Progressive Partial Optimal Transport problem.
We employ the strategy of majorization to reformulate the SP$^2$OT problem into a Progressive Partial Optimal Transport problem.
arXiv Detail & Related papers (2024-04-04T13:46:52Z) - P$^2$OT: Progressive Partial Optimal Transport for Deep Imbalanced
Clustering [16.723646401890495]
We propose a novel pseudo-labeling-based learning framework for deep clustering.
Our framework generates imbalance-aware pseudo-labels and learns from high-confidence samples.
Experiments on various datasets, including a human-curated long-tailed CIFAR100, demonstrate the superiority of our method.
arXiv Detail & Related papers (2024-01-17T15:15:46Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
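A single mixing step of the kind described above might look like the following sketch. It is hypothetical: the beta-distributed mixing weight and the bias toward the minority sample are illustrative assumptions, not details from the paper.

```python
import numpy as np

def mix_synthetic(x_minor, x_major, alpha=0.8):
    """Create one synthetic sample by convexly mixing a minority-class
    sample with a majority-class sample, biased toward the minority side."""
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1 - lam)  # keep the minority sample dominant
    return lam * x_minor + (1 - lam) * x_major
```

Because the result is a convex combination, each synthetic feature stays within the range spanned by the two parent samples, which keeps the generated minority samples plausible.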
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Recovering Top-Two Answers and Confusion Probability in Multi-Choice
Crowdsourcing [10.508187462682308]
We consider crowdsourcing tasks with the goal of recovering not only the ground truth, but also the most confusing answer and the confusion probability.
We propose a model in which there are the top two plausible answers for each task, distinguished from the rest of the choices.
Under this model, we propose a two-stage inference algorithm to infer both the top two answers and the confusion probability.
arXiv Detail & Related papers (2022-12-29T09:46:39Z) - Mutual Exclusivity Training and Primitive Augmentation to Induce
Compositionality [84.94877848357896]
Recent datasets expose the lack of the systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z) - Characterizing Datapoints via Second-Split Forgetting [93.99363547536392]
We propose second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten.
We demonstrate that mislabeled examples are forgotten quickly, while seemingly rare examples are forgotten comparatively slowly.
SSFT can (i) help to identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes.
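The forgetting-time bookkeeping behind SSFT can be illustrated with a small helper. This is a hypothetical sketch: the `correct_history` input and the handling of never-forgotten examples are assumptions, not the paper's exact definition.

```python
def second_split_forgetting_time(correct_history):
    """correct_history: per-epoch booleans recording whether an example is
    still classified correctly while training continues on a second split.
    Returns the epoch after which the example stays misclassified
    (i.e. is forgotten), or None if it is never learned or never forgotten."""
    last_correct = None
    for epoch, correct in enumerate(correct_history):
        if correct:
            last_correct = epoch
    if last_correct is None or last_correct == len(correct_history) - 1:
        return None
    return last_correct + 1
```

Under this toy definition, an example that flips to incorrect early (small SSFT) would be flagged as potentially mislabeled, while one forgotten late would be treated as rare but genuine.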
arXiv Detail & Related papers (2022-10-26T21:03:46Z) - HyP$^2$ Loss: Beyond Hypersphere Metric Space for Multi-label Image
Retrieval [20.53316810731414]
We propose a novel metric learning framework with Hybrid Proxy-Pair Loss (HyP$^2$ Loss).
The proposed HyP$^2$ Loss focuses on optimizing the hypersphere space by learnable proxies and excavating data-to-data correlations of irrelevant pairs.
arXiv Detail & Related papers (2022-08-14T15:06:27Z) - Two-Stage Stochastic Optimization via Primal-Dual Decomposition and Deep
Unrolling [86.85697555068168]
Two-stage stochastic optimization plays a critical role in various engineering and scientific applications.
Efficient algorithms are still lacking, especially when the long-term and short-term variables are coupled in the constraints.
We show that PDD-SSCA can achieve superior performance over existing solutions.
arXiv Detail & Related papers (2021-05-05T03:36:00Z) - SuctionNet-1Billion: A Large-Scale Benchmark for Suction Grasping [47.221326169627666]
We propose a new physical model to analytically evaluate seal formation and wrench resistance of a suction grasping.
A two-step methodology is adopted to generate annotations on a large-scale dataset collected in real-world cluttered scenarios.
A standard online evaluation system is proposed to evaluate suction poses in continuous operation space.
arXiv Detail & Related papers (2021-03-23T05:02:52Z) - The Simulator: Understanding Adaptive Sampling in the
Moderate-Confidence Regime [52.38455827779212]
We propose a novel technique for analyzing adaptive sampling, called the Simulator.
We prove the first instance-based lower bounds for the top-k problem that incorporate the appropriate log-factors.
Our new analysis inspires a simple and near-optimal algorithm for best-arm and top-k identification, the first practical algorithm of its kind for the latter problem.
arXiv Detail & Related papers (2017-02-16T23:42:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.