On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks
- URL: http://arxiv.org/abs/2010.05103v1
- Date: Sat, 10 Oct 2020 21:56:27 GMT
- Title: On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks
- Authors: Stephen Mussmann, Robin Jia, Percy Liang
- Abstract summary: We show that state-of-the-art models trained on QQP and WikiQA each have only $2.4\%$ average precision when evaluated on realistically imbalanced test data.
By creating balanced training data with more informative negative examples, active learning greatly improves average precision to $32.5\%$ on QQP and $20.1\%$ on WikiQA.
- Score: 94.23884467360521
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many pairwise classification tasks, such as paraphrase detection and
open-domain question answering, naturally have extreme label imbalance (e.g.,
$99.99\%$ of examples are negatives). In contrast, many recent datasets
heuristically choose examples to ensure label balance. We show that these
heuristics lead to trained models that generalize poorly: State-of-the-art
models trained on QQP and WikiQA each have only $2.4\%$ average precision when
evaluated on realistically imbalanced test data. We instead collect training
data with active learning, using a BERT-based embedding model to efficiently
retrieve uncertain points from a very large pool of unlabeled utterance pairs.
By creating balanced training data with more informative negative examples,
active learning greatly improves average precision to $32.5\%$ on QQP and
$20.1\%$ on WikiQA.
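Below is a minimal sketch of the uncertainty-sampling step described in the abstract: rank unlabeled utterance pairs by how close the model's predicted match probability is to 0.5 and send the most uncertain pairs for labeling. The `embed_fn` and `score_fn` helpers are hypothetical stand-ins for the paper's BERT-based embedding model and pairwise classifier, not the authors' actual code, and the retrieval over the full pool is simplified to a single pass over a candidate list.
```python
# Minimal sketch of uncertainty-based pair retrieval (illustrative, not the
# authors' pipeline). embed_fn and score_fn are hypothetical stand-ins for a
# BERT-based embedding model and a pairwise classifier head.
import numpy as np

def retrieve_uncertain_pairs(candidates, embed_fn, score_fn, k=100):
    """Rank candidate utterance pairs by model uncertainty and return the
    k pairs closest to the decision boundary for labeling.

    candidates : list of (utterance_a, utterance_b) string pairs
    embed_fn   : maps a list of strings to an (n, d) array of embeddings
    score_fn   : maps two (n, d) embedding arrays to P(positive) in [0, 1]
    """
    a_emb = embed_fn([a for a, _ in candidates])   # (n, d)
    b_emb = embed_fn([b for _, b in candidates])   # (n, d)
    probs = score_fn(a_emb, b_emb)                 # (n,) predicted P(match)
    uncertainty = np.abs(probs - 0.5)              # small = near the boundary
    top = np.argsort(uncertainty)[:k]              # k most uncertain pairs
    return [candidates[i] for i in top]
```
In an active learning loop, the returned pairs would be labeled and added to the training set, after which the classifier is retrained and the retrieval repeated.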
Related papers
- Conformal-in-the-Loop for Learning with Imbalanced Noisy Data [5.69777817429044]
Class imbalance and label noise are pervasive in large-scale datasets.
Much of machine learning research assumes well-labeled, balanced data, which rarely reflects real-world conditions.
We propose Conformal-in-the-Loop (CitL), a novel training framework that addresses both challenges with a conformal prediction-based approach.
arXiv Detail & Related papers (2024-11-04T17:09:58Z)
- Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models [38.7352992942213]
We propose a novel approach named HINT to improve pre-trained code models with large-scale unlabeled datasets.
HINT includes two main modules: HybrId pseudo-labeled data selection and Noise-tolerant Training.
The experimental results show that HINT can better leverage unlabeled data in a task-specific way.
arXiv Detail & Related papers (2024-01-02T06:39:00Z) - Unsupervised Dense Retrieval with Relevance-Aware Contrastive
Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning.
We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks.
Our method can not only beat BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
arXiv Detail & Related papers (2023-06-05T18:20:27Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- Adaptive Ranking-based Sample Selection for Weakly Supervised Class-imbalanced Text Classification [4.151073288078749]
We propose Adaptive Ranking-based Sample Selection (ARS2) to alleviate the data imbalance issue in the weak supervision (WS) paradigm.
ARS2 calculates a probabilistic margin score from the current model's output to measure and rank the cleanliness of each data point (a generic sketch of such a margin score appears after this list).
Experiments show that ARS2 outperformed state-of-the-art imbalanced learning and WS methods, with a 2%-57.8% improvement in F1-score.
arXiv Detail & Related papers (2022-10-06T17:49:22Z)
- BASIL: Balanced Active Semi-supervised Learning for Class Imbalanced Datasets [14.739359755029353]
Current semi-supervised learning (SSL) methods assume a balance between the number of data points available for each class in both the labeled and the unlabeled data sets.
We propose BASIL, a novel algorithm that optimizes the submodular mutual information (SMI) functions in a per-class fashion to gradually select a balanced dataset in an active learning loop.
arXiv Detail & Related papers (2022-03-10T21:34:08Z)
- Improving Contrastive Learning on Imbalanced Seed Data via Open-World Sampling [96.8742582581744]
We present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK).
MAK follows three simple principles: tailness, proximity, and diversity.
We demonstrate that MAK can consistently improve both the overall representation quality and the class balancedness of the learned features.
arXiv Detail & Related papers (2021-11-01T15:09:41Z)
- Dash: Semi-Supervised Learning with Dynamic Thresholding [72.74339790209531]
We propose a semi-supervised learning (SSL) approach that uses unlabeled examples to train models.
Our proposed approach, Dash, is adaptive in how it selects unlabeled data.
arXiv Detail & Related papers (2021-09-01T23:52:29Z)
- Active learning for online training in imbalanced data streams under cold start [0.8155575318208631]
We propose an Active Learning (AL) annotation system for datasets whose class imbalance spans orders of magnitude.
We present a computationally efficient Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage sequence of AL labeling policies.
The results show that our method reaches a high-performance model more quickly than standard AL policies.
arXiv Detail & Related papers (2021-07-16T06:49:20Z)
- How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)
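As referenced in the ARS2 entry above, the following is a generic sketch of a probabilistic margin score, here taken as the gap between the two largest predicted class probabilities and used to rank points from most to least confident. This is an illustrative assumption, not necessarily the paper's exact formulation, and the function names are hypothetical.
```python
# Illustrative margin-score ranking in the spirit of ARS2; the paper's exact
# formulation may differ.
import numpy as np

def margin_scores(class_probs: np.ndarray) -> np.ndarray:
    """class_probs: (n_samples, n_classes) predicted probabilities.
    Returns the gap between the top-2 class probabilities per sample;
    larger gaps suggest cleaner, more confidently classified points."""
    sorted_probs = np.sort(class_probs, axis=1)    # ascending per row
    return sorted_probs[:, -1] - sorted_probs[:, -2]

def rank_by_cleanliness(class_probs: np.ndarray) -> np.ndarray:
    """Indices of samples ordered from most to least confident."""
    return np.argsort(-margin_scores(class_probs))
```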
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.