Adaptive Ranking-based Sample Selection for Weakly Supervised
Class-imbalanced Text Classification
- URL: http://arxiv.org/abs/2210.03092v2
- Date: Fri, 7 Oct 2022 15:58:48 GMT
- Title: Adaptive Ranking-based Sample Selection for Weakly Supervised
Class-imbalanced Text Classification
- Authors: Linxin Song, Jieyu Zhang, Tianxiang Yang, Masayuki Goto
- Abstract summary: We propose Adaptive Ranking-based Sample Selection (ARS2) to alleviate the data imbalance issue in the weak supervision (WS) paradigm.
ARS2 calculates a probabilistic margin score based on the output of the current model to measure and rank the cleanliness of each data point.
Experiments show that ARS2 outperformed state-of-the-art imbalanced learning and WS methods, yielding a 2%-57.8% improvement in F1-score.
- Score: 4.151073288078749
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To obtain a large amount of training labels inexpensively, researchers have
recently adopted the weak supervision (WS) paradigm, which leverages labeling
rules to synthesize training labels rather than using individual annotations to
achieve competitive results for natural language processing (NLP) tasks.
However, data imbalance is often overlooked in applying the WS paradigm,
despite being a common issue in a variety of NLP tasks. To address this
challenge, we propose Adaptive Ranking-based Sample Selection (ARS2), a
model-agnostic framework to alleviate the data imbalance issue in the WS
paradigm. Specifically, it calculates a probabilistic margin score based on the
output of the current model to measure and rank the cleanliness of each data
point. Then, the ranked data are sampled based on both class-wise and
rule-aware ranking. In particular, the two sampling strategies correspond to our
motivations: (1) to train the model with balanced data batches to reduce the
data imbalance issue and (2) to exploit the expertise of each labeling rule for
collecting clean samples. Experiments on four text classification datasets with
four different imbalance ratios show that ARS2 outperformed the
state-of-the-art imbalanced learning and WS methods, yielding a 2%-57.8%
improvement in F1-score.
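The ranking-and-sampling idea above can be sketched in a few lines. The sketch below is a hypothetical simplification, not the paper's exact formulation: `probabilistic_margin` scores each point by how far the model's probability for its weak label exceeds the strongest competing class, and `class_wise_sample` then takes the top-ranked points per class to form a balanced batch (the rule-aware sampling branch is omitted here).

```python
import numpy as np

def probabilistic_margin(probs, weak_labels):
    """Margin score per sample: P(weak label) minus the largest
    competing class probability. Higher means likely cleaner.
    probs: (n_samples, n_classes) model output; weak_labels: (n_samples,).
    """
    n = probs.shape[0]
    p_label = probs[np.arange(n), weak_labels]
    masked = probs.copy()
    masked[np.arange(n), weak_labels] = -np.inf  # exclude the weak label
    p_runner_up = masked.max(axis=1)
    return p_label - p_runner_up

def class_wise_sample(scores, weak_labels, num_classes, per_class):
    """Pick the `per_class` highest-scoring (cleanest) samples from
    each weakly labeled class, giving a class-balanced batch."""
    chosen = []
    for c in range(num_classes):
        idx = np.where(weak_labels == c)[0]
        ranked = idx[np.argsort(-scores[idx])]  # descending by score
        chosen.extend(ranked[:per_class].tolist())
    return chosen
```

A batch built this way contains the same number of samples per class regardless of how imbalanced the weak labels are, which matches motivation (1) in the abstract.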
Related papers
- Conformal-in-the-Loop for Learning with Imbalanced Noisy Data [5.69777817429044]
Class imbalance and label noise are pervasive in large-scale datasets.
Much of machine learning research assumes well-labeled, balanced data, which rarely reflects real world conditions.
We propose Conformal-in-the-Loop (CitL), a novel training framework that addresses both challenges with a conformal prediction-based approach.
arXiv Detail & Related papers (2024-11-04T17:09:58Z)
- Foster Adaptivity and Balance in Learning with Noisy Labels [26.309508654960354]
We propose a novel approach named SED to deal with label noise in a Self-adaptivE and class-balanceD manner.
A mean-teacher model is then employed to correct labels of noisy samples.
We additionally propose a self-adaptive and class-balanced sample re-weighting mechanism to assign different weights to detected noisy samples.
arXiv Detail & Related papers (2024-07-03T03:10:24Z)
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
- Learning with Imbalanced Noisy Data by Preventing Bias in Sample Selection [82.43311784594384]
Real-world datasets contain not only noisy labels but also class imbalance.
We propose a simple yet effective method to address noisy labels in imbalanced datasets.
arXiv Detail & Related papers (2024-02-17T10:34:53Z)
- Debiased Sample Selection for Combating Noisy Labels [24.296451733127956]
We propose a noIse-Tolerant Expert Model (ITEM) for debiased learning in sample selection.
Specifically, to mitigate the training bias, we design a robust network architecture that integrates with multiple experts.
By training on the mixture of two class-discriminative mini-batches, the model mitigates the effect of the imbalanced training set.
arXiv Detail & Related papers (2024-01-24T10:37:28Z)
- Twice Class Bias Correction for Imbalanced Semi-Supervised Learning [59.90429949214134]
We introduce a novel approach called Twice Class Bias Correction (TCBC).
We estimate the class bias of the model parameters during the training process.
We apply a secondary correction to the model's pseudo-labels for unlabeled samples.
arXiv Detail & Related papers (2023-12-27T15:06:36Z)
- Bias Amplification Enhances Minority Group Performance [10.380812738348899]
We propose BAM, a novel two-stage training algorithm.
In the first stage, the model is trained using a bias amplification scheme via introducing a learnable auxiliary variable for each training sample.
In the second stage, we upweight the samples that the bias-amplified model misclassifies, and then continue training the same model on the reweighted dataset.
arXiv Detail & Related papers (2023-09-13T04:40:08Z)
- CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
- Improving Contrastive Learning on Imbalanced Seed Data via Open-World Sampling [96.8742582581744]
We present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK).
MAK follows three simple principles: tailness, proximity, and diversity.
We demonstrate that MAK can consistently improve both the overall representation quality and the class balancedness of the learned features.
arXiv Detail & Related papers (2021-11-01T15:09:41Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.