Active learning for online training in imbalanced data streams under
cold start
- URL: http://arxiv.org/abs/2107.07724v1
- Date: Fri, 16 Jul 2021 06:49:20 GMT
- Title: Active learning for online training in imbalanced data streams under
cold start
- Authors: Ricardo Barata, Miguel Leite, Ricardo Pacheco, Marco O. P. Sampaio,
João Tiago Ascensão, Pedro Bizarro
- Abstract summary: We propose an Active Learning (AL) annotation system for datasets with orders of magnitude of class imbalance.
We present a computationally efficient Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage sequence of AL labeling policies.
The results show that our method reaches a high-performance model more quickly than standard AL policies.
- Score: 0.8155575318208631
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Labeled data is essential in modern systems that rely on Machine Learning
(ML) for predictive modelling. Such systems may suffer from the cold-start
problem: supervised models work well but, initially, there are no labels, which
are costly or slow to obtain. This problem is even worse in imbalanced data
scenarios. Online financial fraud detection is an example where labeling is i)
expensive or ii) subject to long delays if it relies on victims filing
complaints. The latter may not be viable if a model has to be in place
immediately, so an option is to ask analysts to label events while minimizing
the number of annotations to control costs. We propose an Active Learning (AL)
annotation system for datasets with orders of magnitude of class imbalance, in
a cold start streaming scenario. We present a computationally efficient
Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage
sequence of AL labeling policies where it is used as warm-up. Then, we perform
empirical studies on four real-world datasets with various magnitudes of class
imbalance. The results show that our method can reach a high-performance model
more quickly than standard AL policies. Its observed gains over random sampling
can reach 80%, and it can be competitive with policies that have an unlimited
annotation budget or additional historical data (while using 1/10 to 1/50 of
the labels).
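
Since the abstract only names the 3-stage sequence, a minimal sketch may help make it concrete. Everything below is an illustrative assumption rather than the authors' implementation: the stage sizes, the Mahalanobis distance standing in for the ODAL outlier score, the uncertainty band, and all identifiers (ThreeStageALPolicy, observe, add_label) are hypothetical.

```python
# Hypothetical sketch of a 3-stage cold-start labeling policy:
# random sampling -> outlier-based warm-up -> uncertainty sampling.
from collections import deque
import numpy as np
from sklearn.linear_model import LogisticRegression

class ThreeStageALPolicy:
    def __init__(self, n_random=100, n_warmup=500, budget=2000, seed=0):
        self.n_random, self.n_warmup, self.budget = n_random, n_warmup, budget
        self.rng = np.random.default_rng(seed)
        self.X, self.y = [], []                    # labeled pool
        self.recent_scores = deque(maxlen=1000)    # for an adaptive threshold
        self.model = LogisticRegression(max_iter=1000)
        self._fitted = False

    def _outlier_score(self, x):
        # Cheap stand-in for ODAL: Mahalanobis distance from the labeled
        # pool, favouring events unlike anything labeled so far.
        X = np.asarray(self.X)
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        d = np.asarray(x) - mu
        return float(d @ np.linalg.solve(cov, d))

    def observe(self, x):
        """Return True if event x should be sent to an analyst for a label."""
        n = len(self.y)
        if n >= self.budget:
            return False
        if n < self.n_random:                      # stage 1: random sampling
            return bool(self.rng.random() < 0.5)
        if n < self.n_warmup or not self._fitted:  # stage 2: outlier warm-up
            s = self._outlier_score(x)
            self.recent_scores.append(s)
            return s >= np.quantile(self.recent_scores, 0.95)
        p = self.model.predict_proba([x])[0, 1]    # stage 3: uncertainty
        return abs(p - 0.5) < 0.1

    def add_label(self, x, label):
        self.X.append(np.asarray(x))
        self.y.append(label)
        if len(set(self.y)) == 2:                  # need both classes to fit
            self.model.fit(np.asarray(self.X), np.asarray(self.y))
            self._fitted = True
```

Usage would be a loop over stream events: call observe(x) for each arriving event, send it to an analyst when it returns True, then feed the returned label back with add_label(x, y).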
Related papers
- Learning with Imbalanced Noisy Data by Preventing Bias in Sample
Selection [82.43311784594384]
Real-world datasets contain not only noisy labels but also class imbalance.
We propose a simple yet effective method to address noisy labels in imbalanced datasets.
arXiv Detail & Related papers (2024-02-17T10:34:53Z)
- Bridging the Gap: Learning Pace Synchronization for Open-World Semi-Supervised Learning [44.91863420044712]
In open-world semi-supervised learning, a machine learning model is tasked with uncovering novel categories from unlabeled data.
We introduce 1) the adaptive synchronizing marginal loss which imposes class-specific negative margins to alleviate the model bias towards seen classes, and 2) the pseudo-label contrastive clustering which exploits pseudo-labels predicted by the model to group unlabeled data from the same category together.
Our method balances the learning pace between seen and novel classes, achieving a remarkable 3% average accuracy increase on the ImageNet dataset.
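
As a rough illustration of what a class-specific margin can look like in code, the sketch below subtracts a per-class margin from the logits before cross-entropy (equivalently, adding negative margins to seen classes); the margin values and the name margin_cross_entropy are assumptions, not the paper's adaptive synchronizing marginal loss.

```python
import torch
import torch.nn.functional as F

def margin_cross_entropy(logits, targets, margins):
    """logits: (B, C); targets: (B,); margins: (C,) per-class margins.
    Subtracting a larger margin from a class's logits makes that class
    harder to predict, counteracting bias toward well-labeled seen classes."""
    return F.cross_entropy(logits - margins.unsqueeze(0), targets)

# Example: penalize 5 seen classes, leave 5 novel classes untouched.
margins = torch.cat([torch.full((5,), 0.5), torch.zeros(5)])
loss = margin_cross_entropy(torch.randn(8, 10), torch.randint(0, 10, (8,)), margins)
```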
arXiv Detail & Related papers (2023-09-21T09:44:39Z)
- Online Label Shift: Optimal Dynamic Regret meets Practical Algorithms [33.61487362513345]
This paper focuses on supervised and unsupervised online label shift, where the class marginals $Q(y)$ vary but the class-conditionals $Q(x|y)$ remain invariant.
In the unsupervised setting, our goal is to adapt a learner, trained on some offline labeled data, to changing label distributions given unlabeled online data.
We develop novel algorithms that reduce the adaptation problem to online regression and guarantee optimal dynamic regret without any prior knowledge of the extent of drift in the label distribution.
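
For background, the standard post-hoc label-shift correction underlying this setting is compact enough to show: re-weight a fixed model's posterior by the ratio of an estimated current marginal to the training marginal. The exponential-moving-average estimator below is a naive stand-in for the paper's regret-optimal online algorithm, and all names are illustrative.

```python
import numpy as np

def reweight(probs, q0, qt, eps=1e-12):
    """probs: (C,) model posterior for one sample; q0, qt: source and
    current class marginals. Returns the label-shift-corrected posterior."""
    p = probs * (qt + eps) / (q0 + eps)
    return p / p.sum()

def update_prior(qt, probs, lr=0.01):
    # Naive online estimate of the drifting marginal Q(y):
    # exponential moving average of the model's corrected outputs.
    return (1 - lr) * qt + lr * probs

q0 = np.array([0.9, 0.1])        # class marginal seen during training
qt = q0.copy()
for probs in (np.array([0.6, 0.4]), np.array([0.3, 0.7])):  # streamed posteriors
    adjusted = reweight(probs, q0, qt)
    qt = update_prior(qt, adjusted)
```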
arXiv Detail & Related papers (2023-05-31T05:39:52Z)
- Label-Retrieval-Augmented Diffusion Models for Learning from Noisy
Labels [61.97359362447732]
Learning from noisy labels is an important and long-standing problem in machine learning for real applications.
In this paper, we reformulate the label-noise problem from a generative-model perspective.
Our model achieves new state-of-the-art (SOTA) results on all the standard real-world benchmark datasets.
arXiv Detail & Related papers (2023-05-31T03:01:36Z)
- CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep
Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
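
A minimal sketch of the general shape of such a meta-model, assuming a loss-to-weight MLP in the style of Meta-Weight-Net: each sample's training loss is mapped to a weight in [0, 1]. The bilevel meta-update on clean validation data, which is where methods like this actually differ, is omitted, and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class WeightNet(nn.Module):
    """Maps each sample's training loss to a weight in [0, 1]."""
    def __init__(self, hidden=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, per_sample_loss):            # (B,) -> (B,)
        return self.net(per_sample_loss.unsqueeze(1)).squeeze(1)

per_sample_loss = torch.rand(32)                   # e.g. CE with reduction='none'
weights = WeightNet()(per_sample_loss.detach())    # weights depend on loss only
weighted_loss = (weights * per_sample_loss).mean()
```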
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
- How to Leverage Unlabeled Data in Offline Reinforcement Learning [125.72601809192365]
Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, it requires reward annotations for every transition.
One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data.
We find that, perhaps surprisingly, a much simpler method that simply applies zero rewards to unlabeled data leads to effective data sharing.
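
The zero-reward recipe in this summary is simple enough to show directly: relabel the unlabeled transitions with reward 0 and pool them with the labeled data before running any offline RL algorithm. The transition field names below are illustrative.

```python
def merge_with_zero_rewards(labeled, unlabeled):
    """labeled: transitions with a 'r' (reward) key; unlabeled: transitions
    without one. Pools both, treating unlabeled transitions as zero-reward."""
    return labeled + [{**t, "r": 0.0} for t in unlabeled]
```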
arXiv Detail & Related papers (2022-02-03T18:04:54Z)
- Improving Contrastive Learning on Imbalanced Seed Data via Open-World
Sampling [96.8742582581744]
We present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK).
MAK follows three simple principles: tailness, proximity, and diversity.
We demonstrate that MAK can consistently improve both the overall representation quality and the class balancedness of the learned features.
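
The diversity principle maps onto classic greedy K-center (core-set) selection, sketched below over embedding vectors; the tailness and proximity scoring stages that make MAK model-aware are omitted, and all names are assumptions.

```python
import numpy as np

def k_center_greedy(pool, seed, k):
    """pool: (N, d) candidate embeddings; seed: (M, d) embeddings of data
    already kept. Greedily picks the candidate farthest from everything
    selected so far, promoting diversity."""
    dist = np.linalg.norm(pool[:, None] - seed[None], axis=2).min(axis=1)
    chosen = []
    for _ in range(k):
        i = int(dist.argmax())
        chosen.append(i)
        dist = np.minimum(dist, np.linalg.norm(pool - pool[i], axis=1))
    return chosen
```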
arXiv Detail & Related papers (2021-11-01T15:09:41Z)
- Online Fairness-Aware Learning with Imbalanced Data Streams [9.481178205985396]
We propose an online fairness-aware approach that maintains a valid and fair classifier over the stream.
It is an online boosting approach that changes the training distribution in an online fashion by monitoring the stream's class imbalance.
Experiments on 8 real-world and 1 synthetic datasets demonstrate the superiority of our method over state-of-the-art fairness-aware stream approaches.
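
One generic way to "change the training distribution in an online fashion by monitoring the stream's class imbalance", as the summary puts it, is to weight each arriving example by the inverse of its class's running frequency; the sketch below is that heuristic, not the paper's fairness-aware boosting algorithm.

```python
from collections import Counter

counts = Counter()

def instance_weight(label):
    """Inverse running class frequency for the newest stream example."""
    counts[label] += 1
    total = sum(counts.values())
    return total / (len(counts) * counts[label])
```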
arXiv Detail & Related papers (2021-08-13T13:31:42Z)
- Identifying Wrongly Predicted Samples: A Method for Active Learning [6.976600214375139]
We propose a simple sample selection criterion that moves beyond uncertainty.
We show state-of-the-art results and better rates at identifying wrongly predicted samples.
arXiv Detail & Related papers (2020-10-14T09:00:42Z)
- On the Importance of Adaptive Data Collection for Extremely Imbalanced
Pairwise Tasks [94.23884467360521]
We show that state-of-the-art models trained on QQP and WikiQA each have only 2.4% average precision when evaluated on realistically imbalanced test data.
By creating balanced training data with more informative negative examples, active learning greatly improves average precision to 32.5% on QQP and 20.1% on WikiQA.
arXiv Detail & Related papers (2020-10-10T21:56:27Z)