Active Learning for Skewed Data Sets
- URL: http://arxiv.org/abs/2005.11442v1
- Date: Sat, 23 May 2020 01:50:19 GMT
- Title: Active Learning for Skewed Data Sets
- Authors: Abbas Kazerouni and Qi Zhao and Jing Xie and Sandeep Tata and Marc Najork
- Abstract summary: We focus on problems with two distinguishing characteristics: severe class imbalance (skew) and small amounts of initial training data.
We propose a hybrid active learning algorithm (HAL) that balances exploiting the knowledge available through the currently labeled training examples with exploring the large amount of unlabeled data.
- Score: 25.866341631677688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consider a sequential active learning problem where, at each round, an agent
selects a batch of unlabeled data points, queries their labels and updates a
binary classifier. While there exists a rich body of work on active learning in
this general form, in this paper, we focus on problems with two distinguishing
characteristics: severe class imbalance (skew) and small amounts of initial
training data. Both of these problems occur with surprising frequency in many
web applications. For instance, detecting offensive or sensitive content in
online communities (pornography, violence, and hate-speech) is receiving
enormous attention from industry as well as research communities. Such problems
have both the characteristics we describe -- a vast majority of content is not
offensive, so the number of positive examples for such content is orders of
magnitude smaller than the negative examples. Furthermore, there is usually
only a small amount of initial training data available when building
machine-learned models to solve such problems. To address both these issues, we
propose a hybrid active learning algorithm (HAL) that balances exploiting the
knowledge available through the currently labeled training examples with
exploring the large amount of unlabeled data available. Through simulation
results, we show that HAL makes significantly better choices for what points to
label when compared to strong baselines like margin-sampling. Classifiers
trained on the examples selected for labeling by HAL easily outperform the
baselines on target metrics (like area under the precision-recall curve) given
the same budget for labeling examples. We believe HAL offers a simple,
intuitive, and computationally tractable way to structure active learning for a
wide range of machine learning applications.
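The abstract describes HAL only at a high level. As a minimal sketch of the exploit/explore idea, assuming margin sampling for exploitation and uniform random draws for exploration (the paper's exact mixing scheme and exploration strategy may differ), batch selection could look like:

```python
import numpy as np

def hybrid_batch(probs, batch_size, explore_frac=0.5, rng=None):
    """Select a batch of unlabeled indices by mixing margin sampling
    (exploitation) with uniform random sampling (exploration).

    probs: classifier's positive-class probability for each unlabeled point.
    This is an illustrative sketch, not the paper's exact algorithm.
    """
    rng = np.random.default_rng(rng)
    n_explore = int(round(explore_frac * batch_size))
    n_exploit = batch_size - n_explore
    # Exploitation: points closest to the decision boundary (smallest margin).
    margins = np.abs(probs - 0.5)
    exploit = np.argsort(margins)[:n_exploit]
    # Exploration: uniform draw from the remaining unlabeled pool.
    rest = np.setdiff1d(np.arange(len(probs)), exploit)
    explore = rng.choice(rest, size=n_explore, replace=False)
    return np.concatenate([exploit, explore])
```

Under severe skew, the exploration slice matters because margin sampling alone can keep querying near a poorly placed boundary and never discover rare positives elsewhere in the pool.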
Related papers
- Probably Approximately Precision and Recall Learning [62.912015491907994]
Precision and Recall are foundational metrics in machine learning.
One-sided feedback--where only positive examples are observed during training--is inherent in many practical problems.
We introduce a PAC learning framework where each hypothesis is represented by a graph, with edges indicating positive interactions.
arXiv Detail & Related papers (2024-11-20T04:21:07Z)
- DIRECT: Deep Active Learning under Imbalance and Label Noise [15.571923343398657]
We conduct the first study of active learning under both class imbalance and label noise.
We propose a novel algorithm that robustly identifies the class separation threshold and annotates the most uncertain examples.
Our results demonstrate that DIRECT can save more than 60% of the annotation budget compared to state-of-the-art active learning algorithms.
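As a rough illustration of the two-step idea in that summary (estimate a class-separation threshold, then annotate the most uncertain points relative to it), one could write the following sketch. The threshold estimator, grid, and scoring here are assumptions for illustration; DIRECT's actual procedure differs:

```python
import numpy as np

def select_near_threshold(probs_labeled, labels, probs_unlabeled, k):
    """Pick a separation threshold on labeled scores (here: the grid value
    maximizing balanced accuracy), then return the k unlabeled points whose
    scores lie closest to it. A sketch of the idea, not DIRECT itself."""
    grid = np.linspace(0.05, 0.95, 19)

    def balanced_acc(t):
        pred = probs_labeled >= t
        tpr = np.mean(pred[labels == 1]) if np.any(labels == 1) else 0.0
        tnr = np.mean(~pred[labels == 0]) if np.any(labels == 0) else 0.0
        return 0.5 * (tpr + tnr)

    t_star = grid[np.argmax([balanced_acc(t) for t in grid])]
    # Most uncertain = closest to the estimated separation threshold.
    return np.argsort(np.abs(probs_unlabeled - t_star))[:k]
```

Anchoring uncertainty to an estimated threshold rather than a fixed 0.5 is what makes this robust under imbalance, where a well-calibrated boundary rarely sits at 0.5.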
arXiv Detail & Related papers (2023-12-14T18:18:34Z)
- HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques [48.82319198853359]
HardVis is a visual analytics system designed to handle instance hardness mainly in imbalanced classification scenarios.
Users can explore subsets of data from different perspectives to decide all those parameters.
The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case.
arXiv Detail & Related papers (2022-03-29T17:04:16Z)
- CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
- L2B: Learning to Bootstrap Robust Models for Combating Label Noise [52.02335367411447]
This paper introduces a simple and effective method, named Learning to Bootstrap (L2B)
It enables models to bootstrap themselves using their own predictions without being adversely affected by erroneous pseudo-labels.
It achieves this by dynamically adjusting the importance weight between real observed and generated labels, as well as between different samples through meta-learning.
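The core of such bootstrapping is a loss whose target blends the observed (possibly noisy) label with the model's own prediction. A minimal binary sketch with a fixed blending weight, in place of the per-sample weights that L2B actually learns via meta-learning:

```python
import numpy as np

def bootstrap_loss(logits, observed, beta=0.8):
    """Soft-bootstrap binary cross-entropy: the target is a convex mix of
    the observed label and the model's prediction. L2B meta-learns the
    weights per sample; beta here is a fixed illustrative constant."""
    p = 1.0 / (1.0 + np.exp(-logits))            # sigmoid probabilities
    target = beta * observed + (1.0 - beta) * p  # blended soft target
    eps = 1e-12                                  # avoid log(0)
    return -np.mean(target * np.log(p + eps)
                    + (1.0 - target) * np.log(1.0 - p + eps))
```

With beta=1 this reduces to ordinary cross-entropy on the observed labels; lowering beta lets a confident model discount labels it believes are corrupted.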
arXiv Detail & Related papers (2022-02-09T05:57:08Z)
- Improving Contrastive Learning on Imbalanced Seed Data via Open-World Sampling [96.8742582581744]
We present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK)
MAK follows three simple principles: tailness, proximity, and diversity.
We demonstrate that MAK can consistently improve both the overall representation quality and the class balancedness of the learned features.
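The "proximity" and "diversity" principles echo classic greedy k-center selection over feature space. A sketch of that building block follows; note that MAK's actual scoring also incorporates tailness and model-awareness, which this omits:

```python
import numpy as np

def k_center_greedy(features, seed_idx, k):
    """Greedily pick k points, each time taking the candidate farthest
    from everything selected so far (maximizing coverage/diversity).

    features: (n, d) array of embeddings; seed_idx: already-selected rows.
    """
    selected = list(seed_idx)
    # Distance from every point to its nearest already-selected point.
    dists = np.min(
        np.linalg.norm(features[:, None] - features[selected], axis=-1),
        axis=1)
    for _ in range(k):
        nxt = int(np.argmax(dists))              # farthest point wins
        selected.append(nxt)
        # Update nearest-selected distances with the new center.
        dists = np.minimum(
            dists, np.linalg.norm(features - features[nxt], axis=1))
    return selected[len(seed_idx):]
```

Because a selected point's distance drops to zero, the same index is never chosen twice, and each pick maximally extends the coverage of the selected set.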
arXiv Detail & Related papers (2021-11-01T15:09:41Z)
- CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems showing that CvS is able to achieve much higher classification results compared to previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z)
- Deep Active Learning via Open Set Recognition [0.0]
In many applications, data is easy to acquire but expensive and time-consuming to label.
We formulate active learning as an open-set recognition problem.
Unlike current active learning methods, our algorithm can learn tasks without the need for task labels.
arXiv Detail & Related papers (2020-07-04T22:09:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.