Active Learning for Skewed Data Sets
- URL: http://arxiv.org/abs/2005.11442v1
- Date: Sat, 23 May 2020 01:50:19 GMT
- Title: Active Learning for Skewed Data Sets
- Authors: Abbas Kazerouni and Qi Zhao and Jing Xie and Sandeep Tata and Marc
Najork
- Abstract summary: We focus on problems with two distinguishing characteristics: severe class imbalance (skew) and small amounts of initial training data.
We propose a hybrid active learning algorithm (HAL) that balances exploiting the knowledge available through the currently labeled training examples with exploring the large amount of unlabeled data.
- Score: 25.866341631677688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consider a sequential active learning problem where, at each round, an agent
selects a batch of unlabeled data points, queries their labels and updates a
binary classifier. While there exists a rich body of work on active learning in
this general form, in this paper, we focus on problems with two distinguishing
characteristics: severe class imbalance (skew) and small amounts of initial
training data. Both of these problems occur with surprising frequency in many
web applications. For instance, detecting offensive or sensitive content in
online communities (pornography, violence, and hate-speech) is receiving
enormous attention from industry as well as research communities. Such problems
have both of the characteristics we describe: the vast majority of content is not
offensive, so the number of positive examples is orders of magnitude smaller than
the number of negative examples. Furthermore, there is usually
only a small amount of initial training data available when building
machine-learned models to solve such problems. To address both these issues, we
propose a hybrid active learning algorithm (HAL) that balances exploiting the
knowledge available through the currently labeled training examples with
exploring the large amount of unlabeled data available. Through simulation
results, we show that HAL makes significantly better choices for what points to
label when compared to strong baselines like margin-sampling. Classifiers
trained on the examples selected for labeling by HAL easily outperform the
baselines on target metrics (like area under the precision-recall curve) given
the same budget for labeling examples. We believe HAL offers a simple,
intuitive, and computationally tractable way to structure active learning for a
wide range of machine learning applications.
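The abstract does not spell out the exact HAL selection rule, but the exploit/explore balance it describes can be illustrated with a simple batch-selection sketch: part of each batch is chosen by margin sampling (points closest to the decision boundary) and the rest is drawn uniformly at random from the unlabeled pool. This is a minimal sketch under those assumptions, not the paper's algorithm; the function name `hybrid_select` and the mixing parameter `explore_frac` are illustrative, not the authors' notation.

```python
import numpy as np

def hybrid_select(probs, batch_size, explore_frac=0.5, rng=None):
    """Pick a batch of unlabeled-pool indices by mixing margin sampling
    (exploitation) with uniform random sampling (exploration).

    probs        : (n,) predicted positive-class probabilities for the pool.
    batch_size   : number of points to send for labeling this round.
    explore_frac : fraction of the batch drawn at random (illustrative
                   mixing parameter, not taken from the paper).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(probs)
    n_explore = int(round(explore_frac * batch_size))
    n_exploit = batch_size - n_explore

    # Exploitation: for a binary classifier, margin sampling reduces to
    # choosing points whose scores are closest to the 0.5 decision boundary.
    margins = np.abs(np.asarray(probs) - 0.5)
    exploit_idx = np.argsort(margins)[:n_exploit]

    # Exploration: sample uniformly from the rest of the pool, which helps
    # surface rare positives the current model scores confidently (and wrongly).
    remaining = np.setdiff1d(np.arange(n), exploit_idx)
    explore_idx = rng.choice(remaining, size=min(n_explore, len(remaining)),
                             replace=False)

    return np.concatenate([exploit_idx, explore_idx])
```

Called each round with the current model's pool scores, e.g. `hybrid_select(model.predict_proba(pool)[:, 1], batch_size=100, explore_frac=0.25)`, the random slice guards against the cold-start failure mode where a tiny, skewed seed set makes pure margin sampling fixate on an uninformative region of the pool.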
Related papers
- DIRECT: Deep Active Learning under Imbalance and Label Noise [15.571923343398657]
We conduct the first study of active learning under both class imbalance and label noise.
We propose a novel algorithm that robustly identifies the class separation threshold and annotates the most uncertain examples.
Our results demonstrate that DIRECT can save more than 60% of the annotation budget compared to state-of-the-art active learning algorithms.
arXiv Detail & Related papers (2023-12-14T18:18:34Z)
- Virtual Category Learning: A Semi-Supervised Learning Method for Dense Prediction with Extremely Limited Labels [63.16824565919966]
This paper proposes to use confusing samples proactively without label correction.
A Virtual Category (VC) is assigned to each confusing sample in such a way that it can safely contribute to the model optimisation.
Our intriguing findings highlight the usage of VC learning in dense vision tasks.
arXiv Detail & Related papers (2023-12-02T16:23:52Z)
- HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques [48.82319198853359]
HardVis is a visual analytics system designed to handle instance hardness mainly in imbalanced classification scenarios.
Users can explore subsets of data from different perspectives to decide all those parameters.
The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case.
arXiv Detail & Related papers (2022-03-29T17:04:16Z)
- CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
- L2B: Learning to Bootstrap Robust Models for Combating Label Noise [52.02335367411447]
This paper introduces a simple and effective method, named Learning to Bootstrap (L2B).
It enables models to bootstrap themselves using their own predictions without being adversely affected by erroneous pseudo-labels.
It achieves this by dynamically adjusting the importance weight between real observed and generated labels, as well as between different samples through meta-learning.
arXiv Detail & Related papers (2022-02-09T05:57:08Z)
- Improving Contrastive Learning on Imbalanced Seed Data via Open-World Sampling [96.8742582581744]
We present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK).
MAK follows three simple principles: tailness, proximity, and diversity.
We demonstrate that MAK can consistently improve both the overall representation quality and the class balancedness of the learned features.
arXiv Detail & Related papers (2021-11-01T15:09:41Z)
- CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems showing that CvS is able to achieve much higher classification results compared to previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z)
- Deep Active Learning via Open Set Recognition [0.0]
In many applications, data is easy to acquire but expensive and time-consuming to label.
We formulate active learning as an open-set recognition problem.
Unlike current active learning methods, our algorithm can learn tasks without the need for task labels.
arXiv Detail & Related papers (2020-07-04T22:09:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.