Improving Positive Unlabeled Learning: Practical AUL Estimation and New
Training Method for Extremely Imbalanced Data Sets
- URL: http://arxiv.org/abs/2004.09820v1
- Date: Tue, 21 Apr 2020 08:32:57 GMT
- Title: Improving Positive Unlabeled Learning: Practical AUL Estimation and New
Training Method for Extremely Imbalanced Data Sets
- Authors: Liwei Jiang, Dan Li, Qisheng Wang, Shuai Wang, Songtao Wang
- Abstract summary: We improve Positive Unlabeled (PU) learning over the state of the art in two respects.
First, we propose an asymptotically unbiased practical AUL estimation method, which makes use of raw PU data without prior knowledge of the unlabeled samples.
Second, we propose ProbTagging, a new training method for extremely imbalanced data sets.
- Score: 10.870831090350402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Positive Unlabeled (PU) learning is widely used in many applications where a
binary classifier is trained on data sets consisting of only positive and
unlabeled samples. In this paper, we improve PU learning over the state of the
art in two respects. Firstly, existing model evaluation methods for PU learning
require ground truth for the unlabeled samples, which is unlikely to be available
in practice. To relax this restriction, we propose an asymptotically
unbiased practical AUL (area under the lift curve) estimation method, which makes use
of raw PU data without prior knowledge of the unlabeled samples.
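The lift at cut depth k is the labeled-positive rate among the top-k scored samples divided by the overall labeled-positive rate, so the estimator needs nothing beyond classifier scores and the raw PU labels. Below is a minimal sketch of such a practical AUL computation, assuming a uniform average over all cut depths as the integration rule; the function name and that choice are illustrative, not the authors' code.

```python
import numpy as np

def practical_aul(scores, pu_labels):
    """Estimate AUL (area under the lift curve) from raw PU data.

    pu_labels: 1 for labeled positives, 0 for unlabeled samples;
    no ground truth for the unlabeled samples is required.
    """
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(pu_labels, dtype=float)
    y = y[np.argsort(-scores)]            # rank samples by descending score
    n = len(y)
    base_rate = y.mean()                  # overall labeled-positive rate
    cum_pos = np.cumsum(y)                # labeled positives captured in top k
    k = np.arange(1, n + 1)
    lift = (cum_pos / k) / base_rate      # lift at every cut depth k
    return lift.mean()                    # uniform average over all cut depths

# Scores from any PU classifier can be plugged in directly, e.g.
# practical_aul(model.predict_proba(X)[:, 1], s), where s marks labeled positives.
```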
Secondly, we propose ProbTagging, a new training method for extremely
imbalanced data sets, where the number of unlabeled samples is hundreds or
thousands of times that of positive samples. ProbTagging introduces probability
into the aggregation method. Specifically, each unlabeled sample is tagged
positive or negative with a probability calculated from its similarity to
its positive neighbors. Based on these tags, multiple data sets are generated to
train different models, which are then combined into an ensemble model.
Experimental results on three industrial and two artificial PU data sets show
that ProbTagging increases the AUC by up to 10% compared to state-of-the-art
work.
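A minimal sketch of the tag-then-ensemble procedure described above; the k-nearest-neighbor distance-based similarity and the gradient-boosting base learner are illustrative assumptions, since the abstract does not fix either choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import GradientBoostingClassifier

def probtagging_fit(X_pos, X_unl, n_models=10, k=5, seed=0):
    """ProbTagging-style training sketch (not the authors' implementation)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k).fit(X_pos)
    dist, _ = nn.kneighbors(X_unl)
    # Turn the mean distance to the k nearest positives into a similarity-based
    # tagging probability in (0, 1]; the exact mapping here is an assumption.
    p_pos = 1.0 / (1.0 + dist.mean(axis=1))
    X = np.vstack([X_pos, X_unl])
    models = []
    for _ in range(n_models):
        # Tag each unlabeled sample positive with probability p_pos,
        # producing one randomly tagged training set per ensemble member.
        tags = (rng.random(len(X_unl)) < p_pos).astype(int)
        y = np.concatenate([np.ones(len(X_pos)), tags])
        models.append(GradientBoostingClassifier().fit(X, y))
    return models

def probtagging_predict(models, X):
    # Ensemble by averaging the members' positive-class probabilities.
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```

Drawing a fresh random tagging for each member is what makes the ensemble diverse; averaging the members' probabilities then smooths out individual tagging mistakes.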
Related papers
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z) - Robust Positive-Unlabeled Learning via Noise Negative Sample
Self-correction [48.929877651182885]
Learning from positive and unlabeled data is known as positive-unlabeled (PU) learning in the literature.
We propose a new robust PU learning method with a training strategy motivated by the nature of human learning.
arXiv Detail & Related papers (2023-08-01T04:34:52Z) - Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation when the SCAR assumption does not hold [2.76815720120527]
Positive and Unlabeled (PU) learning is a type of semi-supervised binary classification.
PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain.
We propose two PU learning algorithms to estimate $\alpha$, calculate probabilities for PU instances, and improve classification metrics.
arXiv Detail & Related papers (2023-03-14T23:16:22Z) - Dist-PU: Positive-Unlabeled Learning from a Label Distribution
Perspective [89.5370481649529]
We propose a label distribution perspective for PU learning in this paper.
Motivated by this view, we pursue consistency between the predicted and ground-truth label distributions.
Experiments on three benchmark datasets validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-06T07:38:29Z) - Positive Unlabeled Contrastive Learning [14.975173394072053]
We extend the self-supervised pretraining paradigm to the classical positive unlabeled (PU) setting.
We develop a simple methodology to pseudo-label the unlabeled samples using a new PU-specific clustering scheme.
Our method handily outperforms state-of-the-art PU methods over several standard PU benchmark datasets.
arXiv Detail & Related papers (2022-06-01T20:16:32Z) - Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
arXiv Detail & Related papers (2021-10-20T12:25:22Z) - Positive-Unlabeled Classification under Class-Prior Shift: A
Prior-invariant Approach Based on Density Ratio Estimation [85.75352990739154]
We propose a novel PU classification method based on density ratio estimation.
A notable advantage of our proposed method is that it does not require the class priors in the training phase.
arXiv Detail & Related papers (2021-07-11T13:36:53Z) - Rethinking InfoNCE: How Many Negative Samples Do You Need? [54.146208195806636]
We study how many negative samples are optimal for InfoNCE in different scenarios via a semi-quantitative theoretical framework.
We estimate the optimal negative sampling ratio using the $K$ value that maximizes the training effectiveness function.
arXiv Detail & Related papers (2021-05-27T08:38:29Z) - MixPUL: Consistency-based Augmentation for Positive and Unlabeled
Learning [8.7382177147041]
We propose MixPUL, a simple yet effective data augmentation method based on consistency regularization.
MixPUL incorporates supervised and unsupervised consistency training to generate augmented data.
We show that MixPUL reduces the average classification error from 16.49 to 13.09 on the CIFAR-10 dataset across different amounts of positive data.
arXiv Detail & Related papers (2020-04-20T15:43:33Z) - Learning from Positive and Unlabeled Data with Arbitrary Positive Shift [11.663072799764542]
This paper shows that PU learning is possible even with arbitrarily non-representative positive data given unlabeled data.
We integrate this into two statistically consistent methods to address arbitrary positive bias.
Experimental results demonstrate our methods' effectiveness across numerous real-world datasets.
arXiv Detail & Related papers (2020-02-24T13:53:22Z)