Split-PU: Hardness-aware Training Strategy for Positive-Unlabeled Learning
- URL: http://arxiv.org/abs/2211.16756v1
- Date: Wed, 30 Nov 2022 05:48:31 GMT
- Title: Split-PU: Hardness-aware Training Strategy for Positive-Unlabeled Learning
- Authors: Chengming Xu, Chen Liu, Siqian Yang, Yabiao Wang, Shijie Zhang, Lijie
Jia, Yanwei Fu
- Abstract summary: Positive-Unlabeled (PU) learning aims to learn a model with rare positive samples and abundant unlabeled samples.
This paper focuses on improving the commonly-used nnPU with a novel training pipeline.
- Score: 42.26185670834855
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Positive-Unlabeled (PU) learning aims to learn a model from rare positive
samples and abundant unlabeled samples. Compared with classical binary
classification, PU learning is much more challenging due to the
many incompletely-annotated data instances. Since only a subset of the
most confident positive samples is labeled and the evidence is insufficient to
categorize the remaining samples, many of the unlabeled instances may in fact be
positive. Research on this topic is particularly useful and essential
for real-world tasks with very expensive labeling costs. For
example, recognition tasks in disease diagnosis, recommendation systems and
satellite image recognition may have only a few positive samples that can be
annotated by experts. Existing methods largely overlook the intrinsic hardness of
some unlabeled data, which can result in sub-optimal performance as a
consequence of fitting the easy noisy data while not sufficiently utilizing the
hard data. In this paper, we focus on improving the commonly-used nnPU with a
novel training pipeline. We highlight the intrinsic difference in hardness among
samples in the dataset and the corresponding learning strategies appropriate for
easy and hard data. Accordingly, we propose to first split the unlabeled
dataset with an early-stop strategy: samples whose predictions are inconsistent
between the temporary (early-stopped) model and the base model are treated as hard
samples. The model then uses a noise-tolerant Jensen-Shannon divergence
loss for the easy data, and a dual-source consistency regularization for the hard
data that combines cross-consistency between the student and base models on
low-level features with self-consistency on high-level features and predictions.
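To make the pipeline concrete, below is a minimal PyTorch-style sketch of the components the abstract describes: the standard non-negative PU (nnPU) risk estimator of Kiryo et al. (2017), a noise-tolerant Jensen-Shannon divergence loss, and a disagreement-based split of the unlabeled set into easy and hard subsets. The function names, the loader that yields sample indices, and the exact disagreement criterion are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def nnpu_risk(scores_pos, scores_unl, prior):
    """Non-negative PU risk (Kiryo et al., 2017) with the logistic
    surrogate loss ell(z, y) = softplus(-y * z) on scalar scores."""
    risk_pos = F.softplus(-scores_pos).mean()        # positives treated as +1
    risk_pos_as_neg = F.softplus(scores_pos).mean()  # positives treated as -1
    risk_unl_as_neg = F.softplus(scores_unl).mean()  # unlabeled treated as -1
    neg_risk = risk_unl_as_neg - prior * risk_pos_as_neg
    # Clamp the estimated negative-class risk at zero (the "non-negative" trick).
    return prior * risk_pos + torch.clamp(neg_risk, min=0.0)

def js_loss(logits, targets, eps=1e-7):
    """Jensen-Shannon divergence between predicted class probabilities and
    (one-hot or soft) targets; bounded, hence tolerant to label noise."""
    p = F.softmax(logits, dim=1)
    m = 0.5 * (p + targets)
    kl = lambda a, b: (a * (a.clamp_min(eps).log() - b.clamp_min(eps).log())).sum(dim=1)
    return (0.5 * kl(p, m) + 0.5 * kl(targets, m)).mean()

@torch.no_grad()
def split_easy_hard(base_model, temp_model, unlabeled_loader, device="cpu"):
    """Flag an unlabeled sample as 'hard' when the early-stopped temporary
    model and the base model disagree on its predicted class (both models
    are assumed to output per-class logits)."""
    easy_idx, hard_idx = [], []
    for x, idx in unlabeled_loader:  # loader assumed to yield (input, index)
        x = x.to(device)
        disagree = base_model(x).argmax(dim=1) != temp_model(x).argmax(dim=1)
        hard_idx += idx[disagree.cpu()].tolist()
        easy_idx += idx[(~disagree).cpu()].tolist()
    return easy_idx, hard_idx
```

In the full pipeline, the hard subset would additionally receive the dual-source consistency regularization (cross-consistency with the base model on low-level features plus self-consistency on high-level features and predictions), which this sketch omits.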
Related papers
- Learning with Instance-Dependent Noisy Labels by Anchor Hallucination and Hard Sample Label Correction [12.317154103998433]
Traditional Noisy-Label Learning (NLL) methods categorize training data into clean and noisy sets based on the loss distribution of training samples (a generic sketch of this loss-based partition appears after this list).
Our approach explicitly distinguishes between clean vs. noisy and easy vs. hard samples.
Corrected hard samples, along with the easy samples, are used as labeled data in subsequent semi-supervised training.
arXiv Detail & Related papers (2024-07-10T03:00:14Z)
- Late Stopping: Avoiding Confidently Learning from Mislabeled Examples [61.00103151680946]
We propose a new framework, Late Stopping, which leverages the intrinsic robust learning ability of DNNs through a prolonged training process.
We empirically observe that mislabeled and clean examples exhibit differences in the number of epochs required for them to be consistently and correctly classified.
Experimental results on benchmark-simulated and real-world noisy datasets demonstrate that the proposed method outperforms state-of-the-art counterparts.
arXiv Detail & Related papers (2023-08-26T12:43:25Z)
- Robust Positive-Unlabeled Learning via Noise Negative Sample Self-correction [48.929877651182885]
Learning from positive and unlabeled data is known as positive-unlabeled (PU) learning in the literature.
We propose a new robust PU learning method with a training strategy motivated by the nature of human learning.
arXiv Detail & Related papers (2023-08-01T04:34:52Z)
- Differences Between Hard and Noisy-labeled Samples: An Empirical Study [7.132368785057315]
Separating noisy or incorrectly labeled samples from the hard/difficult samples in a labeled dataset is an important yet under-explored topic.
We introduce a simple yet effective metric that filters out noisy-labeled samples while keeping the hard samples.
Our proposed data partitioning method significantly outperforms other methods when employed within a semi-supervised learning framework.
arXiv Detail & Related papers (2023-07-20T09:24:23Z)
- Temporal Output Discrepancy for Loss Estimation-based Active Learning [65.93767110342502]
We present a novel deep active learning approach that queries the oracle for annotation when an unlabeled sample is believed to incur a high loss.
Our approach outperforms state-of-the-art active learning methods on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2022-12-20T19:29:37Z)
- DiscrimLoss: A Universal Loss for Hard Samples and Incorrect Samples Discrimination [28.599571524763785]
Given data with label noise (i.e., incorrectly labeled data), deep neural networks gradually memorize the label noise, which impairs model performance.
To relieve this issue, curriculum learning is proposed to improve model performance and generalization by ordering training samples in a meaningful sequence.
arXiv Detail & Related papers (2022-08-21T13:38:55Z)
- Label-Noise Learning with Intrinsically Long-Tailed Data [65.41318436799993]
We propose a learning framework for label-noise learning with intrinsically long-tailed data.
Specifically, we propose two-stage bi-dimensional sample selection (TABASCO) to better separate clean samples from noisy samples.
arXiv Detail & Related papers (2022-08-21T07:47:05Z)
- An analysis of over-sampling labeled data in semi-supervised learning with FixMatch [66.34968300128631]
Most semi-supervised learning methods over-sample labeled data when constructing training mini-batches.
This paper studies whether this common practice improves learning and how.
We compare it to an alternative setting where each mini-batch is uniformly sampled from all the training data, labeled or not.
arXiv Detail & Related papers (2022-01-03T12:22:26Z)
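Several entries above (the first paper and DiscrimLoss in particular) start from the same primitive: partitioning training data by the per-sample loss distribution. The sketch below shows one common realization of that partition, fitting a two-component Gaussian mixture to per-sample cross-entropy losses in the style popularized by DivideMix; the loader interface that yields sample indices is an assumption for illustration.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.mixture import GaussianMixture

@torch.no_grad()
def partition_by_loss(model, loader, device="cpu"):
    """Split training samples into likely-clean / likely-noisy index sets by
    fitting a 2-component GMM to per-sample cross-entropy losses
    (the 'small-loss' criterion: clean samples tend to have lower loss)."""
    model.eval()
    losses, indices = [], []
    for x, y, idx in loader:  # loader assumed to yield (input, label, index)
        loss = F.cross_entropy(model(x.to(device)), y.to(device), reduction="none")
        losses.append(loss.cpu())
        indices.append(idx)
    losses = torch.cat(losses).numpy().reshape(-1, 1)
    indices = torch.cat(indices).numpy()
    gmm = GaussianMixture(n_components=2).fit(losses)
    clean_comp = int(np.argmin(gmm.means_.ravel()))  # low-mean component = clean
    p_clean = gmm.predict_proba(losses)[:, clean_comp]
    return indices[p_clean > 0.5], indices[p_clean <= 0.5]
```

Methods like the hard-sample label-correction paper above then go further, separating genuinely hard (but clean) samples from noisy ones rather than treating low confidence and label noise as the same thing.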