Class prior estimation for positive-unlabeled learning when label shift occurs
- URL: http://arxiv.org/abs/2502.21194v1
- Date: Fri, 28 Feb 2025 16:12:53 GMT
- Title: Class prior estimation for positive-unlabeled learning when label shift occurs
- Authors: Jan Mielniczuk, Wojciech Rejchel, Paweł Teisseyre
- Abstract summary: We introduce a novel direct estimator of the class prior which avoids estimation of posterior probabilities. It is based on a distribution matching technique together with kernel embedding and is obtained as an explicit solution to an optimisation task. We study finite sample behaviour for synthetic and real data and show that the proposal, together with a suitably modified version for large values of the source prior, works on par with or better than its competitors.
- Score: 1.0514231683620516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study estimation of the class prior for unlabeled target samples, which may differ from that of the source population. It is assumed that for the source data only samples from the positive class and from the whole population are available (the PU learning scenario). We introduce a novel direct estimator of the class prior which avoids estimation of posterior probabilities and has a simple geometric interpretation. It is based on a distribution matching technique together with kernel embedding and is obtained as an explicit solution to an optimisation task. We establish its asymptotic consistency as well as a non-asymptotic bound on its deviation from the unknown prior, which is calculable in practice. We study finite sample behaviour on synthetic and real data and show that the proposal, together with a suitably modified version for large values of the source prior, works on par with or better than its competitors.
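The abstract does not give the estimator's closed form. As an illustration only, the distribution-matching idea with kernel mean embeddings can be sketched with an RBF kernel and a simple projection formula, pi_hat = &lt;mu_U, mu_P&gt; / ||mu_P||^2, clipped to [0, 1]. This projection heuristic and the kernel choice are assumptions for the sketch, not the authors' estimator.

```python
import numpy as np

def rbf_gram(X, Y, gamma=1.0):
    # Pairwise RBF kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def class_prior_projection(U, P, gamma=1.0):
    """Illustrative distribution-matching estimate of the class prior.

    Projects the empirical kernel mean embedding of the unlabeled target
    sample U onto that of the source positives P:
        pi_hat = <mu_U, mu_P>_H / ||mu_P||_H^2, clipped to [0, 1].
    A simplified geometric sketch, not the estimator from the paper.
    """
    inner = rbf_gram(U, P, gamma).mean()   # <mu_U, mu_P> via Gram-matrix mean
    norm2 = rbf_gram(P, P, gamma).mean()   # ||mu_P||^2
    return float(np.clip(inner / norm2, 0.0, 1.0))

rng = np.random.default_rng(0)
pos = rng.normal(loc=+4.0, size=(200, 2))   # observed source positives
neg = rng.normal(loc=-4.0, size=(200, 2))   # negatives (never observed directly)
unl = np.vstack([pos[:100], neg[:100]])     # unlabeled target, true prior 0.5
print(class_prior_projection(unl, pos))
```

When the two classes are well separated in the kernel's feature space, the projection recovers roughly the mixing proportion of positives in the unlabeled sample; the real estimator additionally handles label shift between source and target priors.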
Related papers
- Active Data Sampling and Generation for Bias Remediation [0.0]
A mixed active sampling and data generation strategy -- called samplation -- is proposed to compensate, during fine-tuning of a pre-trained classifier, for the unfair classifications it produces.
Using deep models for visual semantic role labeling as a case study, the proposed method was able to fully cure a simulated gender bias starting from a 90/10 imbalance.
arXiv Detail & Related papers (2025-03-26T10:42:15Z) - Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z) - SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed Semi-Supervised Learning [49.94607673097326]
We propose a highly adaptable framework, designated as SimPro, which does not rely on any predefined assumptions about the distribution of unlabeled data.
Our framework, grounded in a probabilistic model, innovatively refines the expectation-maximization algorithm.
Our method showcases consistent state-of-the-art performance across diverse benchmarks and data distribution scenarios.
arXiv Detail & Related papers (2024-02-21T03:39:04Z) - Joint empirical risk minimization for instance-dependent positive-unlabeled data [4.112909937203119]
Learning from positive and unlabeled data (PU learning) is an actively researched machine learning task.
The goal is to train a binary classification model from a dataset in which part of the positives are labeled, while the remaining instances are unlabeled.
The unlabeled set includes the remaining positives and all negative observations.
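The summary does not spell out the paper's joint empirical risk; as background, PU empirical risk minimization typically builds on the classical unbiased PU risk, which rewrites the (unavailable) negative-class risk using the positives and the unlabeled set, assuming a known class prior pi. A minimal sketch with logistic loss:

```python
import numpy as np

def logistic_loss(scores, y):
    # ell(s, y) = log(1 + exp(-y * s)) for labels y in {-1, +1}.
    return np.log1p(np.exp(-y * scores))

def unbiased_pu_risk(scores_pos, scores_unl, pi):
    """Classical unbiased PU risk estimate (a background sketch):
        R = pi * E_P[l(s,+1)] + E_U[l(s,-1)] - pi * E_P[l(s,-1)]
    The term E_U[l(s,-1)] - pi * E_P[l(s,-1)] is an unbiased proxy for
    (1 - pi) * E_N[l(s,-1)], which cannot be computed without negatives.
    """
    r_p_plus = logistic_loss(scores_pos, +1).mean()
    r_p_minus = logistic_loss(scores_pos, -1).mean()
    r_u_minus = logistic_loss(scores_unl, -1).mean()
    return pi * r_p_plus + (r_u_minus - pi * r_p_minus)
```

If the unlabeled scores are exactly a pi-mixture of positive and negative scores, this quantity coincides with the fully supervised risk, which is the sense in which the estimator is unbiased.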
arXiv Detail & Related papers (2023-12-27T12:45:12Z) - Open-Sampling: Exploring Out-of-Distribution data for Re-balancing Long-tailed datasets [24.551465814633325]
Deep neural networks usually perform poorly when the training dataset suffers from extreme class imbalance.
Recent studies found that directly training with out-of-distribution data in a semi-supervised manner would harm the generalization performance.
We propose a novel method called Open-sampling, which utilizes open-set noisy labels to re-balance the class priors of the training dataset.
arXiv Detail & Related papers (2022-06-17T14:29:52Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled examples whose confidence exceeds the threshold.
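The ATC rule can be sketched in a few lines: choose the threshold so that the fraction of source validation scores falling below it matches the source error rate, then predict target accuracy as the fraction of target scores at or above it. The choice of confidence score (e.g. maximum softmax probability) is an assumption of this sketch:

```python
import numpy as np

def atc_predict_accuracy(src_scores, src_correct, tgt_scores):
    """Average Thresholded Confidence (ATC), minimal sketch.

    src_scores:  confidence scores on labeled source (validation) data
    src_correct: boolean array, whether each source prediction was correct
    tgt_scores:  confidence scores on unlabeled target data
    """
    err = 1.0 - np.mean(src_correct)          # source error rate
    t = np.quantile(src_scores, err)          # fraction err of source scores fall below t
    return float(np.mean(tgt_scores >= t))    # predicted target accuracy
```

By construction, applying the rule back to the source scores reproduces the source accuracy; the method's value is that the same threshold often transfers usefully to a shifted target distribution.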
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z) - Meta-Learning Conjugate Priors for Few-Shot Bayesian Optimization [0.0]
We propose a novel approach to utilize meta-learning to automate the estimation of informative conjugate prior distributions.
From this process we generate priors that require only a few data points to estimate the shape parameters of the original data distribution.
arXiv Detail & Related papers (2021-01-03T23:58:32Z) - Performance-Agnostic Fusion of Probabilistic Classifier Outputs [2.4206828137867107]
We propose a method for combining probabilistic outputs of classifiers to make a single consensus class prediction.
Our proposed method works well in situations where accuracy is the performance metric.
It does not output calibrated probabilities, so it is not suitable in situations where such probabilities are required for further processing.
arXiv Detail & Related papers (2020-09-01T16:53:29Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z) - Bayesian Semi-supervised Multi-category Classification under Nonparanormality [2.307581190124002]
Semi-supervised learning is a model training method that uses both labeled and unlabeled data.
This paper proposes a fully Bayes semi-supervised learning algorithm that can be applied to any multi-category classification problem.
arXiv Detail & Related papers (2020-01-11T21:31:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.