Semi-supervised learning objectives as log-likelihoods in a generative model of data curation
- URL: http://arxiv.org/abs/2008.05913v2
- Date: Fri, 8 Oct 2021 06:49:36 GMT
- Title: Semi-supervised learning objectives as log-likelihoods in a generative model of data curation
- Authors: Stoil Ganev, Laurence Aitchison
- Abstract summary: We formulate SSL objectives as a log-likelihood in a generative model of data curation.
We give a proof-of-principle for Bayesian SSL on toy data.
- Score: 32.45282187405337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We currently do not have an understanding of semi-supervised learning (SSL)
objectives such as pseudo-labelling and entropy minimization as
log-likelihoods, which precludes the development of e.g. Bayesian SSL. Here, we
note that benchmark image datasets such as CIFAR-10 are carefully curated, and
we formulate SSL objectives as a log-likelihood in a generative model of data
curation that was initially developed to explain the cold-posterior effect
(Aitchison 2020). SSL methods, from entropy minimization and pseudo-labelling to state-of-the-art techniques similar to FixMatch, can be understood as lower bounds on our principled log-likelihood. We are thus able to give a
proof-of-principle for Bayesian SSL on toy data. Finally, our theory suggests
that SSL is effective in part due to the statistical patterns induced by data
curation. This provides an explanation of past results which show SSL performs
better on clean datasets without any "out of distribution" examples. Confirming these results, we find that SSL gave much larger performance improvements on
curated than on uncurated data, using matched curated and uncurated datasets
based on Galaxy Zoo 2.
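The curation model makes this connection concrete. Assuming S i.i.d. annotators who must all agree on a label for a point to enter the curated dataset, and who share the model's predictive distribution p(y|x), the log-likelihood of an unlabeled-but-curated point is log Σ_y p(y|x)^S. The following is a minimal sketch (ours, not the paper's code) checking numerically that the pseudo-labelling and entropy-minimization objectives are lower bounds on this quantity:

```python
import torch
import torch.nn.functional as F

def curation_log_lik(logits, S=2):
    """log sum_y p(y|x)^S: log-probability that S independent annotators,
    each sampling from p(y|x), all agree on some label (done in log space)."""
    log_p = F.log_softmax(logits, dim=-1)
    return torch.logsumexp(S * log_p, dim=-1)

def pseudo_label_bound(logits, S=2):
    """Keep only the argmax term of the sum: S * log p(y_hat|x).
    Maximizing this is (scaled) pseudo-labelling."""
    return S * F.log_softmax(logits, dim=-1).max(dim=-1).values

def entropy_min_bound(logits, S=2):
    """Jensen's inequality with q = p(y|x) gives -(S-1) * H[p(y|x)].
    Maximizing this is entropy minimization."""
    log_p = F.log_softmax(logits, dim=-1)
    return (S - 1) * (log_p.exp() * log_p).sum(dim=-1)  # = -(S-1) H[p]

logits = torch.randn(4, 10)  # a batch of 4 unlabeled points, 10 classes
ll = curation_log_lik(logits)
assert (pseudo_label_bound(logits) <= ll + 1e-6).all()
assert (entropy_min_bound(logits) <= ll + 1e-6).all()
```

Note that with S = 1 the likelihood is identically zero (every point passes curation), so the unlabeled term only contributes when S > 1; this matches the paper's claim that SSL draws signal from the statistical patterns induced by curation.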
Related papers
- Semi-Supervised Sparse Gaussian Classification: Provable Benefits of Unlabeled Data [6.812609988733991]
We study SSL for high-dimensional Gaussian classification.
We analyze information-theoretic lower bounds for accurate feature selection.
We present simulations that complement our theoretical analysis.
arXiv Detail & Related papers (2024-09-05T08:21:05Z) - A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification [51.35500308126506]
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels.
We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types.
arXiv Detail & Related papers (2024-07-16T23:17:36Z) - Reinforcement Learning-Guided Semi-Supervised Learning [20.599506122857328]
We propose a novel Reinforcement Learning Guided SSL method, RLGSSL, that formulates SSL as a one-armed bandit problem.
RLGSSL incorporates a carefully designed reward function that balances the use of labeled and unlabeled data to enhance generalization performance.
We demonstrate the effectiveness of RLGSSL through extensive experiments on several benchmark datasets and show that our approach achieves consistent superior performance compared to state-of-the-art SSL methods.
arXiv Detail & Related papers (2024-05-02T21:52:24Z) - Can semi-supervised learning use all the data effectively? A lower bound perspective [58.71657561857055]
We show that semi-supervised learning algorithms can leverage unlabeled data to improve over the labeled sample complexity of supervised learning algorithms.
Our work suggests that, while proving performance gains for SSL algorithms is possible, it requires careful tracking of constants.
arXiv Detail & Related papers (2023-11-30T13:48:50Z) - Progressive Feature Adjustment for Semi-supervised Learning from Pretrained Models [39.42802115580677]
Semi-supervised learning (SSL) can leverage both labeled and unlabeled data to build a predictive model.
Recent literature suggests that naively applying state-of-the-art SSL with a pretrained model fails to unleash the full potential of training data.
We propose to use pseudo-labels from the unlabelled data to update the feature extractor in a way that is less sensitive to incorrect labels (a generic version of this recipe is sketched after this list).
arXiv Detail & Related papers (2023-09-09T01:57:14Z) - Improving Open-Set Semi-Supervised Learning with Self-Supervision [13.944469874692459]
Open-set semi-supervised learning (OSSL) embodies a practical scenario within semi-supervised learning.
We propose an OSSL framework that facilitates learning from all unlabeled data through self-supervision.
Our method yields state-of-the-art results on many of the evaluated benchmark problems.
arXiv Detail & Related papers (2023-01-24T16:46:37Z) - Semi-Leak: Membership Inference Attacks Against Semi-supervised Learning [42.089020844936805]
Semi-supervised learning (SSL) leverages both labeled and unlabeled data to train machine learning (ML) models.
We propose the first data augmentation-based membership inference attacks against ML models trained by SSL.
Our evaluation shows that the proposed attack can consistently outperform existing membership inference attacks.
arXiv Detail & Related papers (2022-07-25T21:17:24Z) - OpenLDN: Learning to Discover Novel Classes for Open-World Semi-Supervised Learning [110.40285771431687]
Semi-supervised learning (SSL) is one of the dominant approaches to address the annotation bottleneck of supervised learning.
Recent SSL methods can effectively leverage a large repository of unlabeled data to improve performance while relying on a small set of labeled data.
This work introduces OpenLDN, which utilizes a pairwise similarity loss to discover novel classes.
arXiv Detail & Related papers (2022-07-05T18:51:05Z) - Collaborative Intelligence Orchestration: Inconsistency-Based Fusion of Semi-Supervised Learning and Active Learning [60.26659373318915]
Active learning (AL) and semi-supervised learning (SSL) are two effective, but often isolated, means to alleviate the data-hungry problem.
We propose an innovative inconsistency-based virtual adversarial algorithm to further investigate the potential superiority of combining SSL and AL.
Two real-world case studies visualize the practical industrial value of applying and deploying the proposed data sampling algorithm.
arXiv Detail & Related papers (2022-06-07T13:28:43Z) - Self-supervised Learning is More Robust to Dataset Imbalance [65.84339596595383]
We investigate self-supervised learning under dataset imbalance.
Off-the-shelf self-supervised representations are already more robust to class imbalance than supervised representations.
We devise a re-weighted regularization technique that consistently improves the SSL representation quality on imbalanced datasets.
arXiv Detail & Related papers (2021-10-11T06:29:56Z) - Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for Open-Set Semi-Supervised Learning [101.28281124670647]
Open-set semi-supervised learning (open-set SSL) investigates a challenging but practical scenario where out-of-distribution (OOD) samples are contained in the unlabeled data.
We propose a novel training mechanism that could effectively exploit the presence of OOD data for enhanced feature learning.
Our approach substantially lifts the performance on open-set SSL and outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2021-08-12T09:14:44Z)
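Several entries above, like the Progressive Feature Adjustment paper, build on the same pseudo-labelling recipe. As a point of reference, here is a minimal, hypothetical sketch of one confidence-masked pseudo-label update of a feature extractor; the `encoder`/`head` split, the optimizer, and the 0.95 threshold are our assumptions for illustration, not details from that paper:

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(encoder, head, x_unlabeled, opt, threshold=0.95):
    """One update of the feature extractor from pseudo-labels. Keeping only
    high-confidence predictions reduces the influence of incorrect labels
    (the paper's actual robustness mechanism may differ)."""
    with torch.no_grad():
        probs = F.softmax(head(encoder(x_unlabeled)), dim=-1)
        conf, y_hat = probs.max(dim=-1)
        mask = conf >= threshold              # keep confident examples only
    if not mask.any():
        return None                           # nothing confident this batch
    logits = head(encoder(x_unlabeled[mask]))
    loss = F.cross_entropy(logits, y_hat[mask])  # train on pseudo-labels
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In practice this step would be interleaved with the supervised loss on labeled data; the confidence mask plays the same role as the fixed cutoff in FixMatch-style methods.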