Generative Semi-supervised Learning with Meta-Optimized Synthetic
Samples
- URL: http://arxiv.org/abs/2309.16143v1
- Date: Thu, 28 Sep 2023 03:47:26 GMT
- Title: Generative Semi-supervised Learning with Meta-Optimized Synthetic
Samples
- Authors: Shin'ya Yamaguchi
- Abstract summary: Semi-supervised learning (SSL) is a promising approach for training deep classification models using labeled and unlabeled datasets.
In this paper, we investigate the research question: Can we train SSL models without real unlabeled datasets?
We propose an SSL method using synthetic datasets generated from generative foundation models trained on datasets containing millions of samples in diverse domains.
- Score: 5.384630221560811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semi-supervised learning (SSL) is a promising approach for training deep
classification models using labeled and unlabeled datasets. However, existing
SSL methods rely on a large unlabeled dataset, which may not always be
available in many real-world applications due to legal constraints (e.g.,
GDPR). In this paper, we investigate the research question: Can we train SSL
models without real unlabeled datasets? Instead of using real unlabeled
datasets, we propose an SSL method using synthetic datasets generated from
generative foundation models trained on datasets containing millions of samples
in diverse domains (e.g., ImageNet). Our main idea is to identify synthetic
samples that emulate real unlabeled samples from generative foundation
models and to train classifiers using these synthetic samples. To achieve this,
our method is formulated as an alternating optimization problem: (i)
meta-learning of generative foundation models and (ii) SSL of classifiers using
real labeled and synthetic unlabeled samples. For (i), we propose a
meta-learning objective that optimizes latent variables to generate samples
that resemble real labeled samples and minimize the validation loss. For (ii),
we propose a simple unsupervised loss function that regularizes the feature
extractors of classifiers to maximize the performance improvement obtained from
synthetic samples. We confirm that our method outperforms baselines that apply
generative foundation models to SSL. We also demonstrate that our method
outperforms SSL using real unlabeled datasets in scenarios with extremely small
labeled datasets. This suggests that synthetic samples have the potential to
provide performance gains more efficiently than real unlabeled data.
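
To make the alternating optimization concrete, the following is a minimal PyTorch-style sketch. It assumes a latent-variable generative foundation model exposed as `generator(z)`, approximates classifier training in step (i) with a single differentiable inner update, and uses a perturbation-consistency regularizer as a stand-in for the paper's unsupervised feature loss; all function names and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call


def unsup_loss(classifier, params, x):
    # Stand-in for step (ii): an unsupervised regularizer that keeps the
    # classifier's predictions stable under small input perturbations.
    logp = F.log_softmax(functional_call(classifier, params, (x,)), dim=1)
    noisy = x + 0.05 * torch.randn_like(x)
    p = F.softmax(functional_call(classifier, params, (noisy,)), dim=1)
    return F.kl_div(logp, p.detach(), reduction="batchmean")


def meta_update_latents(generator, classifier, z, lab_x, lab_y, val_x, val_y,
                        inner_lr=1e-2, z_lr=5e-2, match_w=1.0):
    # Step (i): optimize latent variables z so that G(z) (a) resembles the
    # real labeled samples and (b) lowers the validation loss of a classifier
    # adapted on G(z); one differentiable inner step approximates (b).
    z = z.clone().requires_grad_(True)
    synth = generator(z)
    params = dict(classifier.named_parameters())
    inner = (F.cross_entropy(functional_call(classifier, params, (lab_x,)), lab_y)
             + unsup_loss(classifier, params, synth))
    grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
    adapted = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}
    val = F.cross_entropy(functional_call(classifier, adapted, (val_x,)), val_y)
    match = F.mse_loss(synth.mean(dim=0), lab_x.mean(dim=0))  # crude resemblance term
    (g_z,) = torch.autograd.grad(match_w * match + val, z)
    return (z - z_lr * g_z).detach()


def ssl_update_classifier(classifier, optimizer, lab_x, lab_y, synth_x, unsup_w=1.0):
    # Step (ii): SSL update using the real labeled batch plus the synthetic
    # "unlabeled" batch generated from the meta-optimized latents.
    params = dict(classifier.named_parameters())
    loss = (F.cross_entropy(classifier(lab_x), lab_y)
            + unsup_w * unsup_loss(classifier, params, synth_x))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the two functions would be called alternately each round: `meta_update_latents` refreshes the latent codes (and thus the synthetic unlabeled set), and `ssl_update_classifier` then trains the classifier on real labeled plus synthetic unlabeled batches.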
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Self-supervised learning of multi-omics embeddings in the low-label,
high-data regime [0.0]
Contrastive, self-supervised learning (SSL) is used to train a model that predicts cancer type from unimodal, mRNA or RPPA expression data.
A late-fusion model is proposed, where each omics is passed through its own sub-network, the outputs of which are averaged and passed to the pretraining or downstream objective function.
Multi-modal pretraining is shown to improve predictions from a single omics, and we argue that this is useful for datasets with many unlabelled multi-modal samples, but few labelled samples.
arXiv Detail & Related papers (2023-11-16T15:32:22Z) - Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that a gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z) - Towards Automated Imbalanced Learning with Deep Hierarchical
Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z) - Pseudo-Labeling Based Practical Semi-Supervised Meta-Training for Few-Shot Learning [93.63638405586354]
We propose a simple and effective meta-training framework, called pseudo-labeling based meta-learning (PLML).
Firstly, we train a classifier via common semi-supervised learning (SSL) and use it to obtain the pseudo-labels of unlabeled data.
We build few-shot tasks from labeled and pseudo-labeled data and design a novel finetuning method with feature smoothing and noise suppression.
arXiv Detail & Related papers (2022-07-14T10:53:53Z) - Collaborative Intelligence Orchestration: Inconsistency-Based Fusion of
Semi-Supervised Learning and Active Learning [60.26659373318915]
Active learning (AL) and semi-supervised learning (SSL) are two effective, but often isolated, means to alleviate the data-hungry problem.
We propose an innovative inconsistency-based virtual adversarial algorithm to further investigate SSL-AL's potential superiority.
Two real-world case studies visualize the practical industrial value of applying and deploying the proposed data sampling algorithm.
arXiv Detail & Related papers (2022-06-07T13:28:43Z) - Multi-Task Curriculum Framework for Open-Set Semi-Supervised Learning [54.85397562961903]
Semi-supervised learning (SSL) has been proposed to leverage unlabeled data for training powerful models when only limited labeled data is available.
We address a more complex novel scenario named open-set SSL, where out-of-distribution (OOD) samples are contained in unlabeled data.
Our method achieves state-of-the-art results by successfully eliminating the effect of OOD samples.
arXiv Detail & Related papers (2020-07-22T10:33:55Z)