Self-supervised learning of multi-omics embeddings in the low-label,
high-data regime
- URL: http://arxiv.org/abs/2311.09962v1
- Date: Thu, 16 Nov 2023 15:32:22 GMT
- Title: Self-supervised learning of multi-omics embeddings in the low-label,
high-data regime
- Authors: Christian John Hurry, Emma Slade
- Abstract summary: Contrastive, self-supervised learning (SSL) is used to train a model that predicts cancer type from unimodal miRNA, mRNA, or RPPA expression data.
A late-fusion model is proposed, where each omics is passed through its own sub-network, the outputs of which are averaged and passed to the pretraining or downstream objective function.
Multi-modal pretraining is shown to improve predictions from a single omics, and we argue that this is useful for datasets with many unlabelled multi-modal samples, but few labelled samples.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Contrastive, self-supervised learning (SSL) is used to train a model that
predicts cancer type from miRNA, mRNA or RPPA expression data. This model, a
pretrained FT-Transformer, is shown to outperform XGBoost and CatBoost,
standard benchmarks for tabular data, when labelled samples are scarce but the
number of unlabelled samples is high. This is despite the fact that the
datasets we use have $\mathcal{O}(10^{1})$ classes and
$\mathcal{O}(10^{2})-\mathcal{O}(10^{4})$ features. After demonstrating the
efficacy of our chosen method of self-supervised pretraining, we investigate
SSL for multi-modal models. A late-fusion model is proposed, where each omics
is passed through its own sub-network, the outputs of which are averaged and
passed to the pretraining or downstream objective function. Multi-modal
pretraining is shown to improve predictions from a single omics, and we argue
that this is useful for datasets with many unlabelled multi-modal samples, but
few labelled unimodal samples. Additionally, we show that pretraining each
omics-specific module individually is highly effective. This enables the
application of the proposed model in a variety of contexts where a large amount
of unlabelled data is available from each omics but labelled samples are
scarce.
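The abstract names the pretraining method only as contrastive SSL, so the exact objective is not specified here. Below is a minimal sketch assuming a SimCLR-style NT-Xent loss computed over two randomly corrupted views of each expression vector; the function names and the feature-masking corruption are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """SimCLR-style NT-Xent loss for two views z1, z2 of shape (batch, dim)."""
    batch = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, D), unit-norm rows
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # a view is never its own negative
    # The positive for row i is row i + B, and vice versa for the second half.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)]).to(z1.device)
    return F.cross_entropy(sim, targets)


def random_feature_mask(x: torch.Tensor, mask_prob: float = 0.3) -> torch.Tensor:
    """Illustrative corruption: zero out a random subset of features per sample."""
    return x.masked_fill(torch.rand_like(x) < mask_prob, 0.0)


# Hypothetical pretraining step (the paper's encoder is an FT-Transformer;
# any tabular encoder mapping features to an embedding would slot in here):
# z1, z2 = encoder(random_feature_mask(x)), encoder(random_feature_mask(x))
# loss = nt_xent_loss(z1, z2)   # no labels needed at this stage
```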
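The late-fusion model described above can be sketched as one sub-network per omics whose embeddings are averaged before reaching the pretraining or downstream head. The sketch below substitutes small MLP encoders for the paper's FT-Transformer and uses placeholder feature and class counts; all names are hypothetical.

```python
import torch
import torch.nn as nn
from typing import Dict


class OmicsEncoder(nn.Module):
    """Stand-in sub-network for a single omics (the paper uses an FT-Transformer)."""

    def __init__(self, n_features: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class LateFusionModel(nn.Module):
    """Each omics passes through its own encoder; the embeddings are averaged."""

    def __init__(self, feature_dims: Dict[str, int], embed_dim: int = 128, n_classes: int = 10):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: OmicsEncoder(dim, embed_dim) for name, dim in feature_dims.items()}
        )
        self.head = nn.Linear(embed_dim, n_classes)  # downstream cancer-type classifier

    def embed(self, inputs: Dict[str, torch.Tensor]) -> torch.Tensor:
        # Average over whichever omics are present, so a single-omics batch still works.
        zs = [self.encoders[name](x) for name, x in inputs.items()]
        return torch.stack(zs, dim=0).mean(dim=0)

    def forward(self, inputs: Dict[str, torch.Tensor]) -> torch.Tensor:
        return self.head(self.embed(inputs))


# Feature and class counts below are placeholders, not the paper's datasets.
model = LateFusionModel({"mirna": 800, "mrna": 20000, "rppa": 200})
batch = {"mirna": torch.randn(16, 800), "mrna": torch.randn(16, 20000), "rppa": torch.randn(16, 200)}
logits = model(batch)                               # (16, n_classes) from all three omics
rppa_only = model({"rppa": torch.randn(16, 200)})   # unimodal prediction after multi-modal training
```

Because fusion is a plain average rather than a concatenation, each encoder can also be pretrained on its own omics in isolation (for instance with the same NT-Xent-style objective) and dropped into the fused model afterwards, which mirrors the per-module pretraining the abstract reports as effective.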
Related papers
- Exploring Beyond Logits: Hierarchical Dynamic Labeling Based on Embeddings for Semi-Supervised Classification (2024-04-26)
  We propose a Hierarchical Dynamic Labeling (HDL) algorithm that does not depend on model predictions and utilizes image embeddings to generate sample labels. Our approach has the potential to change the paradigm of pseudo-label generation in semi-supervised learning.
- No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance (2024-04-04)
  Multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance. Our study reveals an exponential need for training data, which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
- How Low Can You Go? Surfacing Prototypical In-Distribution Samples for Unsupervised Anomaly Detection (2023-12-06)
  We show that UAD with extremely few training samples can already match, and in some cases even surpass, the performance of training with the whole training dataset. We propose an unsupervised method to reliably identify prototypical samples to further boost UAD performance.
- Generative Semi-supervised Learning with Meta-Optimized Synthetic Samples (2023-09-28)
  Semi-supervised learning (SSL) is a promising approach for training deep classification models using labeled and unlabeled datasets. In this paper, we investigate the research question: can we train SSL models without real unlabeled datasets? We propose an SSL method using synthetic datasets generated from generative foundation models trained on datasets containing millions of samples in diverse domains.
- Learning from aggregated data with a maximum entropy model (2022-10-05)
  We show how a new model, similar to a logistic regression, may be learned from aggregated data only, by approximating the unobserved feature distribution with a maximum entropy hypothesis. We present empirical evidence on several public datasets that the model learned this way can achieve performance comparable to that of a logistic model trained with the full unaggregated data.
- ScatterSample: Diversified Label Sampling for Data Efficient Graph Neural Network Learning (2022-06-09)
  In some graph neural network (GNN) applications, labeling new instances is expensive. We develop a data-efficient active sampling framework, ScatterSample, to train GNNs under an active learning setting. Our experiments on five datasets show that ScatterSample significantly outperforms other GNN active learning baselines.
- A Data Cartography based MixUp for Pre-trained Language Models (2022-05-06)
  MixUp is a data augmentation strategy in which additional samples are generated during training by combining random pairs of training samples and their labels. We propose TDMixUp, a novel MixUp strategy that leverages training dynamics and allows more informative samples to be combined to generate new data samples. We empirically validate that our method not only achieves competitive performance using a smaller subset of the training data compared with strong baselines, but also yields lower expected calibration error for the pre-trained language model BERT in both in-domain and out-of-domain settings across a wide range of NLP tasks.
- Pseudo-Labeled Auto-Curriculum Learning for Semi-Supervised Keypoint Localization (2022-01-21)
  Localizing keypoints of an object is a basic visual problem. Supervised learning of a keypoint localization network often requires a large amount of data. We propose to automatically select reliable pseudo-labeled samples with a series of dynamic thresholds.
- Multi-Task Curriculum Framework for Open-Set Semi-Supervised Learning (2020-07-22)
  Semi-supervised learning (SSL) has been proposed to leverage unlabeled data for training powerful models when only limited labeled data is available. We address a more complex novel scenario named open-set SSL, where out-of-distribution (OOD) samples are contained in the unlabeled data. Our method achieves state-of-the-art results by successfully eliminating the effect of OOD samples.
This list is automatically generated from the titles and abstracts of the papers on this site.