Self-supervised learning on gene expression data
- URL: http://arxiv.org/abs/2507.13912v1
- Date: Fri, 18 Jul 2025 13:43:04 GMT
- Title: Self-supervised learning on gene expression data
- Authors: Kevin Dradjat, Massinissa Hamidi, Pierre Bartet, Blaise Hanczar,
- Abstract summary: Predicting phenotypes from gene expression data is a crucial task in biomedical research, enabling insights into disease mechanisms, drug responses, and personalized medicine.<n>Traditional machine learning and deep learning rely on supervised learning, which requires large quantities of labeled data.<n>Self-supervised learning has emerged as a promising approach to overcome these limitations by extracting information directly from the structure of unlabeled data.
- Score: 3.8623569699070353
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Predicting phenotypes from gene expression data is a crucial task in biomedical research, enabling insights into disease mechanisms, drug responses, and personalized medicine. Traditional machine learning and deep learning rely on supervised learning, which requires large quantities of labeled data that are costly and time-consuming to obtain in the case of gene expression data. Self-supervised learning has recently emerged as a promising approach to overcome these limitations by extracting information directly from the structure of unlabeled data. In this study, we investigate the application of state-of-the-art self-supervised learning methods to bulk gene expression data for phenotype prediction. We selected three self-supervised methods, based on different approaches, to assess their ability to exploit the inherent structure of the data and to generate qualitative representations which can be used for downstream predictive tasks. By using several publicly available gene expression datasets, we demonstrate how the selected methods can effectively capture complex information and improve phenotype prediction accuracy. The results obtained show that self-supervised learning methods can outperform traditional supervised models besides offering significant advantage by reducing the dependency on annotated data. We provide a comprehensive analysis of the performance of each method by highlighting their strengths and limitations. We also provide recommendations for using these methods depending on the case under study. Finally, we outline future research directions to enhance the application of self-supervised learning in the field of gene expression data analysis. This study is the first work that deals with bulk RNA-Seq data and self-supervised learning.
Related papers
- Robust Molecular Property Prediction via Densifying Scarce Labeled Data [51.55434084913129]
In drug discovery, compounds most critical for advancing research often lie beyond the training set.<n>We propose a novel meta-learning-based approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data.<n>We demonstrate significant performance gains on challenging real-world datasets.
arXiv Detail & Related papers (2025-06-13T15:27:40Z) - Multi-dataset and Transfer Learning Using Gene Expression Knowledge Graphs [1.8722948221596285]
Gene expression datasets offer insights into gene regulation mechanisms, biochemical pathways, and cellular functions.<n>Gene expression data can provide valuable insights, but challenges arise because the number of patients in expression datasets is limited.<n>This work proposes a novel methodology to address these challenges by integrating multiple gene expression datasets and domain-specific knowledge.
arXiv Detail & Related papers (2025-03-26T10:23:27Z) - Detecting Dataset Bias in Medical AI: A Generalized and Modality-Agnostic Auditing Framework [8.017827642932746]
Generalized Attribute Utility and Detectability-Induced bias Testing (G-AUDIT) for datasets is a modality-agnostic dataset auditing framework.<n>Our method examines the relationship between task-level annotations and data properties including patient attributes.<n>G-AUDIT successfully identifies subtle biases commonly overlooked by traditional qualitative methods.
arXiv Detail & Related papers (2025-03-13T02:16:48Z) - Meta-Learning on Augmented Gene Expression Profiles for Enhanced Lung Cancer Detection [3.7929238927240685]
We present a meta-learning-based approach for predicting lung cancer from gene expression profiles.
We employ four distinct datasets for the meta-learning tasks, where one as the target dataset and the rest as source datasets.
Results show the superior performance of meta-learning on augmented source data compared to the baselines trained on single datasets.
arXiv Detail & Related papers (2024-08-19T01:39:12Z) - Seeing Unseen: Discover Novel Biomedical Concepts via
Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z) - Towards Biologically Plausible and Private Gene Expression Data
Generation [47.72947816788821]
Generative models trained with Differential Privacy (DP) are becoming increasingly prominent in the creation of synthetic data for downstream applications.
Existing literature, however, primarily focuses on basic benchmarking datasets and tends to report promising results only for elementary metrics and relatively simple data distributions.
We initiate a systematic analysis of how DP generative models perform in their natural application scenarios, specifically focusing on real-world gene expression data.
arXiv Detail & Related papers (2024-02-07T14:39:11Z) - Drug Synergistic Combinations Predictions via Large-Scale Pre-Training
and Graph Structure Learning [82.93806087715507]
Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation.
Deep learning models have emerged as an efficient way to discover synergistic combinations.
Our framework achieves state-of-the-art results in comparison with other deep learning-based methods.
arXiv Detail & Related papers (2023-01-14T15:07:43Z) - Towards Open-World Feature Extrapolation: An Inductive Graph Learning
Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning.
Our framework contains two modules: 1) a backbone network (e.g., feedforward neural nets) as a lower model takes features as input and outputs predicted labels; 2) a graph neural network as an upper model learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
arXiv Detail & Related papers (2021-10-09T09:02:45Z) - TRAPDOOR: Repurposing backdoors to detect dataset bias in machine
learning-based genomic analysis [15.483078145498085]
Under-representation of groups in datasets can lead to inaccurate predictions for certain groups, which can exacerbate systemic discrimination issues.
We propose TRAPDOOR, a methodology for identification of biased datasets by repurposing a technique that has been mostly proposed for nefarious purposes: Neural network backdoors.
Using a real-world cancer dataset, we analyze the dataset with the bias that already existed towards white individuals and also introduced biases in datasets artificially.
arXiv Detail & Related papers (2021-08-14T17:02:02Z) - Predicting Potential Drug Targets Using Tensor Factorisation and
Knowledge Graph Embeddings [4.415977307120617]
We have developed a new tensor factorisation model to predict potential drug targets (i.e.,genes or proteins) for diseases.
We enriched the data with gene representations learned from a drug discovery-oriented knowledge graph and applied our proposed method to predict the clinical outcomes for unseen target and dis-ease pairs.
arXiv Detail & Related papers (2021-05-20T16:19:00Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.