Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking
- URL: http://arxiv.org/abs/2505.02980v1
- Date: Mon, 05 May 2025 19:17:29 GMT
- Title: Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking
- Authors: Daniela Ruiz, Paula Cardenas, Leonardo Manrique, Daniela Vega, Gabriel Mejia, Pablo Arbelaez
- Abstract summary: We introduce SpaRED, a database comprising 26 public datasets, and SpaCKLE, a state-of-the-art transformer-based gene expression completion model. Our contributions constitute the most comprehensive benchmark of gene expression prediction from histology images to date.
- Score: 1.177642303362119
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Spatial Transcriptomics is a groundbreaking technology that integrates histology images with spatially resolved gene expression profiles. Among the various Spatial Transcriptomics techniques available, Visium has emerged as the most widely adopted. However, its accessibility is limited by high costs, the need for specialized expertise, and slow clinical integration. Additionally, gene capture inefficiencies lead to significant dropout, corrupting acquired data. To address these challenges, the deep learning community has explored the gene expression prediction task directly from histology images. Yet, inconsistencies in datasets, preprocessing, and training protocols hinder fair comparisons between models. To bridge this gap, we introduce SpaRED, a systematically curated database comprising 26 public datasets, providing a standardized resource for model evaluation. We further propose SpaCKLE, a state-of-the-art transformer-based gene expression completion model that reduces mean squared error by over 82.5% compared to existing approaches. Finally, we establish the SpaRED benchmark, evaluating eight state-of-the-art prediction models on both raw and SpaCKLE-completed data, demonstrating SpaCKLE substantially improves the results across all the gene expression prediction models. Altogether, our contributions constitute the most comprehensive benchmark of gene expression prediction from histology images to date and a stepping stone for future research on Spatial Transcriptomics.
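The abstract reports completion quality as a relative reduction in mean squared error. The evaluation protocol is not described on this page, so the following is only a minimal sketch of how such a figure could be computed, assuming the error is measured on entries that were deliberately masked and then filled in. The function names `masked_mse` and `relative_reduction` and the toy arrays are hypothetical illustrations, not part of SpaRED or SpaCKLE.

```python
import numpy as np


def masked_mse(truth, completed, mask):
    """Mean squared error restricted to entries where mask is True.

    Assumption: mask marks the held-out (artificially dropped) values.
    """
    truth = np.asarray(truth, dtype=float)
    completed = np.asarray(completed, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask selects no entries")
    diff = truth[mask] - completed[mask]
    return float(np.mean(diff ** 2))


def relative_reduction(mse_reference, mse_candidate):
    """Percentage by which the candidate lowers MSE relative to the reference."""
    if mse_reference <= 0:
        raise ValueError("reference MSE must be positive")
    return 100.0 * (mse_reference - mse_candidate) / mse_reference


# Toy expression matrix (spots x genes); values are made up for illustration.
truth = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[True, False], [False, True]])   # entries treated as missing

reference = np.array([[3.0, 2.0], [3.0, 2.0]])    # hypothetical baseline completion
candidate = np.array([[2.0, 2.0], [3.0, 3.0]])    # hypothetical improved completion

mse_ref = masked_mse(truth, reference, mask)      # errors of 2 and 2
mse_new = masked_mse(truth, candidate, mask)      # errors of 1 and 1
reduction = relative_reduction(mse_ref, mse_new)
print(f"reference MSE={mse_ref}, candidate MSE={mse_new}, reduction={reduction}%")
```

Restricting the error to masked entries keeps observed values, which any completion method can copy unchanged, from diluting the comparison.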
Related papers
- A Large-Scale Benchmark of Cross-Modal Learning for Histology and Gene Expression in Spatial Transcriptomics [2.3070195554676993]
HESCAPE is a benchmark for evaluating cross-modal contrastive pretraining in spatial transcriptomics. Gene models pretrained on spatial transcriptomics data outperform both those trained without spatial data and simple baseline approaches. We identify batch effects as a key factor that interferes with effective cross-modal alignment.
arXiv Detail & Related papers (2025-08-02T21:11:36Z)
- PhenoKG: Knowledge Graph-Driven Gene Discovery and Patient Insights from Phenotypes Alone [40.61937241424789]
We propose a graph-based approach for predicting causative genes from patient phenotypes, with or without an available list of candidate genes. Our model, combining graph neural networks and transformers, achieves substantial improvements over the current state-of-the-art.
arXiv Detail & Related papers (2025-06-16T05:54:12Z)
- Robust Molecular Property Prediction via Densifying Scarce Labeled Data [51.55434084913129]
In drug discovery, compounds most critical for advancing research often lie beyond the training set. We propose a novel meta-learning-based approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data. We demonstrate significant performance gains on challenging real-world datasets.
arXiv Detail & Related papers (2025-06-13T15:27:40Z)
- Teaching pathology foundation models to accurately predict gene expression with parameter efficient knowledge transfer [1.5416321520529301]
Efficient Knowledge Adaptation (PEKA) is a novel framework that integrates knowledge distillation and structure alignment losses for cross-modal knowledge transfer. We evaluated PEKA for gene expression prediction using multiple spatial transcriptomics datasets.
arXiv Detail & Related papers (2025-04-09T17:24:41Z)
- Continually Evolved Multimodal Foundation Models for Cancer Prognosis [50.43145292874533]
Cancer prognosis is a critical task that involves predicting patient outcomes and survival rates. Previous studies have integrated diverse data modalities, such as clinical notes, medical images, and genomic data, leveraging their complementary information. Existing approaches face two major limitations. First, they struggle to incorporate newly arrived data with varying distributions into training, such as patient records from different hospitals. Second, most multimodal integration methods rely on simplistic concatenation or task-specific pipelines, which fail to capture the complex interdependencies across modalities.
arXiv Detail & Related papers (2025-01-30T06:49:57Z)
- SpaRED benchmark: Enhancing Gene Expression Prediction from Histology Images with Spatial Transcriptomics Completion [2.032350440475489]
We present a systematically curated and processed database collected from 26 public sources.
We also propose a state-of-the-art transformer based completion technique for inferring missing gene expression.
Our contributions constitute the most comprehensive benchmark of gene expression prediction from histology images to date.
arXiv Detail & Related papers (2024-07-17T21:28:20Z)
- Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z)
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
- Genetic InfoMax: Exploring Mutual Information Maximization in High-Dimensional Imaging Genetics Studies [50.11449968854487]
Genome-wide association studies (GWAS) are used to identify relationships between genetic variations and specific traits.
Representation learning for imaging genetics is largely under-explored due to the unique challenges posed by GWAS.
We introduce a trans-modal learning framework Genetic InfoMax (GIM) to address the specific challenges of GWAS.
arXiv Detail & Related papers (2023-09-26T03:59:21Z)
- SEPAL: Spatial Gene Expression Prediction from Local Graphs [1.4523812806185954]
We present SEPAL, a new model for predicting genetic profiles from visual tissue appearance.
Our method exploits the biological biases of the problem by directly supervising relative differences with respect to mean expression.
We propose a novel benchmark that aims to better define the task by following current best practices in transcriptomics.
arXiv Detail & Related papers (2023-09-02T23:24:02Z)
- CausalBench: A Large-scale Benchmark for Network Inference from Single-cell Perturbation Data [61.088705993848606]
We introduce CausalBench, a benchmark suite for evaluating causal inference methods on real-world interventional data.
CausalBench incorporates biologically-motivated performance metrics, including new distribution-based interventional metrics.
arXiv Detail & Related papers (2022-10-31T13:04:07Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.