Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology
- URL: http://arxiv.org/abs/2602.13944v1
- Date: Sun, 15 Feb 2026 00:59:13 GMT
- Title: Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology
- Authors: Minghao Han, Dingkang Yang, Linhao Qu, Zizhi Chen, Gang Li, Han Wang, Jiacong Wang, Lihua Zhang,
- Abstract summary: STAMP is a spatial transcriptomics-augmented multimodal pathology representation learning framework. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance.
- Score: 46.83014413674925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed remarkable progress in multimodal learning within computational pathology. Existing models primarily rely on vision and language modalities; however, language alone lacks molecular specificity and offers limited pathological supervision, leading to representational bottlenecks. In this paper, we propose STAMP, a Spatial Transcriptomics-Augmented Multimodal Pathology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. Incorporating spatial context and multi-scale information further enhances model performance and generalizability. To support this, we constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, and trained a spatially-aware gene encoder on this resource. Leveraging hierarchical multi-scale contrastive alignment and cross-scale patch localization mechanisms, STAMP effectively aligns spatial transcriptomics with pathology images, capturing spatial structure and molecular variation. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance. These results highlight the value and necessity of integrating spatially resolved molecular supervision for advancing multimodal learning in computational pathology. The code is included in the supplementary materials. The pretrained weights and SpaVis-6M are available at: https://github.com/Hanminghao/STAMP.
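The abstract describes hierarchical multi-scale contrastive alignment between pathology image patches and spot-level gene expression, but gives no implementation details. As a rough, illustrative sketch of the general idea only (not STAMP's actual objective), a symmetric InfoNCE loss over paired patch-image and gene-expression embeddings could look like the following; all names, shapes, and the temperature value are hypothetical:

```python
import numpy as np

def info_nce(img_emb, gene_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired image/gene embeddings.

    img_emb, gene_emb: (N, D) arrays; row i of each forms a positive pair,
    all other rows in the batch serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    gene = gene_emb / np.linalg.norm(gene_emb, axis=1, keepdims=True)
    logits = img @ gene.T / temperature  # (N, N) similarity matrix

    def xent(l):
        # cross-entropy with the diagonal (matched pairs) as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # average the image-to-gene and gene-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned embeddings drive the loss toward zero, while mismatched pairs yield a loss near log(N), which is what makes such an objective a usable task-agnostic training signal.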
Related papers
- MMAP: A Multi-Magnification and Prototype-Aware Architecture for Predicting Spatial Gene Expression [1.083137038945176]
Spatial Transcriptomics (ST) enables the measurement of gene expression while preserving spatial information. Recent developments have explored the use of hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) to predict transcriptome-wide gene expression profiles through deep neural networks. However, predicting spatial gene expression from histological images remains a challenging problem due to the significant modality gap between visual features and molecular signals. In this work, we propose a novel framework, MMAP (Multi-MAgnification and Prototype-enhanced architecture), that addresses both challenges simultaneously.
arXiv Detail & Related papers (2025-10-13T12:41:09Z)
- SPADE: Spatial Transcriptomics and Pathology Alignment Using a Mixture of Data Experts for an Expressive Latent Space [19.44068776198238]
We introduce SPADE, a foundation model that integrates histopathology with spatial transcriptomics data to guide image representation learning. SPADE is evaluated on 20 downstream tasks, demonstrating significantly superior few-shot performance compared to baseline models.
arXiv Detail & Related papers (2025-06-27T02:20:51Z)
- MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention [57.044719143401664]
Histopathology and transcriptomics are fundamental modalities in oncology, encapsulating the morphological and molecular aspects of the disease. We present MIRROR, a novel multi-modal representation learning method designed to foster both modality alignment and retention. Extensive evaluations on TCGA cohorts for cancer subtyping and survival analysis highlight MIRROR's superior performance.
arXiv Detail & Related papers (2025-03-01T07:02:30Z)
- Towards Unified Molecule-Enhanced Pathology Image Representation Learning via Integrating Spatial Transcriptomics [10.962389869127875]
We introduce a Unified Molecule-enhanced Pathology Image REpresentation Learning framework (UMPIRE). UMPIRE aims to leverage complementary information from gene expression profiles to guide the multimodal pre-training. We demonstrate that this molecular perspective provides a robust, task-agnostic training signal for learning pathology image embeddings.
arXiv Detail & Related papers (2024-12-01T03:09:52Z)
- Multi-modal Spatial Clustering for Spatial Transcriptomics Utilizing High-resolution Histology Images [1.3124513975412255]
Spatial transcriptomics (ST) enables transcriptome-wide gene expression profiling while preserving spatial context.
Current spatial clustering methods fail to fully integrate high-resolution histology image features with gene expression data.
We propose a novel contrastive learning-based deep learning approach that integrates gene expression data with histology image features.
arXiv Detail & Related papers (2024-10-31T00:32:24Z)
- Multimodal contrastive learning for spatial gene expression prediction using histology images [13.47034080678041]
We propose mclSTExp, a multimodal contrastive learning framework with a Transformer and DenseNet-121 encoder for Spatial Transcriptomics Expression prediction.
mclSTExp has superior performance in predicting spatial gene expression.
It has shown promise in interpreting cancer-specific overexpressed genes, elucidating immune-related genes, and identifying specialized spatial domains annotated by pathologists.
arXiv Detail & Related papers (2024-07-11T06:33:38Z)
- Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective via forward mapping classification (FMC) and reverse mapping regression (RMR).
arXiv Detail & Related papers (2024-05-30T03:15:09Z)
- PCRLv2: A Unified Visual Information Preservation Framework for Self-supervised Pre-training in Medical Image Analysis [56.63327669853693]
We propose to incorporate the task of pixel restoration for explicitly encoding more pixel-level information into high-level semantics.
We also address the preservation of scale information, a powerful tool in aiding image understanding.
The proposed unified SSL framework surpasses its self-supervised counterparts on various tasks.
arXiv Detail & Related papers (2023-01-02T17:47:27Z)
- G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z)
- Reprogramming Language Models for Molecular Representation Learning [65.00999660425731]
We propose Representation Reprogramming via Dictionary Learning (R2DL) for adversarially reprogramming pretrained language models for molecular learning tasks.
The adversarial program learns a linear transformation between a dense source model input space (language data) and a sparse target model input space (e.g., chemical and biological molecule data) using a k-SVD solver.
R2DL matches the baseline established by state-of-the-art toxicity prediction models trained on domain-specific data and outperforms that baseline in a limited training-data setting.
arXiv Detail & Related papers (2020-12-07T05:50:27Z)
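The R2DL abstract mentions a k-SVD solver for mapping a dense source space to a sparse target space, without further detail. As an illustrative sketch of the sparse-coding step such an approach relies on, here is a small NumPy implementation of Orthogonal Matching Pursuit, the greedy sparse solver k-SVD uses internally; the dictionary, shapes, and sparsity level are all hypothetical:

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Orthogonal Matching Pursuit: find sparse x with D @ x ~ y.

    D: (d, m) dictionary whose columns are unit-norm atoms
       (e.g., dense source-model token embeddings).
    y: (d,) target vector (e.g., a molecular token embedding).
    Returns x: (m,) with at most n_nonzero nonzero entries.
    """
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # re-fit a least-squares solution over all selected atoms
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coef
        residual = y - D @ x
    return x
```

Each iteration adds one atom and re-solves the least-squares fit over the whole support, so the residual shrinks monotonically while the code stays sparse.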
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.