HEIST: A Graph Foundation Model for Spatial Transcriptomics and Proteomics Data
- URL: http://arxiv.org/abs/2506.11152v2
- Date: Thu, 25 Sep 2025 19:18:59 GMT
- Title: HEIST: A Graph Foundation Model for Spatial Transcriptomics and Proteomics Data
- Authors: Hiren Madhu, João Felipe Rocha, Tinglin Huang, Siddharth Viswanath, Smita Krishnaswamy, Rex Ying
- Abstract summary: We introduce HEIST, a hierarchical graph transformer foundation model for spatial transcriptomics and proteomics. HEIST is pretrained on 22.3M cells from 124 tissues across 15 organs using spatially-aware contrastive and masked autoencoding objectives.
- Score: 25.915980581662023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Single-cell transcriptomics and proteomics have become a great source for data-driven insights into biology, enabling the use of advanced deep learning methods to understand cellular heterogeneity and gene expression at the single-cell level. With the advent of spatial-omics data, we have the promise of characterizing cells within their tissue context as it provides both spatial coordinates and intra-cellular transcriptional or protein counts. Proteomics offers a complementary view by directly measuring proteins, which are the primary effectors of cellular function and key therapeutic targets. However, existing models either ignore the spatial information or the complex genetic and proteomic programs within cells. Thus, they cannot infer how cell-internal regulation adapts to microenvironmental cues. Furthermore, these models often utilize fixed gene vocabularies, hindering their generalizability to unseen genes. In this paper, we introduce HEIST, a hierarchical graph transformer foundation model for spatial transcriptomics and proteomics. HEIST models tissues as hierarchical graphs. The higher-level graph is a spatial cell graph, and each cell, in turn, is represented by its lower-level gene co-expression network graph. HEIST achieves this by performing both intra-level and cross-level message passing to utilize the hierarchy in its embeddings and can thus generalize to novel data types including spatial proteomics without retraining. HEIST is pretrained on 22.3M cells from 124 tissues across 15 organs using spatially-aware contrastive and masked autoencoding objectives. Unsupervised analysis of HEIST embeddings reveals spatially informed subpopulations missed by prior models. Downstream evaluations demonstrate generalizability to proteomics data and state-of-the-art performance in clinical outcome prediction, cell type annotation, and gene imputation across multiple technologies.
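To make the hierarchy described in the abstract more concrete, here is a minimal NumPy sketch of a two-level tissue graph with one round of intra-level and cross-level message passing. It is not the authors' implementation. The k-nearest-neighbor cell graph, the correlation-thresholded co-expression graph, the mean aggregation, the single gene graph shared by all cells, and every function name below are assumptions chosen for illustration only.

```python
import numpy as np


def knn_cell_graph(coords, k):
    """Symmetric kNN adjacency over cells, built from spatial coordinates."""
    n = coords.shape[0]
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # a cell is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :k]    # indices of the k nearest cells
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), nbrs.ravel()] = True
    return adj | adj.T                        # make edges undirected


def coexpression_graph(expr, threshold):
    """Gene-gene adjacency from absolute Pearson correlation across cells.

    expr has shape (cells, genes). Assumption: one graph shared by all cells.
    """
    corr = np.corrcoef(expr.T)
    adj = np.abs(corr) >= threshold
    np.fill_diagonal(adj, False)
    return adj


def mean_aggregate(adj, feats):
    """Average the features of each node's neighbors (zeros if isolated)."""
    deg = adj.sum(axis=1, keepdims=True)
    return (adj.astype(float) @ feats) / np.maximum(deg, 1)


def hierarchical_step(cell_adj, gene_adj, gene_feats):
    """One illustrative round of intra-level and cross-level message passing.

    gene_feats has shape (cells, genes, dim).
    """
    # Intra-level (lower): genes exchange messages inside every cell.
    gene_msg = np.stack([mean_aggregate(gene_adj, g) for g in gene_feats])
    # Cross-level (upward): pool each cell's gene nodes into a cell feature.
    cell_feats = gene_msg.mean(axis=1)
    # Intra-level (higher): cells exchange messages with spatial neighbors.
    cell_msg = mean_aggregate(cell_adj, cell_feats)
    # Cross-level (downward): broadcast the cell context back to its genes.
    gene_out = gene_msg + cell_msg[:, None, :]
    return cell_msg, gene_out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coords = rng.random((6, 2))               # toy 2D positions for 6 cells
    expr = rng.random((6, 4))                 # toy expression, 4 genes
    cell_adj = knn_cell_graph(coords, k=2)
    gene_adj = coexpression_graph(expr, threshold=0.3)
    cell_emb, gene_emb = hierarchical_step(cell_adj, gene_adj, expr[:, :, None])
    print(cell_emb.shape, gene_emb.shape)
```

A learned model would replace the plain averages with trainable layers such as attention; the sketch only shows how information can flow within and between the two graph levels.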
Related papers
- Uncovering spatial tissue domains and cell types in spatial omics through cross-scale profiling of cellular and genomic interactions [26.7111709393529]
We present CellScape, a deep learning framework designed to overcome limitations for high-performance spatial transcriptomics analysis. CellScape models cellular interactions in tissue space and genomic relationships among cells, producing comprehensive representations. This technique uncovers biologically informative patterns that improve spatial domain segmentation.
arXiv Detail & Related papers (2026-02-13T06:22:43Z)
- Contrastive Learning Enhances Language Model Based Cell Embeddings for Low-Sample Single Cell Transcriptomics [3.7907528918903797]
Large language models (LLMs) have shown ability in generating rich representations across domains such as natural language processing and generation, computer vision, and multimodal learning. We present a computational framework that integrates single-cell RNA sequencing (scRNA-seq) with LLMs to derive knowledge-informed gene embeddings.
arXiv Detail & Related papers (2025-09-28T00:45:39Z)
- SPATIA: Multimodal Model for Prediction and Generation of Spatial Cell Phenotypes [39.45743286683448]
We introduce SPATIA, a multi-scale generative and predictive model for spatial transcriptomics. SPATIA learns cell-level embeddings by fusing image-derived morphological tokens and transcriptomic vector tokens. We benchmark SPATIA against 13 existing models across 12 individual tasks.
arXiv Detail & Related papers (2025-07-07T06:54:02Z)
- GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations. In this work, we leverage a pre-trained large language model and a DNA sequence model to extract features from gene descriptions and DNA sequence data. We introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z)
- OmniCellTOSG: The First Cell Text-Omic Signaling Graphs Dataset for Joint LLM and GNN Modeling [14.455616582960557]
We introduce OmniCellTOSG, the first dataset of cell text-omic signaling graphs (TOSGs). Each TOSG represents the signaling network of an individual or meta-cell and is labeled with information such as organ, disease, sex, age, and cell subtype. The dataset is continuously expanding and will be updated regularly.
arXiv Detail & Related papers (2025-04-02T21:47:58Z)
- A scalable gene network model of regulatory dynamics in single cells [88.48246132084441]
We introduce a Functional Learnable model of Cell dynamicS, FLeCS, that incorporates gene network structure into coupled differential equations to model gene regulatory functions. Given (pseudo)time-series single-cell data, FLeCS accurately infers cell dynamics at scale.
arXiv Detail & Related papers (2025-03-25T19:19:21Z)
- HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification [0.19791587637442667]
This study introduces a novel single-stage approach for generating image-label pairs to augment histology datasets. Unlike state-of-the-art methods that utilize diffusion models with separate components for label and image generation, our approach employs a latent diffusion model. This model enables tailored data generation by conditioning on user-defined parameters such as cell types, quantities, and tissue types.
arXiv Detail & Related papers (2025-02-12T19:51:41Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- Cell-ontology guided transcriptome foundation model [18.51941953027685]
We pre-trained scCello on 22 million cells from the CellxGene database, leveraging their cell-type labels mapped to the cell ontology graph from the Open Biological and Biomedical Ontology Foundry. Our TFM demonstrates competitive generalization and transferability performance over the existing TFMs on biologically important tasks.
arXiv Detail & Related papers (2024-08-22T13:15:49Z)
- Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen [76.02070962797794]
This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data. CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics.
arXiv Detail & Related papers (2024-07-16T14:05:03Z)
- scBiGNN: Bilevel Graph Representation Learning for Cell Type Classification from Single-cell RNA Sequencing Data [62.87454293046843]
Graph neural networks (GNNs) have been widely used for automatic cell type classification.
scBiGNN comprises two GNN modules to identify cell types.
scBiGNN outperforms a variety of existing methods for cell type classification from scRNA-seq data.
arXiv Detail & Related papers (2023-12-16T03:54:26Z)
- Single-Cell Deep Clustering Method Assisted by Exogenous Gene Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z)
- Revealing Cortical Layers In Histological Brain Images With Self-Supervised Graph Convolutional Networks Applied To Cell-Graphs [0.20971479389679332]
We introduce a self-supervised approach to detect layers in 2D Nissl-stained histological slices of the cerebral cortex.
A self-supervised graph convolutional network generates cell embeddings that encode morphological and structural traits of the cellular environment.
arXiv Detail & Related papers (2023-11-26T10:33:36Z)
- Tertiary Lymphoid Structures Generation through Graph-based Diffusion [54.37503714313661]
In this work, we leverage state-of-the-art graph-based diffusion models to generate biologically meaningful cell-graphs.
We show that the adopted graph diffusion model is able to accurately learn the distribution of cells in terms of their tertiary lymphoid structures (TLS) content.
arXiv Detail & Related papers (2023-10-10T14:37:17Z)
- Topology-Guided Multi-Class Cell Context Generation for Digital Pathology [28.43244574309888]
We introduce several mathematical tools from spatial statistics and topological data analysis.
We generate high quality multi-class cell layouts for the first time.
We show that the topology-rich cell layouts can be used for data augmentation and improve the performance of downstream tasks such as cell classification.
arXiv Detail & Related papers (2023-04-05T07:01:34Z)
- Self-Supervised Graph Representation Learning for Neuronal Morphologies [75.38832711445421]
We present GraphDINO, a data-driven approach to learn low-dimensional representations of 3D neuronal morphologies from unlabeled datasets.
We show, in two different species and across multiple brain areas, that this method yields morphological cell type clusterings on par with manual feature-based classification by experts.
Our method could potentially enable data-driven discovery of novel morphological features and cell types in large-scale datasets.
arXiv Detail & Related papers (2021-12-23T12:17:47Z)