What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses
- URL: http://arxiv.org/abs/2602.22289v1
- Date: Wed, 25 Feb 2026 14:33:24 GMT
- Title: What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses
- Authors: Ihor Kendiukhov
- Abstract summary: We propose an AI-driven executor-brainstormer loop that proposed, tested, and refined 141 geometric and topological hypotheses across 52 iterations. Gene embedding neighborhoods exhibit non-trivial topology, with persistent homology significant in 11 of 12 transformer layers. CCA alignment between scGPT and Geneformer yields a canonical correlation of 0.80 and gene retrieval accuracy of 72 percent, yet none of 19 tested methods reliably recovers gene-level correspondences.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When biological foundation models such as scGPT and Geneformer process single-cell gene expression, what geometric and topological structure forms in their internal representations? Is that structure biologically meaningful or a training artifact, and how confident should we be in such claims? We address these questions through autonomous large-scale hypothesis screening: an AI-driven executor-brainstormer loop that proposed, tested, and refined 141 geometric and topological hypotheses across 52 iterations, covering persistent homology, manifold distances, cross-model alignment, community structure, and directed topology, all with explicit null controls and disjoint gene-pool splits. Three principal findings emerge. First, the models learn genuine geometric structure. Gene embedding neighborhoods exhibit non-trivial topology, with persistent homology significant in 11 of 12 transformer layers at p < 0.05 in the weakest domain and 12 of 12 in the other two. A multi-level distance hierarchy shows that manifold-aware metrics outperform Euclidean distance for identifying regulatory gene pairs, and graph community partitions track known transcription factor target relationships. Second, this structure is shared across independently trained models. CCA alignment between scGPT and Geneformer yields a canonical correlation of 0.80 and gene retrieval accuracy of 72 percent, yet none of 19 tested methods reliably recovers gene-level correspondences. The models agree on the global shape of gene space but not on precise gene placement. Third, the structure is more localized than it first appears. Under stringent null controls applied across all null families, robust signal concentrates in immune tissue, while lung and external lung signals weaken substantially.
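The abstract's two headline measurements, 0-dimensional persistent homology of embedding neighborhoods and CCA alignment between independently trained models, can be sketched on toy data. This is a minimal illustration, not the paper's pipeline: the embeddings, dimensions, and regularization constant below are invented for the demo, and H0 persistence is computed via the standard minimum-spanning-tree shortcut rather than a full persistent-homology library.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def h0_persistence_deaths(points):
    """Finite H0 death times of a Vietoris-Rips filtration.

    Every point is born at scale 0; connected components merge exactly at
    the minimum-spanning-tree edge weights, so the sorted MST weights are
    the finite H0 death times (one component never dies).
    """
    dist = squareform(pdist(points))
    mst = minimum_spanning_tree(dist)  # dense input: n-1 MST edges
    return np.sort(mst.data)

def canonical_correlations(X, Y, reg=1e-3):
    """Canonical correlations between two embeddings of the same genes.

    Rows of X and Y index the same genes; columns are model-specific
    dimensions. Returns the singular values of the whitened
    cross-covariance, i.e. the CCA correlation spectrum (descending).
    """
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))  # whitening transforms
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    return np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)

rng = np.random.default_rng(0)

# Two tight clusters: one H0 feature should persist far longer than the rest.
cloud = np.vstack([rng.normal(0.0, 0.05, size=(20, 2)),
                   rng.normal(5.0, 0.05, size=(20, 2))])
deaths = h0_persistence_deaths(cloud)
print(deaths[-1], deaths[-2])  # large inter-cluster death vs. small noise deaths

# Two "models" embedding the same latent gene structure in different bases:
# leading canonical correlations come out high, as in the scGPT/Geneformer case.
Z = rng.normal(size=(200, 8))                        # shared latent structure
X = Z @ rng.normal(size=(8, 32)) + 0.1 * rng.normal(size=(200, 32))
Y = Z @ rng.normal(size=(8, 16)) + 0.1 * rng.normal(size=(200, 16))
cc = canonical_correlations(X, Y)
print(cc[:3])
```

The screening loop described above would run many such statistics against permutation nulls; here the point is only that a long-lived H0 bar signals cluster structure, and a high leading canonical correlation signals shared global geometry even when per-gene matching fails.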
Related papers
- Sparse autoencoders reveal organized biological knowledge but minimal regulatory logic in single-cell foundation models: a comparative atlas of Geneformer and scGPT
Single-cell foundation models Geneformer and scGPT encode rich biological information. We trained TopK SAEs on residual stream activations from all layers of Geneformer V2-316M and scGPT whole-human. We release both feature atlases as interactive web platforms enabling exploration of more than 107,000 features across 30 layers of two leading single-cell foundation models.
arXiv Detail & Related papers (2026-03-03T13:05:11Z)
- Causal Circuit Tracing Reveals Distinct Computational Architectures in Single-Cell Foundation Models: Inhibitory Dominance, Biological Coherence, and Cross-Model Convergence
We introduce causal circuit tracing by ablating SAE features and measuring downstream responses. We apply it to Geneformer V2-316M and scGPT whole-human across four conditions.
arXiv Detail & Related papers (2026-03-02T11:21:44Z)
- Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations
Single-cell foundation models such as scGPT learn high-dimensional gene representations, but what biological knowledge these representations encode remains unclear. We systematically decode the geometric structure of scGPT internal representations through 63 iterations of automated hypothesis screening. Results indicate that biological transformers learn an interpretable internal model of cellular organization.
arXiv Detail & Related papers (2026-02-24T17:57:59Z)
- Spatially Gene Expression Prediction using Dual-Scale Contrastive Learning
NH2ST integrates spatial context and both pathology and gene modalities for gene expression prediction. Our model consistently outperforms existing methods, achieving over 20% improvement in PCC metrics.
arXiv Detail & Related papers (2025-06-30T13:18:39Z)
- Multi-omic Causal Discovery using Genotypes and Gene Expression
We introduce GENESIS, a constraint-based causal algorithm to infer ancestral relationships in transcriptomic data. By integrating genotypes as fixed causal anchors, GENESIS provides a principled "head start" to classical causal discovery algorithms. This framework offers a powerful avenue for uncovering causal pathways in complex traits, with promising applications to functional genomics, drug discovery, and precision medicine.
arXiv Detail & Related papers (2025-05-21T11:52:23Z)
- GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype
Building gene regulatory networks (GRNs) is essential to understand and predict the effects of genetic perturbations. In this work, we leverage a pre-trained large language model and a DNA sequence model to extract features from gene descriptions and DNA sequence data. We introduce gene biotype information for the first time in genetic perturbation modeling, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z)
- GENERator: A Long-Context Generative Genomic Foundation Model
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- PhyloGen: Language Model-Enhanced Phylogenetic Inference via Graph Structure Generation
Phylogenetic trees elucidate evolutionary relationships among species. Traditional Markov Chain Monte Carlo methods face slow convergence and computational burdens. We propose PhyloGen, a novel method leveraging a pre-trained genomic language model.
arXiv Detail & Related papers (2024-12-25T08:33:05Z)
- $\Gamma$-VAE: Curvature regularized variational autoencoders for uncovering emergent low dimensional geometric structure in high dimensional data
Natural systems with emergent behaviors often organize along low-dimensional subsets of high-dimensional spaces.
We show that regularizing the curvature of generative models will enable more consistent, predictive, and generalizable models.
arXiv Detail & Related papers (2024-03-02T03:26:09Z)
- PhyloGFN: Phylogenetic inference with generative flow networks
We introduce the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference.
Because GFlowNets are well-suited for sampling complex structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies.
We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets.
arXiv Detail & Related papers (2023-10-12T23:46:08Z)
- DynGFN: Towards Bayesian Inference of Gene Regulatory Networks with GFlowNets
Gene regulatory networks (GRNs) describe interactions between genes and their products that control gene expression and cellular function. Existing methods focus either on (1) identifying cyclic structure from dynamics or on (2) learning complex Bayesian posteriors over DAGs, but not both.
In this paper we leverage the fact that it is possible to estimate the "velocity" of gene expression with RNA velocity techniques to develop an approach that addresses both challenges.
arXiv Detail & Related papers (2023-02-08T16:36:40Z)
- Unsupervised ensemble-based phenotyping helps enhance the discoverability of genes related to heart morphology
We propose a new framework for gene discovery entitled Unsupervised Phenotype Ensembles. It builds a redundant yet highly expressive representation by pooling a set of phenotypes learned in an unsupervised manner. These phenotypes are then analyzed via genome-wide association studies (GWAS), retaining only highly confident and stable associations.
arXiv Detail & Related papers (2023-01-07T18:36:44Z)
- An Integrated Deep Learning and Dynamic Programming Method for Predicting Tumor Suppressor Genes, Oncogenes, and Fusion from PDB Structures
Mutations in proto-oncogenes (ONGO) and the loss of regulatory function of tumor suppressor genes (TSG) are the common underlying mechanisms of uncontrolled tumor growth. Determining computationally whether a gene's function relates to ONGO or TSG can help develop drugs that target the disease.
This paper proposes a classification method that starts with a preprocessing stage to extract the feature map sets from the input 3D protein structural information.
arXiv Detail & Related papers (2021-05-17T18:18:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.