Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations
- URL: http://arxiv.org/abs/2602.22247v1
- Date: Tue, 24 Feb 2026 17:57:59 GMT
- Title: Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations
- Authors: Ihor Kendiukhov
- Abstract summary: Single-cell foundation models such as scGPT learn high-dimensional gene representations, but what biological knowledge these representations encode remains unclear. We systematically decode the geometric structure of scGPT internal representations through 63 iterations of automated hypothesis screening. Results indicate that biological transformers learn an interpretable internal model of cellular organization.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Single-cell foundation models such as scGPT learn high-dimensional gene representations, but what biological knowledge these representations encode remains unclear. We systematically decode the geometric structure of scGPT internal representations through 63 iterations of automated hypothesis screening (183 hypotheses tested), revealing that the model organizes genes into a structured biological coordinate system rather than an opaque feature space. The dominant spectral axis separates genes by subcellular localization, with secreted proteins at one pole and cytosolic proteins at the other. Intermediate transformer layers transiently encode mitochondrial and ER compartments in a sequence that mirrors the cellular secretory pathway. Orthogonal axes encode protein-protein interaction networks with graded fidelity to experimentally measured interaction strength (Spearman rho = 1.000 across n = 5 STRING confidence quintiles, p = 0.017). In a compact six-dimensional spectral subspace, the model distinguishes transcription factors from their target genes (AUROC = 0.744, all 12 layers significant). Early layers preserve which specific genes regulate which targets, while deeper layers compress this into a coarser regulator versus regulated distinction. Repression edges are geometrically more prominent than activation edges, and B-cell master regulators BATF and BACH2 show convergence toward the B-cell identity anchor PAX5 across transformer depth. Cell-type marker genes cluster with high fidelity (AUROC = 0.851). Residual-stream geometry encodes biological structure complementary to attention patterns. These results indicate that biological transformers learn an interpretable internal model of cellular organization, with implications for regulatory network inference, drug target prioritization, and model auditing.
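To make the probing setup concrete, the following minimal sketch (not the authors' released code) illustrates the two core analyses described above: extracting a low-dimensional spectral subspace from per-layer gene embeddings and probing it for biological labels. The `gene_embeddings` matrix and the `is_secreted` / `is_transcription_factor` arrays are hypothetical placeholders standing in for real scGPT representations and gene annotations.

```python
"""Minimal sketch of spectral-subspace probing, under stated assumptions:
`gene_embeddings` would in practice be one layer's residual-stream gene
vectors from scGPT, and the label arrays would come from curated databases
(e.g. localization and transcription-factor annotations)."""
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_genes, d = 2000, 512
gene_embeddings = rng.normal(size=(n_genes, d))            # placeholder embeddings
is_secreted = rng.integers(0, 2, n_genes).astype(bool)      # placeholder labels
is_transcription_factor = rng.integers(0, 2, n_genes).astype(bool)

# 1) Spectral axes: leading principal components of the gene-embedding matrix.
#    The paper reports that the dominant axis separates secreted from cytosolic
#    genes; here we simply score how well coordinate 1 ranks secreted genes.
pca = PCA(n_components=6)
spectral_coords = pca.fit_transform(gene_embeddings)        # (n_genes, 6) subspace
auroc_localization = roc_auc_score(is_secreted, spectral_coords[:, 0])
print(f"axis 1 vs. secreted-protein label: AUROC = {auroc_localization:.3f}")

# 2) Probe the six-dimensional spectral subspace for transcription factors
#    versus target genes, analogous to the reported AUROC = 0.744 analysis.
probe = LogisticRegression(max_iter=1000)
scores = cross_val_predict(probe, spectral_coords, is_transcription_factor,
                           cv=5, method="predict_proba")[:, 1]
auroc_tf = roc_auc_score(is_transcription_factor, scores)
print(f"6-D spectral probe, TF vs. target: AUROC = {auroc_tf:.3f}")
```

With random placeholders both AUROC values hover near 0.5; the point of the sketch is only the shape of the pipeline: per-layer embeddings, a small spectral subspace, then a supervised probe against a biological annotation.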
Related papers
- Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimer's Network [20.37811669228711]
Prioritizing disease-associated genes is central to understanding complex disorders such as Alzheimer's disease. We propose NETRA, a multimodal graph transformer framework that replaces centrality metrics with attention-driven relevance scoring. A graph transformer assigns NETRA scores that quantify gene relevance in a disease-specific and context-aware manner.
arXiv Detail & Related papers (2026-03-01T06:46:18Z) - What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses [0.0]
We propose an AI-driven brainstormer loop that proposed, tested, and refined 141 geometric and topological hypotheses across 52 iterations. Gene embedding neighborhoods exhibit non-trivial topology, with persistent homology significant in 11 of 12 transformer layers. CCA alignment between scGPT and Geneformer yields a canonical correlation of 0.80 and gene retrieval accuracy of 72 percent, yet none of 19 tested methods reliably recover gene-level correspondences.
arXiv Detail & Related papers (2026-02-25T14:33:24Z) - STRAND: Sequence-Conditioned Transport for Single-Cell Perturbations [31.08466183513241]
STRAND is a generative model that predicts single-cell responses by conditioning on regulatory DNA sequence. Representing perturbations by sequence, rather than by a fixed set of gene identifiers, supports zero-shot inference at loci not seen during training. We evaluate STRAND on CRISPR perturbation datasets in K562, Jurkat, and RPE1 cells.
arXiv Detail & Related papers (2026-02-10T00:57:38Z) - Central Dogma Transformer: Towards Mechanism-Oriented AI for Cellular Understanding [0.0]
We present the Central Dogma Transformer (CDT), an architecture that integrates pre-trained language models for DNA, RNA, and protein. We validate CDT v1 on CRISPRi enhancer perturbation data from K562 cells, achieving a Pearson correlation of 0.503. These results suggest that AI architectures aligned with biological information flow can achieve both predictive accuracy and mechanistic interpretability.
arXiv Detail & Related papers (2026-01-03T06:29:22Z) - Conditional Morphogenesis: Emergent Generation of Structural Digits via Neural Cellular Automata [0.0]
We propose a novel Conditional Neural Cellular Automata architecture capable of growing distinct topological structures from a single generic seed. By injecting a one-hot condition into the cellular perception field, a single set of local rules can learn to break symmetry and self-assemble into ten distinct geometric attractors.
arXiv Detail & Related papers (2025-12-09T08:36:54Z) - Tensor Network based Gene Regulatory Network Inference for Single-Cell Transcriptomic Data [0.0]
This study introduces a quantum-inspired framework leveraging tensor networks (TNs) to optimally map expression data. We quantify gene dependencies and establish statistical significance via permutation testing. By merging quantum physics inspired techniques with computational biology, our method provides novel insights into gene regulation.
arXiv Detail & Related papers (2025-09-08T17:11:12Z) - UniGenX: a unified generative foundation model that couples sequence, structure and function to accelerate scientific design across proteins, molecules and materials [62.72989417755985]
We present UniGenX, a unified generative model for function in natural systems. UniGenX represents heterogeneous inputs as a mixed stream of symbolic and numeric tokens. It achieves state-of-the-art or competitive performance on function-aware generation across domains.
arXiv Detail & Related papers (2025-03-09T16:43:07Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Single-Cell Deep Clustering Method Assisted by Exogenous Gene Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z) - Granger causal inference on DAGs identifies genomic loci regulating transcription [77.58911272503771]
GrID-Net is a framework based on graph neural networks with lagged message passing for Granger causal inference on DAG-structured systems.
Our application is the analysis of single-cell multimodal data to identify genomic loci that mediate the regulation of specific genes.
arXiv Detail & Related papers (2022-10-18T21:15:10Z) - Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z) - Growing Isotropic Neural Cellular Automata [63.91346650159648]
We argue that the original Growing NCA model has an important limitation: anisotropy of the learned update rule.
We demonstrate that cell systems can be trained to grow accurate asymmetrical patterns through either of two methods.
arXiv Detail & Related papers (2022-05-03T11:34:22Z) - Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.