Central Dogma Transformer II: An AI Microscope for Understanding Cellular Regulatory Mechanisms
- URL: http://arxiv.org/abs/2602.08751v2
- Date: Thu, 12 Feb 2026 16:40:32 GMT
- Title: Central Dogma Transformer II: An AI Microscope for Understanding Cellular Regulatory Mechanisms
- Authors: Nobuyuki Ota
- Abstract summary: We present CDT-II, an "AI microscope" whose attention maps are directly interpretable as regulatory structure. By mirroring the central dogma in its architecture, CDT-II ensures that each attention mechanism corresponds to a specific biological relationship. Applied to K562 CRISPRi data, CDT-II predicts perturbation effects and recovers the GFI1B regulatory network without supervision.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current biological AI models lack interpretability -- their internal representations do not correspond to biological relationships that researchers can examine. Here we present CDT-II, an "AI microscope" whose attention maps are directly interpretable as regulatory structure. By mirroring the central dogma in its architecture, CDT-II ensures that each attention mechanism corresponds to a specific biological relationship: DNA self-attention for genomic relationships, RNA self-attention for gene co-regulation, and DNA-to-RNA cross-attention for transcriptional control. Using only genomic embeddings and raw per-cell expression, CDT-II enables experimental biologists to observe regulatory networks in their own data. Applied to K562 CRISPRi data, CDT-II predicts perturbation effects (per-gene mean $r = 0.84$) and recovers the GFI1B regulatory network without supervision (6.6-fold enrichment, $P = 3.5 \times 10^{-17}$). Systematic comparison against ENCODE K562 regulatory annotations reveals that cross-attention autonomously focuses on known regulatory elements -- DNase hypersensitive sites ($201\times$ enrichment), CTCF binding sites ($28\times$), and histone marks -- across all five held-out genes. Two distinct attention mechanisms independently identify an overlapping RNA processing module (80% gene overlap; RNA binding enrichment $P = 1 \times 10^{-16}$). CDT-II establishes mechanism-oriented AI as an alternative to task-oriented approaches, revealing regulatory structure rather than merely optimizing predictions.
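The three attention mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. This is not the CDT-II implementation: the token counts, the embedding size, the random stand-in embeddings, the omission of learned projections, and the choice of RNA tokens as queries over DNA keys and values are all assumptions made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Plain scaled dot-product attention without learned projections.
    # Returns the attended output and the attention map (rows sum to 1).
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
n_dna, n_rna, d_model = 6, 4, 8            # hypothetical sizes
dna = rng.normal(size=(n_dna, d_model))    # stand-in for genomic embeddings
rna = rng.normal(size=(n_rna, d_model))    # stand-in for per-gene expression embeddings

# DNA self-attention: relationships among genomic positions.
dna_ctx, dna_self_map = attention(dna, dna, dna)
# RNA self-attention: relationships among genes (co-regulation).
rna_ctx, rna_self_map = attention(rna, rna, rna)
# DNA-to-RNA cross-attention (assumed direction: RNA queries, DNA keys/values).
rna_out, cross_map = attention(rna_ctx, dna_ctx, dna_ctx)

print(dna_self_map.shape, rna_self_map.shape, cross_map.shape)
```

In this sketch each row of `cross_map` is one gene's distribution over DNA positions, which is the kind of map one would compare against regulatory annotations.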
Related papers
- Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimer's Network [20.37811669228711]
Prioritizing disease-associated genes is central to understanding complex disorders such as Alzheimer's disease. We propose NETRA, a multimodal graph transformer framework that replaces centrality metrics with attention-driven relevance scoring. A graph transformer assigns NETRA scores that quantify gene relevance in a disease-specific and context-aware manner.
arXiv Detail & Related papers (2026-03-01T06:46:18Z)
- Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations [0.0]
Single-cell foundation models such as scGPT learn high-dimensional gene representations, but what biological knowledge these representations encode remains unclear. We systematically decode the geometric structure of scGPT internal representations through 63 iterations of automated hypothesis screening. Results indicate that biological transformers learn an interpretable internal model of cellular organization.
arXiv Detail & Related papers (2026-02-24T17:57:59Z)
- Systematic Evaluation of Single-Cell Foundation Model Interpretability Reveals Attention Captures Co-Expression Rather Than Unique Regulatory Signal [0.0]
We present a framework for assessing mechanistic interpretability in single-cell foundation models. Applying this framework to scGPT and Geneformer, we find that attention patterns encode structured biological information with layer-specific organisation.
arXiv Detail & Related papers (2026-02-19T16:43:12Z)
- STRAND: Sequence-Conditioned Transport for Single-Cell Perturbations [31.08466183513241]
STRAND is a generative model that predicts single-cell responses by conditioning on regulatory DNA sequence. Representing perturbations by sequence, rather than by a fixed set of gene identifiers, supports zero-shot inference at loci not seen during training. We evaluate STRAND on CRISPR perturbation datasets in K562, Jurkat, and RPE1 cells.
arXiv Detail & Related papers (2026-02-10T00:57:38Z)
- Central Dogma Transformer: Towards Mechanism-Oriented AI for Cellular Understanding [0.0]
We present the Central Dogma Transformer (CDT), an architecture that integrates pre-trained language models for DNA, RNA, and protein. We validate CDT v1 on CRISPRi enhancer perturbation data from K562 cells, achieving a Pearson correlation of 0.503. These results suggest that AI architectures aligned with biological information flow can achieve both predictive accuracy and mechanistic interpretability.
arXiv Detail & Related papers (2026-01-03T06:29:22Z)
- TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis [56.9460577864211]
TRIDENT is a cascade generative framework that synthesizes realistic cellular morphology by conditioning on both the perturbation and the corresponding gene expression profile. TRIDENT significantly outperforms state-of-the-art approaches, achieving up to 7-fold improvement with strong generalization to unseen compounds.
arXiv Detail & Related papers (2025-11-23T04:43:27Z)
- Tensor Network based Gene Regulatory Network Inference for Single-Cell Transcriptomic Data [0.0]
This study introduces a quantum-inspired framework leveraging tensor networks (TNs) to optimally map expression data. We quantify gene dependencies and establish statistical significance via permutation testing. By merging quantum physics inspired techniques with computational biology, our method provides novel insights into gene regulation.
arXiv Detail & Related papers (2025-09-08T17:11:12Z)
- A scalable gene network model of regulatory dynamics in single cells [88.48246132084441]
We introduce a Functional Learnable model of Cell dynamicS, FLeCS, that incorporates gene network structure into coupled differential equations to model gene regulatory functions. Given (pseudo)time-series single-cell data, FLeCS accurately infers cell dynamics at scale.
arXiv Detail & Related papers (2025-03-25T19:19:21Z)
- Regulatory DNA sequence Design with Reinforcement Learning [56.20290878358356]
We propose a generative approach that leverages reinforcement learning to fine-tune a pre-trained autoregressive model. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types.
arXiv Detail & Related papers (2025-03-11T02:33:33Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- CRISPR-GPT for Agentic Automation of Gene-editing Experiments [57.10950429181712]
Large Language Models (LLMs) have shown promise in various tasks, but they often lack specific knowledge and struggle to accurately solve biological design problems. In this work, we introduce CRISPR-GPT, an LLM agent augmented with domain knowledge and external tools to automate and enhance the design process of CRISPR-based gene-editing experiments. We showcase the potential of CRISPR-GPT for assisting non-expert researchers with gene-editing experiments from scratch and validate the agent's effectiveness in a real-world use case.
arXiv Detail & Related papers (2024-04-27T22:59:17Z)
- scCDCG: Efficient Deep Structural Clustering for single-cell RNA-seq via Deep Cut-informed Graph Embedding [23.163052968111103]
scCDCG (single-cell RNA-seq Clustering via Deep Cut-informed Graph) is a novel framework designed for efficient and accurate clustering of scRNA-seq data. scCDCG comprises three main components: (i) A graph embedding module utilizing deep cut-informed techniques, which effectively captures intercellular high-order structural information. (ii) A self-supervised learning module guided by optimal transport, tailored to accommodate the unique complexities of scRNA-seq data.
arXiv Detail & Related papers (2024-04-09T09:46:17Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
- An Integrated Deep Learning and Dynamic Programming Method for Predicting Tumor Suppressor Genes, Oncogenes, and Fusion from PDB Structures [0.0]
Mutations in proto-oncogenes (ONGO) and the loss of regulatory function of tumor suppressor genes (TSG) are common underlying mechanisms of uncontrolled tumor growth.
Computationally identifying whether a gene's function relates to ONGO or TSG can help develop drugs that target the disease.
This paper proposes a classification method that starts with a preprocessing stage to extract the feature map sets from the input 3D protein structural information.
arXiv Detail & Related papers (2021-05-17T18:18:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.