Beyond Independent Genes: Learning Module-Inductive Representations for Gene Perturbation Prediction
- URL: http://arxiv.org/abs/2602.04901v1
- Date: Tue, 03 Feb 2026 16:43:40 GMT
- Title: Beyond Independent Genes: Learning Module-Inductive Representations for Gene Perturbation Prediction
- Authors: Jiafa Ruan, Ruijie Quan, Zongxin Yang, Liyang Xu, Yi Yang
- Abstract summary: scBIG is a module-inductive prediction framework that explicitly models coordinated gene programs. scBIG consistently outperforms state-of-the-art methods, particularly in unseen and combinatorial perturbation settings.
- Score: 48.80217316452559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predicting transcriptional responses to genetic perturbations is a central problem in functional genomics. In practice, perturbation responses are rarely gene-independent but instead manifest as coordinated, program-level transcriptional changes among functionally related genes. However, most existing methods do not explicitly model such coordination, due to gene-wise modeling paradigms and reliance on static biological priors that cannot capture dynamic program reorganization. To address these limitations, we propose scBIG, a module-inductive perturbation prediction framework that explicitly models coordinated gene programs. scBIG induces coherent gene programs from data via Gene-Relation Clustering, captures inter-program interactions through a Gene-Cluster-Aware Encoder, and preserves modular coordination using structure-aware alignment objectives. These structured representations are then modeled using conditional flow matching to enable flexible and generalizable perturbation prediction. Extensive experiments on multiple single-cell perturbation benchmarks show that scBIG consistently outperforms state-of-the-art methods, particularly on unseen and combinatorial perturbation settings, achieving an average improvement of 6.7% over the strongest baselines.
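The abstract describes training a conditional flow matching model on structured gene-program representations. As a rough illustration of the conditional flow matching objective itself (not the paper's actual architecture), the sketch below regresses a toy linear velocity model onto the straight-line vector field between noise and perturbation-conditioned target states; all dimensions, names, and the linear model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: target "expression" states depend linearly on a
# perturbation condition embedding (purely illustrative data).
n, d, dc = 256, 8, 4
cond = rng.normal(size=(n, dc))                      # perturbation conditions
W_true = rng.normal(size=(dc, d))
x1 = cond @ W_true + 0.1 * rng.normal(size=(n, d))   # target states

# Linear velocity model: v_hat = [x_t, t, cond] @ W
W = np.zeros((d + 1 + dc, d))
lr = 0.05

def cfm_loss_and_grad(W):
    """One stochastic conditional flow matching step."""
    x0 = rng.normal(size=(n, d))       # noise samples
    t = rng.uniform(size=(n, 1))       # random interpolation times
    xt = (1 - t) * x0 + t * x1         # linear probability path
    v_target = x1 - x0                 # target conditional vector field
    feats = np.concatenate([xt, t, cond], axis=1)
    err = feats @ W - v_target
    loss = np.mean(err ** 2)
    grad = 2 * feats.T @ err / (n * d)
    return loss, grad

first_loss, _ = cfm_loss_and_grad(W)
for _ in range(500):
    loss, grad = cfm_loss_and_grad(W)
    W -= lr * grad
```

At sampling time, one would integrate the learned velocity field from noise to a predicted response for a given condition; here the linear model is only a stand-in for the structured encoder the paper describes.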
Related papers
- STRAND: Sequence-Conditioned Transport for Single-Cell Perturbations [31.08466183513241]
STRAND is a generative model that predicts single-cell responses by conditioning on regulatory DNA sequence. Representing perturbations by sequence, rather than by a fixed set of gene identifiers, supports zero-shot inference at loci not seen during training. We evaluate STRAND on CRISPR perturbation datasets in K562, Jurkat, and RPE1 cells.
arXiv Detail & Related papers (2026-02-10T00:57:38Z) - Scalable Single-Cell Gene Expression Generation with Latent Diffusion Models [11.343106383645441]
We introduce a scalable latent diffusion model for single-cell gene expression data, which we refer to as scLDM. We show its superior performance in a variety of experiments for both observational and perturbational single-cell data, as well as downstream tasks like cell-level classification.
arXiv Detail & Related papers (2025-11-04T20:44:12Z) - GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations. In this work, we leverage pre-trained large language models and DNA sequence models to extract features from gene descriptions and DNA sequence data. We introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z) - A scalable gene network model of regulatory dynamics in single cells [88.48246132084441]
We introduce a Functional Learnable model of Cell dynamicS, FLeCS, that incorporates gene network structure into coupled differential equations to model gene regulatory functions. Given (pseudo)time-series single-cell data, FLeCS accurately infers cell dynamics at scale.
arXiv Detail & Related papers (2025-03-25T19:19:21Z) - CausalGeD: Blending Causality and Diffusion for Spatial Gene Expression Generation [11.664880068737084]
We present CausalGeD, which combines diffusion and autoregressive processes to leverage causal relationships between genes. Across 10 tissue datasets, CausalGeD outperformed state-of-the-art baselines by 5-32% in key metrics.
arXiv Detail & Related papers (2025-02-11T18:26:22Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Gene Regulatory Network Inference in the Presence of Selection Bias and Latent Confounders [16.010515254476626]
We show that selection is ubiquitous and, when ignored or conflated with true regulations, can lead to flawed causal interpretation and misguided intervention recommendations. We propose GISL (Gene regulatory network Inference in the presence of Selection bias and Latent confounders), a principled algorithm that leverages perturbation data to uncover true gene regulatory relations and non-regulatory mechanisms of selection and confounding up to the equivalence class.
arXiv Detail & Related papers (2025-01-17T11:27:58Z) - Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z) - Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performances.
BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules.
BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z) - FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics [46.189419603576084]
FGBERT is a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware tokenizer. It demonstrates superior performance on metagenomic datasets at four levels, spanning gene, functional, bacterial, and environmental levels.
arXiv Detail & Related papers (2024-02-24T13:13:17Z) - Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.