GenePheno: Interpretable Gene Knockout-Induced Phenotype Abnormality Prediction from Gene Sequences
- URL: http://arxiv.org/abs/2511.09512v2
- Date: Fri, 14 Nov 2025 20:42:29 GMT
- Title: GenePheno: Interpretable Gene Knockout-Induced Phenotype Abnormality Prediction from Gene Sequences
- Authors: Jingquan Yan, Yuwei Miao, Lei Yu, Yuzhi Guo, Xue Xiao, Lin Xu, Junzhou Huang
- Abstract summary: We introduce GenePheno, the first interpretable multi-label prediction framework that predicts knockout induced phenotypic abnormalities from gene sequences. GenePheno employs a contrastive multi-label learning objective that captures inter-phenotype correlations, complemented by an exclusive regularization that enforces biological consistency. GenePheno achieves state-of-the-art gene-centric $F_{\text{max}}$ and phenotype-centric AUC, and case studies demonstrate its ability to reveal gene functional mechanisms.
- Score: 23.32906953921921
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exploring how genetic sequences shape phenotypes is a fundamental challenge in biology and a key step toward scalable, hypothesis-driven experimentation. The task is complicated by the large modality gap between sequences and phenotypes, as well as the pleiotropic nature of gene-phenotype relationships. Existing sequence-based efforts focus on the degree to which variants of specific genes alter a limited set of phenotypes, while general gene knockout induced phenotype abnormality prediction methods heavily rely on curated genetic information as inputs, which limits scalability and generalizability. As a result, the task of broadly predicting the presence of multiple phenotype abnormalities under gene knockout directly from gene sequences remains underexplored. We introduce GenePheno, the first interpretable multi-label prediction framework that predicts knockout induced phenotypic abnormalities from gene sequences. GenePheno employs a contrastive multi-label learning objective that captures inter-phenotype correlations, complemented by an exclusive regularization that enforces biological consistency. It further incorporates a gene function bottleneck layer, offering human interpretable concepts that reflect functional mechanisms behind phenotype formation. To support progress in this area, we curate four datasets with canonical gene sequences as input and multi-label phenotypic abnormalities induced by gene knockouts as targets. Across these datasets, GenePheno achieves state-of-the-art gene-centric $F_{\text{max}}$ and phenotype-centric AUC, and case studies demonstrate its ability to reveal gene functional mechanisms.
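The gene-centric $F_{\text{max}}$ reported in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the exact definition, so the code below assumes the convention commonly used in multi-label function prediction benchmarks: at each threshold, precision is averaged over genes with at least one predicted phenotype, recall is averaged over all genes with at least one true phenotype, and the best harmonic mean across thresholds is kept. The function name `gene_centric_fmax` and the threshold grid are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np


def gene_centric_fmax(y_true, y_score, thresholds=None):
    """Gene-centric F_max under an assumed CAFA-style definition.

    y_true:  (n_genes, n_phenotypes) binary label matrix.
    y_score: (n_genes, n_phenotypes) predicted scores in [0, 1].
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 1.00
        thresholds = np.linspace(0.01, 1.0, 100)

    # Only genes with at least one annotated phenotype are evaluated.
    has_label = y_true.any(axis=1)
    y_true, y_score = y_true[has_label], y_score[has_label]
    if y_true.shape[0] == 0:
        return 0.0

    n_true = y_true.sum(axis=1)
    best = 0.0
    for t in thresholds:
        y_pred = y_score >= t
        tp = (y_pred & y_true).sum(axis=1)
        n_pred = y_pred.sum(axis=1)

        # Precision is averaged only over genes with at least one prediction.
        covered = n_pred > 0
        if not covered.any():
            continue
        precision = (tp[covered] / n_pred[covered]).mean()

        # Recall is averaged over all evaluated genes.
        recall = (tp / n_true).mean()

        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return float(best)
```

Because the score is a maximum over thresholds, it rewards a model whose ranking of phenotypes per gene is good even if its raw scores are poorly calibrated. The phenotype-centric AUC mentioned alongside it is a separate metric and is not covered by this sketch.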
Related papers
- Beyond Independent Genes: Learning Module-Inductive Representations for Gene Perturbation Prediction [48.80217316452559]
scBIG is a module-inductive prediction framework that explicitly models coordinated gene programs. scBIG consistently outperforms state-of-the-art methods, particularly on unseen and perturbation settings.
arXiv Detail & Related papers (2026-02-03T16:43:40Z)
- GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRNs) is essential to understand and predict the effects of genetic perturbations. In this work, we leverage a pre-trained large language model and a DNA sequence model to extract features from gene descriptions and DNA sequence data. We introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z)
- G2PDiffusion: Cross-Species Genotype-to-Phenotype Prediction via Evolutionary Diffusion [108.94237816552024]
We propose the first genotype-to-phenotype diffusion model (G2PDiffusion) that generates morphological images from DNA. The model contains three novel components: 1) an MSA retrieval engine that identifies conserved and co-evolutionary patterns; 2) an environment-aware MSA conditional encoder that effectively models complex genotype-environment interactions; and 3) an adaptive phenomic alignment module to improve genotype-phenotype consistency.
arXiv Detail & Related papers (2025-02-07T06:16:31Z)
- GeneQuery: A General QA-based Framework for Spatial Gene Expression Predictions from Histology Images [41.732831871866516]
Whole-slide hematoxylin and eosin stained histological images are readily accessible and allow for detailed examinations of tissue structure and composition at the microscopic level. Recent advancements have utilized these histological images to predict spatially resolved gene expression profiles. GeneQuery aims to solve this gene expression prediction task in a question-answering (QA) manner for better generality and flexibility.
arXiv Detail & Related papers (2024-11-27T14:33:13Z)
- Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performances.
BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules.
BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z)
- On The Nature Of The Phenotype In Tree Genetic Programming [3.8642945120580703]
We discuss the basic concepts of genotypes and phenotypes in tree-based GP (TGP).
We then analyze their behavior using five benchmark datasets.
To generate phenotypes, we provide a unique technique for removing semantically ineffective code from GP trees.
arXiv Detail & Related papers (2024-02-12T19:19:29Z)
- Unsupervised ensemble-based phenotyping helps enhance the discoverability of genes related to heart morphology [57.25098075813054]
We propose a new framework for gene discovery entitled Unsupervised Phenotype Ensembles (UPE).
It builds a redundant yet highly expressive representation by pooling a set of phenotypes learned in an unsupervised manner.
These phenotypes are then analyzed via genome-wide association studies (GWAS), retaining only highly confident and stable associations.
arXiv Detail & Related papers (2023-01-07T18:36:44Z)
- rfPhen2Gen: A machine learning based association study of brain imaging phenotypes to genotypes [71.1144397510333]
We trained machine learning models to predict SNPs using 56 brain imaging quantitative traits (QTs).
SNPs within the known Alzheimer disease (AD) risk gene APOE had the lowest RMSE for lasso and random forest.
Random forests identified additional SNPs that were not prioritized by the linear models but are known to be associated with brain-related disorders.
arXiv Detail & Related papers (2022-03-31T20:15:22Z)
- Handling highly correlated genes in prediction analysis of genomic studies [0.0]
High correlation among genes introduces technical problems, such as multi-collinearity issues, leading to unreliable prediction models.
We propose a grouping algorithm, which treats highly correlated genes as a group and uses their common pattern to represent the group's biological signal in feature selection.
Our proposed grouping method has two advantages. First, using the gene group's common patterns makes the prediction more robust and reliable under condition change.
arXiv Detail & Related papers (2020-07-05T22:14:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.