Deep Generative Models for Discrete Genotype Simulation
- URL: http://arxiv.org/abs/2508.09212v1
- Date: Mon, 11 Aug 2025 11:56:03 GMT
- Title: Deep Generative Models for Discrete Genotype Simulation
- Authors: Sihan Xie, Thierry Tribout, Didier Boichard, Blaise Hanczar, Julien Chiquet, Eric Barrey,
- Abstract summary: This study explores generating genotype data in both unconditioned and phenotype-conditioned settings.<n>We developed and evaluated commonly used generative models, including Variational Autoencoders (VAEs), Diffusion Models, and Generative Adversarial Networks (GANs)<n>Our results show that these models can effectively capture genetic patterns and preserve genotype-phenotype association.
- Score: 2.1880699665839396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep generative models open new avenues for simulating realistic genomic data while preserving privacy and addressing data accessibility constraints. While previous studies have primarily focused on generating gene expression or haplotype data, this study explores generating genotype data in both unconditioned and phenotype-conditioned settings, which is inherently more challenging due to the discrete nature of genotype data. In this work, we developed and evaluated commonly used generative models, including Variational Autoencoders (VAEs), Diffusion Models, and Generative Adversarial Networks (GANs), and proposed adaptation tailored to discrete genotype data. We conducted extensive experiments on large-scale datasets, including all chromosomes from cow and multiple chromosomes from human. Model performance was assessed using a well-established set of metrics drawn from both deep learning and quantitative genetics literature. Our results show that these models can effectively capture genetic patterns and preserve genotype-phenotype association. Our findings provide a comprehensive comparison of these models and offer practical guidelines for future research in genotype simulation. We have made our code publicly available at https://github.com/SihanXXX/DiscreteGenoGen.
Related papers
- GeMM-GAN: A Multimodal Generative Model Conditioned on Histopathology Images and Clinical Descriptions for Gene Expression Profile Generation [0.6608945629704325]
We present GeMM-GAN, a novel Generative Adversarial Network conditioned on histopathology tissue slides and clinical metadata, designed to synthesize realistic gene expression profiles.<n>We evaluate our approach on the TCGA dataset and demonstrate that our framework outperforms standard generative models and generates more realistic and functionally meaningful gene expression profiles.
arXiv Detail & Related papers (2026-01-21T19:03:54Z) - Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra [44.253701408005895]
We leverage modern fast linear algebra techniques to develop DeepWAS, a method to train large neural network predictive models to optimize likelihood.<n>We find larger models trained on more features make better predictions, potentially improving disease predictions and therapeutic target identification.
arXiv Detail & Related papers (2025-06-24T13:07:45Z) - GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations.<n>In this work, we leverage pre-trained large language model and DNA sequence model to extract features from gene descriptions and DNA sequence data.<n>We introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z) - Inferring genotype-phenotype maps using attention models [0.21990652930491852]
Predicting phenotype from genotype is a central challenge in genetics.<n>Recent advances in machine learning, particularly attention-based models, offer a promising alternative.<n>Here, we apply attention-based models to quantitative genetics.
arXiv Detail & Related papers (2025-04-14T16:32:17Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.<n>Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.<n>It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - G2PDiffusion: Cross-Species Genotype-to-Phenotype Prediction via Evolutionary Diffusion [108.94237816552024]
We propose the first genotype-to-phenotype diffusion model (G2PDiffusion) that generates morphological images from DNA.<n>The model contains three novel components: 1) a MSA retrieval engine that identifies conserved and co-evolutionary patterns; 2) an environment-aware MSA conditional encoder that effectively models complex genotype-environment interactions; and 3) an adaptive phenomic alignment module to improve genotype-phenotype consistency.
arXiv Detail & Related papers (2025-02-07T06:16:31Z) - Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen [76.02070962797794]
This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data.<n>CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics.
arXiv Detail & Related papers (2024-07-16T14:05:03Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - A Semi-Supervised Generative Adversarial Network for Prediction of
Genetic Disease Outcomes [0.0]
We introduce genetic Generative Adversarial Networks (gGAN) to create large synthetic genetic data sets.
Our goal is to determine the propensity of a new individual to develop the severe form of the illness from their genetic profile alone.
The proposed model is self-aware and capable of determining whether a new genetic profile has enough compatibility with the data on which the network was trained.
arXiv Detail & Related papers (2020-07-02T15:35:14Z) - A Sparse Graph-Structured Lasso Mixed Model for Genetic Association with
Confounding Correction [28.364820868064893]
We propose the sparse graph-structured linear mixed model (sGLMM) that can incorporate the relatedness information from traits in a dataset with confounding correction.
We show that the proposed model outperforms other existing approaches and can model correlation from both population structure and shared signals.
We also discuss the potential causal genetic variation of Human Alzheimer's disease discovered by our model and justify some of the most important genetic loci.
arXiv Detail & Related papers (2017-11-11T16:01:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.