Latent Diffusion Model for DNA Sequence Generation
- URL: http://arxiv.org/abs/2310.06150v2
- Date: Sun, 24 Dec 2023 23:14:35 GMT
- Title: Latent Diffusion Model for DNA Sequence Generation
- Authors: Zehui Li, Yuhao Ni, Tim August B. Huygelen, Akashaditya Das, Guoxuan Xia, Guy-Bart Stan, Yiren Zhao
- Abstract summary: We propose a novel latent diffusion model, DiscDiff, tailored for discrete DNA sequence generation.
By simply embedding discrete DNA sequences into a continuous latent space using an autoencoder, we are able to leverage the powerful generative abilities of continuous diffusion models for the generation of discrete data.
We contribute a comprehensive cross-species dataset of 150K unique promoter-gene sequences from 15 species, enriching resources for future generative modelling in genomics.
- Score: 5.194506374366898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The harnessing of machine learning, especially deep generative models, has
opened up promising avenues in the field of synthetic DNA sequence generation.
Whilst Generative Adversarial Networks (GANs) have gained traction for this
application, they often face issues such as limited sample diversity and mode
collapse. On the other hand, Diffusion Models are a promising new class of
generative models that are not burdened with these problems, enabling them to
reach the state-of-the-art in domains such as image generation. In light of
this, we propose a novel latent diffusion model, DiscDiff, tailored for
discrete DNA sequence generation. By simply embedding discrete DNA sequences
into a continuous latent space using an autoencoder, we are able to leverage
the powerful generative abilities of continuous diffusion models for the
generation of discrete data. Additionally, we introduce Fréchet
Reconstruction Distance (FReD) as a new metric to measure the sample quality of
DNA sequence generations. Our DiscDiff model demonstrates an ability to
generate synthetic DNA sequences that align closely with real DNA in terms of
Motif Distribution, Latent Embedding Distribution (FReD), and Chromatin
Profiles. Additionally, we contribute a comprehensive cross-species dataset of
150K unique promoter-gene sequences from 15 species, enriching resources for
future generative modelling in genomics. We will make our code public upon
publication.
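The abstract's central idea is a round trip: discrete DNA is mapped into a continuous space, a continuous model operates there, and the result is mapped back to nucleotides. The block below is a minimal sketch of that round trip only. It assumes a fixed one-hot encoding and an argmax decoder in place of the learned autoencoder, and small Gaussian noise in place of a diffusion process; the abstract does not specify the architecture, and the names `ALPHABET`, `one_hot_encode`, and `decode` are hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed nucleotide vocabulary (hypothetical constant)


def one_hot_encode(seq: str) -> np.ndarray:
    """Map a DNA string to a (length, 4) float array, one row per base."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq):
        encoded[pos, index[base]] = 1.0
    return encoded


def decode(continuous: np.ndarray) -> str:
    """Map a continuous (length, 4) array back to a DNA string via argmax."""
    return "".join(ALPHABET[i] for i in continuous.argmax(axis=1))


# Stand-in for the continuous stage: perturb the representation slightly.
# This is NOT a diffusion model, only a placeholder for continuous-space work.
rng = np.random.default_rng(0)
x = one_hot_encode("ACGTTGCA")
noisy = x + 0.1 * rng.standard_normal(x.shape)
recovered = decode(noisy)
```

In a latent diffusion setup, the encoder and decoder would be trained networks and the perturbation would be replaced by the learned denoising process; the sketch only shows why a continuous representation can still be rounded back to a discrete sequence.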
Related papers
- Absorb & Escape: Overcoming Single Model Limitations in Generating Genomic Sequences [4.946462450157714]
We analyze the properties of AutoRegressive (AR) models and Diffusion Models (DMs) in genomic sequence generation.
We propose a post-training sampling method, termed Absorb & Escape (A&E) to perform compositional generation.
Experiment results show A&E outperforms state-of-the-art AR models and DMs in genomic sequence generation.
arXiv Detail & Related papers (2024-10-28T07:00:27Z)
- Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design [56.957070405026194]
We propose an algorithm that enables direct backpropagation of rewards through entire trajectories generated by diffusion models.
DRAKES can generate sequences that are both natural-like and yield high rewards.
arXiv Detail & Related papers (2024-10-17T15:10:13Z)
- Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-Based Decoding [84.3224556294803]
Diffusion models excel at capturing the natural design spaces of images, molecules, DNA, RNA, and protein sequences.
We aim to optimize downstream reward functions while preserving the naturalness of these design spaces.
Our algorithm integrates soft value functions, which look ahead to how intermediate noisy states lead to high rewards in the future.
arXiv Detail & Related papers (2024-08-15T16:47:59Z)
- Generating Multi-Modal and Multi-Attribute Single-Cell Counts with CFGen [76.02070962797794]
We present Cell Flow for Generation, a flow-based conditional generative model for multi-modal single-cell counts.
Our results suggest improved recovery of crucial biological data characteristics while accounting for novel generative tasks.
arXiv Detail & Related papers (2024-07-16T14:05:03Z)
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- DiscDiff: Latent Diffusion Model for DNA Sequence Generation [4.946462450157714]
We introduce DiscDiff, a Latent Diffusion Model tailored for generating discrete DNA sequences, and Absorb-Escape, a post-training algorithm designed to refine these sequences.
EPD-GenDNA is the first comprehensive, multi-species dataset for DNA generation, encompassing 160,000 unique sequences from 15 species.
We hope this study will advance the generative modelling of DNA, with potential implications for gene therapy and protein production.
arXiv Detail & Related papers (2024-02-08T22:06:55Z)
- Dirichlet Diffusion Score Model for Biological Sequence Generation [2.0910267321492926]
Diffusion generative models have achieved considerable success in many applications.
We introduce a diffusion process defined in the probability simplex space with stationary distribution being the Dirichlet distribution.
This makes diffusion in continuous space natural for modeling discrete data.
arXiv Detail & Related papers (2023-05-18T04:24:31Z)
- StyleGenes: Discrete and Efficient Latent Distributions for GANs [149.0290830305808]
We propose a discrete latent distribution for Generative Adversarial Networks (GANs).
Instead of drawing latent vectors from a continuous prior, we sample from a finite set of learnable latents.
We take inspiration from the encoding of information in biological organisms.
arXiv Detail & Related papers (2023-04-30T23:28:46Z)
- A Survey on Generative Diffusion Model [75.93774014861978]
Diffusion models are an emerging class of deep generative models.
They have certain limitations, including a time-consuming iterative generation process and confinement to high-dimensional Euclidean space.
This survey presents a plethora of advanced techniques aimed at enhancing diffusion models.
arXiv Detail & Related papers (2022-09-06T16:56:21Z)
- Conditional Hybrid GAN for Sequence Generation [56.67961004064029]
We propose a novel conditional hybrid GAN (C-Hybrid-GAN) to solve this issue.
We exploit the Gumbel-Softmax technique to approximate the distribution of discrete-valued sequences.
We demonstrate that the proposed C-Hybrid-GAN outperforms the existing methods in context-conditioned discrete-valued sequence generation.
arXiv Detail & Related papers (2020-09-18T03:52:55Z)