Generative Capacity of Probabilistic Protein Sequence Models
- URL: http://arxiv.org/abs/2012.02296v2
- Date: Mon, 15 Mar 2021 21:15:58 GMT
- Title: Generative Capacity of Probabilistic Protein Sequence Models
- Authors: Francisco McGee, Quentin Novinger, Ronald M. Levy, Vincenzo Carnevale, Allan Haldane
- Abstract summary: Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs).
It is currently unclear whether GPSMs can faithfully reproduce the complex multi-residue mutation patterns observed in natural sequences arising due to epistasis.
We develop a set of sequence statistics to assess the "generative capacity" of three GPSMs of recent interest.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Potts models and variational autoencoders (VAEs) have recently gained
popularity as generative protein sequence models (GPSMs) to explore fitness
landscapes and predict the effect of mutations. Despite encouraging results,
quantitative characterization and comparison of GPSM-generated probability
distributions is still lacking. It is currently unclear whether GPSMs can
faithfully reproduce the complex multi-residue mutation patterns observed in
natural sequences arising due to epistasis. We develop a set of sequence
statistics to assess the "generative capacity" of three GPSMs of recent
interest: the pairwise Potts Hamiltonian, the VAE, and the site-independent
model, using natural and synthetic datasets. We show that the generative
capacity of the Potts Hamiltonian model is the largest, in that the higher-order
mutational statistics generated by the model agree with those observed
for natural sequences. In contrast, we show that the VAE's generative capacity
lies between the pairwise Potts and site-independent models. Importantly, our
work measures GPSM generative capacity in terms of higher-order sequence
covariation statistics which we have developed, and provides a new framework
for evaluating and interpreting GPSM accuracy that emphasizes the role of
epistasis.
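The pairwise covariation statistics underlying this kind of analysis can be illustrated with a minimal sketch. The snippet below computes single-site frequencies and pairwise connected correlations C_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b) from an integer-encoded multiple sequence alignment; it is a generic illustration under assumed conventions (q = 21 amino-acid states, rows = sequences), not the authors' actual pipeline or statistics set.

```python
import itertools
import numpy as np

def pairwise_covariances(msa, q=21):
    """Connected pairwise correlations C_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b)
    for an integer-encoded MSA of shape (N sequences, L positions)."""
    N, L = msa.shape
    # single-site frequencies f_i(a)
    f1 = np.zeros((L, q))
    for i in range(L):
        f1[i] = np.bincount(msa[:, i], minlength=q) / N
    # pairwise frequencies f_ij(a,b) and connected correlations
    C = np.zeros((L, L, q, q))
    for i, j in itertools.combinations(range(L), 2):
        f2 = np.zeros((q, q))
        for a, b in zip(msa[:, i], msa[:, j]):
            f2[a, b] += 1.0
        f2 /= N
        C[i, j] = f2 - np.outer(f1[i], f1[j])
    return C
```

Comparing such statistics between natural and model-generated alignments is one way to quantify how much covariation structure a GPSM reproduces; higher-order versions extend the same idea to triplets and beyond.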
Related papers
- Spatial Reasoning with Denoising Models [49.83744014336816]
We introduce a framework to perform reasoning over sets of continuous variables via denoising generative models.
We demonstrate, for the first time, that the order of generation can be successfully predicted by the denoising network itself.
arXiv Detail & Related papers (2025-02-28T14:08:30Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction [88.65168366064061]
We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference.
Our framework leads to a family of three novel objectives that are all simulation-free, and thus scalable.
We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.
arXiv Detail & Related papers (2024-10-10T17:18:30Z)
- Generating Multi-Modal and Multi-Attribute Single-Cell Counts with CFGen [76.02070962797794]
We present Cell Flow for Generation, a flow-based conditional generative model for multi-modal single-cell counts.
Our results suggest improved recovery of crucial biological data characteristics while accounting for novel generative tasks.
arXiv Detail & Related papers (2024-07-16T14:05:03Z)
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- A Poisson-Gamma Dynamic Factor Model with Time-Varying Transition Dynamics [51.147876395589925]
A non-stationary PGDS is proposed to allow the underlying transition matrices to evolve over time.
A fully-conjugate and efficient Gibbs sampler is developed to perform posterior simulation.
Experiments show that, in comparison with related models, the proposed non-stationary PGDS achieves improved predictive performance.
arXiv Detail & Related papers (2024-02-26T04:39:01Z)
- Synthetic location trajectory generation using categorical diffusion models [50.809683239937584]
Diffusion models (DPMs) have rapidly evolved to be one of the predominant generative models for the simulation of synthetic data.
We propose using DPMs for the generation of synthetic individual location trajectories (ILTs) which are sequences of variables representing physical locations visited by individuals.
arXiv Detail & Related papers (2024-02-19T15:57:39Z)
- Mixed Effects Neural ODE: A Variational Approximation for Analyzing the Dynamics of Panel Data [50.23363975709122]
We propose a probabilistic model called ME-NODE to incorporate (fixed + random) mixed effects for analyzing panel data.
We show that our model can be derived using smooth approximations of SDEs provided by the Wong-Zakai theorem.
We then derive Evidence Based Lower Bounds for ME-NODE, and develop (efficient) training algorithms.
arXiv Detail & Related papers (2022-02-18T22:41:51Z)
- Optimal regularizations for data generation with probabilistic graphical models [0.0]
Empirically, well-chosen regularization schemes dramatically improve the quality of the inferred models.
We consider the particular case of L2 and L1 regularizations in the Maximum A Posteriori (MAP) inference of generative pairwise graphical models.
arXiv Detail & Related papers (2021-12-02T14:45:16Z)
- PhyloTransformer: A Discriminative Model for Mutation Prediction Based on a Multi-head Self-attention Mechanism [10.468453827172477]
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused an ongoing pandemic infecting 219 million people as of 10/19/21, with a 3.6% mortality rate.
Here we developed PhyloTransformer, a Transformer-based discriminative model that engages a multi-head self-attention mechanism to model genetic mutations that may lead to viral reproductive advantage.
arXiv Detail & Related papers (2021-11-03T01:30:57Z)
- Anomaly Detection of Time Series with Smoothness-Inducing Sequential Variational Auto-Encoder [59.69303945834122]
We present a Smoothness-Inducing Sequential Variational Auto-Encoder (SISVAE) model for robust estimation and anomaly detection of time series.
Our model parameterizes mean and variance for each time-stamp with flexible neural networks.
We show the effectiveness of our model on both synthetic datasets and public real-world benchmarks.
arXiv Detail & Related papers (2021-02-02T06:15:15Z)
- Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models [2.2559617939136505]
We show that the unconstrained multiple changepoint detection model, with alternative noise assumptions and a suitable setup, reduces the over-dispersion exhibited by count data.
arXiv Detail & Related papers (2020-12-12T16:03:27Z)
- Sparse generative modeling via parameter-reduction of Boltzmann machines: application to protein-sequence families [0.0]
Boltzmann machines (BM) are widely used as generative models.
We introduce a general parameter-reduction procedure for BMs.
For several protein families, our procedure allows one to remove more than 90% of the BM couplings.
arXiv Detail & Related papers (2020-11-23T08:01:09Z)
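The parameter-reduction idea above can be pictured with a much simpler stand-in: pruning a coupling matrix by magnitude, keeping only the largest fraction of entries. This sketch is purely illustrative (the paper's actual decimation procedure is more principled than magnitude thresholding, and all names here are hypothetical):

```python
import numpy as np

def prune_couplings(J, keep_fraction=0.1):
    """Zero out all but the largest-magnitude `keep_fraction` of couplings.
    J: array of coupling parameters; returns a sparsified copy."""
    flat = np.abs(J).ravel()
    k = max(1, int(round(keep_fraction * flat.size)))
    # value of the k-th largest magnitude; entries below it are dropped
    threshold = np.partition(flat, -k)[-k]
    return np.where(np.abs(J) >= threshold, J, 0.0)
```

After pruning, one would refit the remaining parameters and check that the sparsified model still reproduces the target sequence statistics.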
This list is automatically generated from the titles and abstracts of the papers on this site.