Seek and You Shall Fold
- URL: http://arxiv.org/abs/2511.13244v1
- Date: Mon, 17 Nov 2025 11:07:49 GMT
- Title: Seek and You Shall Fold
- Authors: Nadav Bojan Sellam, Meital Bojan, Paul Schanda, Alex Bronstein
- Abstract summary: Most predictors of experimental observables are non-differentiable, making them incompatible with gradient-based conditional sampling. This is especially limiting in nuclear magnetic resonance, where rich data such as chemical shifts are hard to directly integrate into generative modeling. We introduce a framework for non-differentiable guidance of protein generative models, coupling a continuous diffusion-based generator with any black-box objective via a tailored genetic algorithm.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate protein structures are essential for understanding biological function, yet incorporating experimental data into protein generative models remains a major challenge. Most predictors of experimental observables are non-differentiable, making them incompatible with gradient-based conditional sampling. This is especially limiting in nuclear magnetic resonance, where rich data such as chemical shifts are hard to directly integrate into generative modeling. We introduce a framework for non-differentiable guidance of protein generative models, coupling a continuous diffusion-based generator with any black-box objective via a tailored genetic algorithm. We demonstrate its effectiveness across three modalities: pairwise distance constraints, nuclear Overhauser effect restraints, and for the first time chemical shifts. These results establish chemical shift guided structure generation as feasible, expose key weaknesses in current predictors, and showcase a general strategy for incorporating diverse experimental signals. Our work points toward automated, data-conditioned protein modeling beyond the limits of differentiability.
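The abstract describes coupling a continuous generator with a non-differentiable, black-box objective through a genetic algorithm. A minimal sketch of that idea is shown below, assuming a toy step-function objective in place of a real observable predictor (e.g. a chemical-shift predictor) and plain coordinate vectors in place of diffusion latents; all names here are hypothetical and this is not the authors' implementation.

```python
import random

def black_box_score(coords):
    # Toy non-differentiable objective standing in for an experimental
    # predictor: reward coordinates near the origin, scored through a
    # rounding step so no useful gradient exists.
    return -sum(round(abs(x)) for x in coords)

def mutate(candidate, sigma=0.3):
    # Gaussian perturbation, loosely analogous to a noising/denoising
    # step of the generator.
    return [x + random.gauss(0.0, sigma) for x in candidate]

def crossover(a, b):
    # Uniform crossover between two parent candidates.
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

def genetic_guidance(dim=8, pop_size=32, generations=50, elite=4):
    # Population of candidate "structures" (here, plain coordinate
    # vectors; the paper instead searches over diffusion samples).
    pop = [[random.gauss(0.0, 2.0) for _ in range(dim)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Rank by the black-box score and keep the elite unchanged.
        pop.sort(key=black_box_score, reverse=True)
        parents = pop[:elite]
        children = []
        while len(children) < pop_size - elite:
            a, b = random.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        pop = parents + children
    return max(pop, key=black_box_score)

best = genetic_guidance()
```

Because selection only ever consumes scalar scores, any predictor can drive the search, which is the key property the paper exploits for non-differentiable NMR observables.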
Related papers
- Unlocking hidden biomolecular conformational landscapes in diffusion models at inference time [0.828988908878327]
We present ConforMix, an inference-time algorithm that enhances sampling of conformational distributions. Our approach upgrades diffusion models to enable more efficient discovery of conformational variability. Case studies of biologically critical proteins demonstrate the scalability, accuracy, and utility of this method.
arXiv Detail & Related papers (2025-12-02T23:52:05Z)
- Demystify Protein Generation with Hierarchical Conditional Diffusion Models [17.174551222714722]
We propose a novel conditional diffusion model for efficient end-to-end protein design guided by specified functions. By generating representations at different levels simultaneously, our framework can effectively model the inherent hierarchical relations between different levels. We also propose Protein-MMD, a new reliable evaluation metric, to evaluate the quality of generated proteins.
arXiv Detail & Related papers (2025-07-24T17:34:02Z)
- Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges [68.98973318553983]
We propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions. We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way. We also incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles.
arXiv Detail & Related papers (2025-06-26T09:05:38Z)
- NOBLE -- Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models [63.592664795493725]
NOBLE is a neural operator framework that learns a mapping from a continuous frequency-modulated embedding of interpretable neuron features to the somatic voltage response induced by current injection. It predicts distributions of neural dynamics accounting for the intrinsic experimental variability. NOBLE is the first scaled-up deep learning framework that validates its generalization with real experimental data.
arXiv Detail & Related papers (2025-06-05T01:01:18Z)
- UniGenX: a unified generative foundation model that couples sequence, structure and function to accelerate scientific design across proteins, molecules and materials [62.72989417755985]
We present UniGenX, a unified generative model for function in natural systems. UniGenX represents heterogeneous inputs as a mixed stream of symbolic and numeric tokens. It achieves state-of-the-art or competitive performance for function-aware generation across domains.
arXiv Detail & Related papers (2025-03-09T16:43:07Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design [56.957070405026194]
We propose an algorithm that enables direct backpropagation of rewards through entire trajectories generated by diffusion models. DRAKES can generate sequences that are both natural-like and yield high rewards.
arXiv Detail & Related papers (2024-10-17T15:10:13Z)
- Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions [4.36852565205713]
We present OmniBioTE, the largest open-source multi-omic model trained on over 250 billion tokens of mixed protein and nucleic acid data. We show that OmniBioTE achieves state-of-the-art results predicting the change in Gibbs free energy of the binding interaction between a given nucleic acid and protein.
arXiv Detail & Related papers (2024-08-29T03:56:40Z)
- Data-Error Scaling Laws in Machine Learning on Combinatorial Mutation-prone Sets: Proteins and Small Molecules [0.0]
We investigate trends in the data-error scaling laws of machine learning models trained on mutation-prone discrete spaces. In contrast to typical data-error scaling laws, our results showed discontinuous monotonic phase transitions during learning. We present an alternative strategy to normalize learning curves and introduce the concept of mutant-based shuffling.
arXiv Detail & Related papers (2024-05-08T16:04:50Z)
- Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict the thermostability changes in proteins upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
- Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [44.97217246897902]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z)
- A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling.
We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
arXiv Detail & Related papers (2023-05-06T19:10:19Z)
- SESNet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering [6.216757583450049]
We develop SESNet, a supervised deep-learning model to predict the fitness of protein mutants.
We show that SESNet outperforms state-of-the-art models for predicting the sequence-function relationship.
Our model achieves strikingly high accuracy in predicting the fitness of protein mutants, especially for higher-order variants.
arXiv Detail & Related papers (2022-12-29T01:49:52Z)
- Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models [3.5450828190071646]
An important task in bioengineering is designing proteins with specific 3D structures and chemical properties which enable targeted functions.
We introduce a generative model of both protein structure and sequence that can operate at significantly larger scales than previous molecular generative modeling approaches.
arXiv Detail & Related papers (2022-05-26T16:10:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.