AI-based Methods for Simulating, Sampling, and Predicting Protein Ensembles
- URL: http://arxiv.org/abs/2509.17224v1
- Date: Sun, 21 Sep 2025 20:14:45 GMT
- Title: AI-based Methods for Simulating, Sampling, and Predicting Protein Ensembles
- Authors: Bowen Jing, Bonnie Berger, Tommi Jaakkola
- Abstract summary: This review highlights recent research directions towards AI-based predictions of protein ensembles. An emphasis is placed on realistic assessments of the technological maturity of current methods. We advocate for "closing the loop" between model training, simulation, and inference to overcome challenges in training data availability.
- Score: 16.635202322160243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in deep learning have opened an era of abundant and accurate predicted protein structures; however, similar progress in protein ensembles has remained elusive. This review highlights several recent research directions towards AI-based predictions of protein ensembles, including coarse-grained force fields, generative models, multiple sequence alignment perturbation methods, and modeling of ensemble descriptors. An emphasis is placed on realistic assessments of the technological maturity of current methods, the strengths and weaknesses of broad families of techniques, and promising machine learning frameworks at an early stage of development. We advocate for "closing the loop" between model training, simulation, and inference to overcome challenges in training data availability and to enable the next generation of models.
Related papers
- Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles [74.32932832937618]
We introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework.
Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations.
Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions.
arXiv Detail & Related papers (2026-03-02T21:32:30Z)
- Protein Autoregressive Modeling via Multiscale Structure Generation [51.92004892768298]
We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation.
We adopt noisy context learning and scheduled sampling, enabling robust backbone generation.
On the unconditional generation benchmark, PAR effectively learns protein distributions and produces backbones of high design quality.
arXiv Detail & Related papers (2026-02-04T18:59:49Z)
- GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins [0.43981305860983716]
We introduce GeoGraph, a simulation-informed surrogate trained to predict ensemble-averaged statistics of residue-residue contact-map topology directly from sequence.
By featurizing coarse-grained molecular dynamics simulations into residue- and sequence-level graph descriptors, we create a robust and information-rich learning target.
arXiv Detail & Related papers (2025-10-01T11:13:53Z)
- AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model [92.51919604882984]
We introduce AMix-1, a powerful protein foundation model built on Flow Bayesian Networks.
AMix-1 is empowered by a systematic training methodology, encompassing pretraining scaling laws, emergent capability analysis, in-context learning mechanism, and test-time scaling algorithm.
Building on this foundation, we devise a multiple sequence alignment (MSA)-based in-context learning strategy to unify protein design into a general framework.
arXiv Detail & Related papers (2025-07-11T17:02:25Z)
- Advanced Deep Learning Methods for Protein Structure Prediction and Design [28.575821996185024]
We comprehensively explore advanced deep learning methods applied to protein structure prediction and design.
The text analyses key components including structure generation, evaluation metrics, multiple sequence alignment processing, and network architecture.
Strategies for enhancing prediction accuracy and integrating deep learning techniques with experimental validation are thoroughly explored.
arXiv Detail & Related papers (2025-03-14T21:28:29Z)
- Speculative Decoding and Beyond: An In-Depth Survey of Techniques [4.165029665035158]
Sequential dependencies present a fundamental bottleneck in deploying large-scale autoregressive models.
Recent advances in generation-refinement frameworks demonstrate that this trade-off can be significantly mitigated.
arXiv Detail & Related papers (2025-02-27T03:53:45Z)
- Learning conformational ensembles of proteins based on backbone geometry [1.1874952582465603]
We propose a flow matching model for sampling protein conformations based solely on backbone geometry.
The resulting model is orders of magnitude faster than current state-of-the-art approaches at comparable accuracy and can be trained from scratch in a few GPU days.
arXiv Detail & Related papers (2025-02-19T17:16:27Z)
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- Regularized Neural Ensemblers [55.15643209328513]
In this study, we explore employing regularized neural networks as ensemble methods.
Motivated by the risk of learning low-diversity ensembles, we propose regularizing the ensembling model by randomly dropping base model predictions.
We demonstrate this approach provides lower bounds for the diversity within the ensemble, reducing overfitting and improving generalization capabilities.
arXiv Detail & Related papers (2024-10-06T15:25:39Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [74.64101864289572]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Navigating protein landscapes with a machine-learned transferable coarse-grained model [29.252004942896875]
coarse-grained (CG) model with similar prediction performance has been a long-standing challenge.
We develop a bottom-up CG force field with chemical transferability, which can be used for extrapolative molecular dynamics on new sequences.
We demonstrate that the model successfully predicts folded structures, intermediates, metastable folded and unfolded basins, and the fluctuations of intrinsically disordered proteins.
arXiv Detail & Related papers (2023-10-27T17:10:23Z)
- On the Robustness of Random Forest Against Untargeted Data Poisoning: An Ensemble-Based Approach [42.81632484264218]
In machine learning models, perturbations of fractions of the training set (poisoning) can seriously undermine the model accuracy.
This paper aims to implement a novel hash-based ensemble approach that protects random forest against untargeted, random poisoning attacks.
arXiv Detail & Related papers (2022-09-28T11:41:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.