GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins
- URL: http://arxiv.org/abs/2510.00774v1
- Date: Wed, 01 Oct 2025 11:13:53 GMT
- Title: GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins
- Authors: Eoin Quinn, Marco Carobene, Jean Quentin, Sebastien Boyer, Miguel Arbesú, Oliver Bent
- Abstract summary: We introduce GeoGraph, a simulation-informed surrogate trained to predict ensemble-averaged statistics of residue-residue contact-map topology directly from sequence. By featurizing coarse-grained molecular dynamics simulations into residue- and sequence-level graph descriptors, we create a robust and information-rich learning target.
- Score: 0.43981305860983716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While deep learning has revolutionized the prediction of rigid protein structures, modelling the conformational ensembles of Intrinsically Disordered Proteins (IDPs) remains a key frontier. Current AI paradigms present a trade-off: Protein Language Models (PLMs) capture evolutionary statistics but lack explicit physical grounding, while generative models trained to model full ensembles are computationally expensive. In this work we critically assess these limits and propose a path forward. We introduce GeoGraph, a simulation-informed surrogate trained to predict ensemble-averaged statistics of residue-residue contact-map topology directly from sequence. By featurizing coarse-grained molecular dynamics simulations into residue- and sequence-level graph descriptors, we create a robust and information-rich learning target. Our evaluation demonstrates that this approach yields representations that are more predictive of key biophysical properties than existing methods.
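The featurization the abstract describes can be illustrated with a minimal sketch: compute per-frame residue-residue contact maps from coarse-grained coordinates, average them over the ensemble, and reduce the averaged map to residue- and sequence-level graph descriptors. This is not the authors' code; the 8 Å cutoff, the random-walk toy "ensemble", and the specific descriptors are all illustrative assumptions.

```python
# Illustrative sketch (not the GeoGraph implementation): ensemble-averaged
# contact-map statistics from coarse-grained residue coordinates.
import numpy as np

def contact_map(coords, cutoff=8.0):
    """Binary contact map for one frame of (N, 3) residue coordinates."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    cm = (d < cutoff).astype(float)
    np.fill_diagonal(cm, 0.0)  # ignore self-contacts
    return cm

def ensemble_descriptors(frames, cutoff=8.0):
    """Average contact maps over frames, then derive simple graph descriptors."""
    avg = np.mean([contact_map(f, cutoff) for f in frames], axis=0)
    degree = avg.sum(axis=1)  # residue-level: expected number of contacts
    return {
        "per_residue_degree": degree,            # residue-level descriptor
        "mean_degree": float(degree.mean()),     # sequence-level summary
        "contact_density": float(avg.mean()),    # sequence-level summary
    }

# Toy "ensemble": 10 frames of a 50-residue Gaussian-chain random walk.
rng = np.random.default_rng(0)
frames = [np.cumsum(rng.normal(scale=2.0, size=(50, 3)), axis=0)
          for _ in range(10)]
stats = ensemble_descriptors(frames)
```

A surrogate model in the spirit of the paper would then be trained to regress such descriptors directly from sequence, bypassing the simulations at inference time.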
Related papers
- Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles [74.32932832937618]
We introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions.
arXiv Detail & Related papers (2026-03-02T21:32:30Z) - SIGMA: Scalable Spectral Insights for LLM Collapse [51.863164847253366]
We introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework for analyzing model collapse. By deriving deterministic bounds on the Gram matrix's spectrum, SIGMA provides a mathematically grounded metric for tracking the contraction of the representation space. We demonstrate that SIGMA effectively captures the transition toward collapsed states, offering theoretical insights into the mechanics of collapse.
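The core idea of spectral collapse tracking can be sketched with a standard proxy: the entropy-based effective rank of the Gram matrix of representations, which shrinks as the representation space contracts. The specific SIGMA inequalities are not reproduced here; this is a generic illustration under that assumption.

```python
# Illustrative sketch: spectral contraction of a representation space,
# measured via the effective rank of the Gram matrix (not SIGMA's bounds).
import numpy as np

def effective_rank(X):
    """Entropy-based effective rank of the Gram matrix X @ X.T."""
    eig = np.linalg.eigvalsh(X @ X.T)
    eig = np.clip(eig, 0.0, None)        # drop tiny negative round-off
    p = eig / eig.sum()                  # normalized spectrum
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(1)
diverse = rng.normal(size=(64, 16))                              # spread-out embeddings
collapsed = np.outer(rng.normal(size=64), rng.normal(size=16))   # rank-1 embeddings
er_diverse = effective_rank(diverse)
er_collapsed = effective_rank(collapsed)
```

For the collapsed (rank-1) embeddings the effective rank falls to about 1, while the diverse embeddings stay close to their ambient dimension, which is the contraction signature such metrics are designed to expose.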
arXiv Detail & Related papers (2026-01-06T19:47:11Z) - Machine-learning-enabled interpretation of tribological deformation patterns in large-scale MD data [0.0]
Grain-orientation-colored computational tomography images obtained from CuNi alloy simulations were first compressed by an autoencoder into a 32-dimensional global feature vector. The reconstructed images retained the essential microstructural motifs (grain boundaries, stacking faults, twins, and partial lattice rotations) while omitting only the finest defects. A CNN-MLP model trained to predict the dominant deformation pattern achieves approximately 96% accuracy on validation data.
arXiv Detail & Related papers (2025-12-05T15:39:13Z) - AI-based Methods for Simulating, Sampling, and Predicting Protein Ensembles [16.635202322160243]
This review highlights recent research directions toward AI-based prediction of protein ensembles. An emphasis is placed on realistic assessment of the technological maturity of current methods. We advocate for "closing the loop" between model training, simulation, and inference to overcome challenges in training-data availability.
arXiv Detail & Related papers (2025-09-21T20:14:45Z) - GeoRecon: Graph-Level Representation Learning for 3D Molecules via Reconstruction-Based Pretraining [19.398985037052224]
We present GeoRecon, a graph-level pretraining framework for molecular representation learning. During pretraining, the model is trained to generate a graph representation capable of accurately guiding reconstruction of the molecular geometry. GeoRecon outperforms node-centric baselines on multiple molecular benchmarks.
arXiv Detail & Related papers (2025-06-16T07:35:49Z) - Learning conformational ensembles of proteins based on backbone geometry [1.1874952582465603]
We propose a flow matching model for sampling protein conformations based solely on backbone geometry. The resulting model is orders of magnitude faster than current state-of-the-art approaches at comparable accuracy and can be trained from scratch in a few GPU days.
arXiv Detail & Related papers (2025-02-19T17:16:27Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z) - Graph Generation with Diffusion Mixture [57.78958552860948]
Generation of graphs is a major challenge for real-world tasks that require understanding the complex nature of their non-Euclidean structures.
We propose a generative framework that models the topology of graphs by explicitly learning the final graph structures of the diffusion process.
arXiv Detail & Related papers (2023-02-07T17:07:46Z) - Topological structure of complex predictions [15.207535648404765]
Complex prediction models, such as deep neural networks, are the output of fitting machine learning or AI models to a set of training data.
We use topological data analysis to transform these complex prediction models into pictures representing a topological view.
The methods scale up to large datasets across different domains and enable us to detect labeling errors in training data, understand generalization in image classification, and inspect predictions of likely pathogenic mutations in the BRCA1 gene.
arXiv Detail & Related papers (2022-07-28T19:28:05Z) - Regularization of Mixture Models for Robust Principal Graph Learning [0.0]
A regularized version of Mixture Models is proposed to learn a principal graph from a distribution of $D$-dimensional data points.
Parameters of the model are iteratively estimated through an Expectation-Maximization procedure.
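The Expectation-Maximization procedure mentioned above can be sketched for a plain isotropic Gaussian mixture; the paper's regularized, principal-graph-constrained variant is not reproduced here, and the farthest-point initialization is an illustrative choice for stability.

```python
# Minimal EM sketch for an isotropic Gaussian mixture (illustrative only;
# the regularized principal-graph constraints of the paper are omitted).
import numpy as np

def em_gmm(X, k=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Greedy farthest-point initialization of the means
    idx = [int(rng.integers(n))]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None, :] - X[idx]) ** 2).sum(-1), axis=1)
        idx.append(int(np.argmax(d2)))
    mu = X[idx].copy()
    var = np.full(k, X.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior responsibilities under isotropic Gaussians
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)        # (n, k)
        log_r = np.log(pi) - 0.5 * d * np.log(2 * np.pi * var) - sq / (2 * var)
        log_r -= log_r.max(axis=1, keepdims=True)                   # stabilize
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * sq).sum(axis=0) / (nk * d) + 1e-6                # variance floor
    return mu, var, pi

# Two well-separated synthetic clusters in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-5.0, 1.0, (100, 2)),
               rng.normal(5.0, 1.0, (100, 2))])
mu, var, pi = em_gmm(X, k=2)
```

Regularization in the paper's setting would add penalty terms to the M-step so the component means organize along a principal graph rather than moving freely.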
arXiv Detail & Related papers (2021-06-16T18:00:02Z) - Goal-directed Generation of Discrete Structures with Conditional Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short Python expressions which evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.