Pearl: A Foundation Model for Placing Every Atom in the Right Location
- URL: http://arxiv.org/abs/2510.24670v2
- Date: Wed, 29 Oct 2025 14:41:45 GMT
- Title: Pearl: A Foundation Model for Placing Every Atom in the Right Location
- Authors: Genesis Research Team, Alejandro Dobles, Nina Jovic, Kenneth Leidal, Pranav Murugan, David C. Williams, Drausin Wulsin, Nate Gruver, Christina X. Ji, Korrawat Pruegsanusak, Gianluca Scarpellini, Ansh Sharma, Wojciech Swiderski, Andrea Bootsma, Richard Strong Bowen, Charlotte Chen, Jamin Chen, Marc André Dämgen, Benjamin DiFrancesco, J. D. Fishman, Alla Ivanova, Zach Kagin, David Li-Bland, Zuli Liu, Igor Morozov, Jeffrey Ouyang-Zhang, Frank C. Pickard IV, Kushal S. Shah, Ben Shor, Gabriel Monteiro da Silva, Roy Tal, Maxx Tessmer, Carl Tilbury, Cyr Vetcher, Daniel Zeng, Maruan Al-Shedivat, Aleksandra Faust, Evan N. Feinberg, Michael V. LeVine, Matteus Pan,
- Abstract summary: We introduce Pearl, a foundation model for protein-ligand cofolding at scale. Pearl establishes new state-of-the-art performance in protein-ligand cofolding, surpassing AlphaFold 3 and other open-source baselines on the public Runs N' Poses and PoseBusters benchmarks.
- Score: 52.35027831422145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurately predicting the three-dimensional structures of protein-ligand complexes remains a fundamental challenge in computational drug discovery that limits the pace and success of therapeutic design. Deep learning methods have recently shown strong potential as structure prediction tools, achieving promising accuracy across diverse biomolecular systems. However, their performance and utility are constrained by scarce experimental data, inefficient architectures, physically invalid poses, and a limited ability to exploit auxiliary information available at inference. To address these issues, we introduce Pearl (Placing Every Atom in the Right Location), a foundation model for protein-ligand cofolding at scale. Pearl addresses these challenges with three key innovations: (1) training recipes that include large-scale synthetic data to overcome data scarcity; (2) an architecture that incorporates an SO(3)-equivariant diffusion module to inherently respect 3D rotational symmetries, improving generalization and sample efficiency; and (3) controllable inference, including a generalized multi-chain templating system supporting both protein and non-polymeric components as well as dual unconditional/conditional modes. Pearl establishes new state-of-the-art performance in protein-ligand cofolding. On the key metric of generating accurate (RMSD < 2 Å) and physically valid poses, Pearl surpasses AlphaFold 3 and other open-source baselines on the public Runs N' Poses and PoseBusters benchmarks, delivering 14.5% and 14.2% improvements, respectively, over the next best model. In the pocket-conditional cofolding regime, Pearl delivers a $3.6\times$ improvement on a proprietary set of challenging, real-world drug targets at the more rigorous RMSD < 1 Å threshold. Finally, we demonstrate that model performance correlates directly with the size of the synthetic dataset used in training.
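The RMSD < 2 Å criterion used above is the root-mean-square deviation between predicted and reference ligand atom coordinates. A minimal sketch of the bare metric (illustrative only: real benchmarks such as PoseBusters also align the protein frame first and use symmetry-corrected ligand RMSD, which this toy function omits):

```python
import numpy as np

def rmsd(pred: np.ndarray, ref: np.ndarray) -> float:
    """Root-mean-square deviation between two (N, 3) coordinate arrays, in the
    same units as the inputs (angstroms for crystal structures)."""
    assert pred.shape == ref.shape
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))

# Toy example: a predicted pose offset by 1 Å along x from the reference.
ref = np.zeros((4, 3))
pred = ref + np.array([1.0, 0.0, 0.0])
print(rmsd(pred, ref))  # 1.0 -> "accurate" under the 2 Å cutoff
```

A pose passing the 2 Å cutoff can still be physically invalid (clashes, wrong stereochemistry), which is why the benchmarks above pair RMSD with validity checks.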
Related papers
- Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles [74.32932832937618]
We introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions.
arXiv Detail & Related papers (2026-03-02T21:32:30Z)
- GDEPO: Group Dual-dynamic and Equal-right-advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning [14.111530312590531]
Automated Theorem Proving (ATP) represents a fundamental challenge in Artificial Intelligence (AI). We propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO). GDEPO incorporates three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, which decouples the sign of the advantage function from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, which applies extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases.
arXiv Detail & Related papers (2026-01-11T07:34:41Z)
- SPACE: Noise Contrastive Estimation Stabilizes Self-Play Fine-Tuning for Large Language Models [35.53535406831892]
We introduce a novel self-play fine-tuning method, namely Self-PlAy via Noise Contrastive Estimation (SPACE). SPACE treats synthetic samples as auxiliary components, and discriminates them from the real ones in a binary classification manner. We show that SPACE significantly improves the performance of LLMs over various tasks, and outperforms supervised fine-tuning that employs much more real-world samples.
arXiv Detail & Related papers (2025-12-08T05:16:18Z)
- Large-Scale Diverse Synthesis for Mid-Training [15.81154701009597]
BoostQA is a 100B-token large-scale question-answering dataset. We propose a novel diversified pipeline to synthesize BoostQA. Our method enables Llama-3 8B, mid-trained on a 40B-token dataset, to achieve an average improvement of 12.74% on MMLU and CMMLU.
arXiv Detail & Related papers (2025-08-02T11:37:16Z)
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
- PoseX: AI Defeats Physics Approaches on Protein-Ligand Cross Docking [74.76447568426276]
PoseX is an open-source benchmark to evaluate both self-docking and cross-docking. We incorporated 23 docking methods in three methodological categories. We developed a relaxation method for post-processing to minimize conformational energy and refine binding poses.
arXiv Detail & Related papers (2025-05-03T05:35:37Z)
- Rao-Blackwell Gradient Estimators for Equivariant Denoising Diffusion [55.95767828747407]
In domains such as molecular and protein generation, physical systems exhibit inherent symmetries that are critical to model. We present a framework that reduces training variance and provides a provably lower-variance gradient estimator. We also present a practical implementation of this estimator incorporating the loss and sampling procedure through a method we call Orbit Diffusion.
arXiv Detail & Related papers (2025-02-14T03:26:57Z)
- ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by the Kronecker product to Aggregate Low Rank Experts. Thanks to its artful design, ALoRE adds negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
- One-step Structure Prediction and Screening for Protein-Ligand Complexes using Multi-Task Geometric Deep Learning [6.605588716386855]
We show that protein-ligand structure prediction and screening can be accurately tackled with a single model, namely LigPose, based on multi-task geometric deep learning.
LigPose represents the ligand and the protein pair as a graph, with the learning of binding strength and atomic interactions as auxiliary tasks.
Experiments show LigPose achieved state-of-the-art performance on major tasks in drug research.
arXiv Detail & Related papers (2024-08-21T05:53:50Z)
- On Machine Learning Approaches for Protein-Ligand Binding Affinity Prediction [2.874893537471256]
This study evaluates the performance of classical tree-based models and advanced neural networks in protein-ligand binding affinity prediction.
We show that combining 2D and 3D model strengths improves active learning outcomes beyond current state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-15T13:06:00Z)
- Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models [42.16524616409125]
In this work, we show that by pre-training on large-scale generated docking conformations, we can obtain a protein-ligand structure prediction model with outstanding performance.
The proposed model, HelixDock, aims to acquire the physical knowledge encapsulated by the physics-based docking tools during the pre-training phase.
arXiv Detail & Related papers (2023-10-21T05:54:26Z)
- Prediction of SLAM ATE Using an Ensemble Learning Regression Model and 1-D Global Pooling of Data Characterization [3.4399698738841553]
We introduce a novel method for predicting SLAM localization error based on the characterization of raw sensor inputs.
The proposed method relies on using a random forest regression model trained on 1-D global pooled features that are generated from characterized raw sensor data.
The paper also studies the impact of 12 different 1-D global pooling functions on regression quality, quantitatively demonstrating the superiority of 1-D global averaging.
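1-D global pooling collapses a variable-length signal into a fixed-size feature vector, which is what makes raw sensor traces of different lengths usable as inputs to a fixed-width regressor. A minimal sketch with a few common pooling functions (an illustrative subset; the paper's exact set of 12 functions is not reproduced here):

```python
import numpy as np

def global_pool_1d(signal: np.ndarray) -> np.ndarray:
    """Collapse a variable-length 1-D signal into a fixed-size feature vector
    via global pooling functions applied over the whole sequence."""
    return np.array([
        signal.mean(),  # global average pooling (the winner per the abstract)
        signal.max(),   # global max pooling
        signal.min(),   # global min pooling
        signal.std(),   # global standard-deviation pooling
    ])

# Two sensor traces of different lengths map to same-sized feature vectors.
a = global_pool_1d(np.array([1.0, 2.0, 3.0]))
b = global_pool_1d(np.linspace(0.0, 1.0, 500))
print(a.shape, b.shape)  # (4,) (4,)
```

The resulting fixed-width vectors can then be stacked row-wise as the training matrix for an ensemble regressor such as a random forest, as described above.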
arXiv Detail & Related papers (2023-03-01T16:12:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.