Multi-task Bioassay Pre-training for Protein-ligand Binding Affinity
Prediction
- URL: http://arxiv.org/abs/2306.04886v2
- Date: Wed, 20 Dec 2023 11:27:01 GMT
- Title: Multi-task Bioassay Pre-training for Protein-ligand Binding Affinity
Prediction
- Authors: Jiaxian Yan, Zhaofeng Ye, Ziyi Yang, Chengqiang Lu, Shengyu Zhang, Qi
Liu, Jiezhong Qiu
- Abstract summary: We propose Multi-task Bioassay Pre-training (MBP), a pre-training framework for structure-based PLBA prediction.
MBP learns robust and transferrable structural knowledge from our new ChEMBL-Dock dataset with varied and noisy labels.
- Score: 26.530876904939163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Protein-ligand binding affinity (PLBA) prediction is the fundamental task in
drug discovery. Recently, various deep learning-based models have predicted binding
affinity by incorporating the three-dimensional structure of protein-ligand
complexes as input, achieving astounding progress. However, due to the
scarcity of high-quality training data, the generalization ability of current
models is still limited. In addition, different bioassays use varying affinity
measurement labels (i.e., IC50, Ki, Kd), and different experimental conditions
inevitably introduce systematic noise, which poses a significant challenge to
constructing high-precision affinity prediction models. To address these
issues, we (1) propose Multi-task Bioassay Pre-training (MBP), a pre-training
framework for structure-based PLBA prediction; (2) construct a pre-training
dataset called ChEMBL-Dock with more than 300k experimentally measured affinity
labels and about 2.8M docked three-dimensional structures. By introducing
multi-task pre-training to treat the prediction of different affinity labels as
different tasks and classifying relative rankings between samples from the same
bioassay, MBP learns robust and transferrable structural knowledge from our new
ChEMBL-Dock dataset with varied and noisy labels. Experiments substantiate the
capability of MBP as a general framework that can improve and be tailored to
mainstream structure-based PLBA prediction tasks. To the best of our knowledge,
MBP is the first affinity pre-training model and shows great potential for
future development.
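The abstract's ranking idea, classifying relative orderings between samples from the same bioassay so that label noise across assays cancels out, can be sketched as a within-assay pairwise objective. The following is an illustrative sketch only, not MBP's actual implementation; the function name and the choice of a logistic pairwise loss are assumptions.

```python
import math
from itertools import combinations

def pairwise_ranking_loss(predictions, labels):
    """Logistic loss over all pairs of compounds from ONE bioassay.

    predictions: model scores for each compound in the assay.
    labels: measured affinities of the same label type (e.g. all pKi),
            so comparisons are only made within the assay, where
            systematic noise is shared and relative order is reliable.
    """
    losses = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue  # tied labels carry no preference signal
        # sign = +1 if compound i should outrank compound j
        sign = 1.0 if labels[i] > labels[j] else -1.0
        margin = sign * (predictions[i] - predictions[j])
        losses.append(math.log1p(math.exp(-margin)))  # log(1 + e^-m)
    return sum(losses) / len(losses) if losses else 0.0

# Correctly ordered predictions yield a small loss; inverted ones a large loss.
good = pairwise_ranking_loss([3.0, 2.0, 1.0], [9.1, 7.4, 6.2])
bad = pairwise_ranking_loss([1.0, 2.0, 3.0], [9.1, 7.4, 6.2])
```

In a multi-task setup along these lines, each label type (IC50, Ki, Kd) would get its own prediction head, with this ranking term computed per assay alongside the per-task regression losses.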
Related papers
- Investigating Data Pruning for Pretraining Biological Foundation Models at Scale [47.09153330837959]
We propose a post-hoc influence-guided data pruning framework tailored to biological domains.
Our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99 percent.
These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining.
arXiv Detail & Related papers (2025-12-15T02:42:52Z)
- Learning Discrete Bayesian Networks with Hierarchical Dirichlet Shrinkage [52.914168158222765]
We detail a comprehensive Bayesian framework for learning DBNs.
We give a novel Markov chain Monte Carlo (MCMC) algorithm utilizing parallel Langevin proposals to generate exact posterior samples.
We apply our methodology to uncover prognostic network structure from primary breast cancer samples.
arXiv Detail & Related papers (2025-09-16T17:24:35Z)
- AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model [92.51919604882984]
We introduce AMix-1, a powerful protein foundation model built on Flow Bayesian Networks.
AMix-1 is empowered by a systematic training methodology, encompassing pretraining scaling laws, emergent capability analysis, an in-context learning mechanism, and a test-time scaling algorithm.
Building on this foundation, we devise a multiple sequence alignment (MSA)-based in-context learning strategy to unify protein design into a general framework.
arXiv Detail & Related papers (2025-07-11T17:02:25Z)
- DISPROTBENCH: A Disorder-Aware, Task-Rich Benchmark for Evaluating Protein Structure Prediction in Realistic Biological Contexts [76.59606029593085]
DisProtBench is a benchmark for evaluating protein structure prediction models (PSPMs) under structural disorder and complex biological conditions.
DisProtBench spans three key axes: data complexity, task diversity, and interpretability.
Results reveal significant variability in model robustness under disorder, with low-confidence regions linked to functional prediction failures.
arXiv Detail & Related papers (2025-06-18T23:58:22Z)
- A Generalist Cross-Domain Molecular Learning Framework for Structure-Based Drug Discovery [32.573496601865465]
Structure-based drug discovery (SBDD) is a systematic scientific process that develops new drugs by leveraging the detailed physical structure of the target protein.
Recent advancements in pre-trained models for biomolecules have demonstrated remarkable success across various biochemical applications.
arXiv Detail & Related papers (2025-03-06T12:04:56Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training [51.41246396610475]
This paper aims to predict performance in closed-book question answering (QA) without the help of external tools.
We conduct large-scale retrieval and semantic analysis across the pre-training corpora of 21 publicly available and 3 custom-trained large language models.
Building on these foundations, we propose Size-dependent Mutual Information (SMI), an information-theoretic metric that linearly correlates pre-training data characteristics.
arXiv Detail & Related papers (2025-02-06T13:23:53Z)
- Binding Affinity Prediction: From Conventional to Machine Learning-Based Approaches [48.66541987908136]
Much work has been devoted to predicting binding affinity over the past decades.
We note growing use of both traditional machine learning and deep learning models for predicting binding affinity.
With improved predictive performance and the FDA's phasing out of animal testing, AI-driven in silico models, such as AI virtual cells (AIVCs), are poised to advance binding affinity prediction.
arXiv Detail & Related papers (2024-09-30T03:40:49Z)
- Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions [4.36852565205713]
We present OmniBioTE, the largest open-source multi-omic model trained on over 250 billion tokens of mixed protein and nucleic acid data.
We show that OmniBioTE achieves state-of-the-art results predicting the change in Gibbs free energy of the binding interaction between a given nucleic acid and protein.
arXiv Detail & Related papers (2024-08-29T03:56:40Z)
- Autoregressive Enzyme Function Prediction with Multi-scale Multi-modality Fusion [11.278610817877578]
We introduce MAPred, a novel multi-modality and multi-scale model designed to autoregressively predict the EC number of proteins.
MAPred integrates both the primary amino acid sequence and the 3D tokens of proteins, employing a dual-pathway approach to capture comprehensive protein characteristics.
Evaluations on benchmark datasets, including New-392, Price, and New-815, demonstrate that our method outperforms existing models.
arXiv Detail & Related papers (2024-08-11T08:28:43Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning [78.38442423223832]
We develop a novel codebook pre-training task, namely masked microenvironment modeling.
We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
arXiv Detail & Related papers (2024-05-16T03:53:21Z)
- Protein binding affinity prediction under multiple substitutions applying eGNNs on Residue and Atomic graphs combined with Language model information: eGRAL [1.840390797252648]
Deep learning is increasingly recognized as a powerful tool capable of bridging the gap between in-silico predictions and in-vitro observations.
We propose eGRAL, a novel graph neural network architecture designed for predicting binding affinity changes from amino acid substitutions in protein complexes.
eGRAL leverages residue, atomic and evolutionary scales, thanks to features extracted from protein large language models.
arXiv Detail & Related papers (2024-05-03T10:33:19Z)
- Equivariant Pretrained Transformer for Unified Geometric Learning on Multi-Domain 3D Molecules [23.189608074493997]
Equivariant Pretrained Transformer (EPT) is a novel pretraining framework designed to harmonize the geometric learning of small molecules and proteins.
EPT unifies the geometric modeling of multi-domain molecules via the block-enhanced representation that can attend a broader context of each atom.
Another key innovation of EPT is its block-level pretraining task, which allows for joint pretraining on datasets comprising both small molecules and proteins.
arXiv Detail & Related papers (2024-02-20T04:40:00Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Progressive Multi-Modality Learning for Inverse Protein Folding [47.095862120116976]
We propose a novel protein design paradigm called MMDesign, which leverages multi-modality transfer learning.
MMDesign is the first framework that combines a pretrained structural module with a pretrained contextual module, using an auto-encoder (AE) based language model to incorporate prior protein semantic knowledge.
Experimental results, only training with the small dataset, demonstrate that MMDesign consistently outperforms baselines on various public benchmarks.
arXiv Detail & Related papers (2023-12-11T10:59:23Z)
- Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models [42.16524616409125]
In this work, we show that by pre-training on a large-scale docking conformation, we can obtain a protein-ligand structure prediction model with outstanding performance.
The proposed model, HelixDock, aims to acquire the physical knowledge encapsulated by the physics-based docking tools during the pre-training phase.
arXiv Detail & Related papers (2023-10-21T05:54:26Z)
- Geometric Deep Learning for Structure-Based Drug Design: A Survey [83.87489798671155]
Structure-based drug design (SBDD) leverages the three-dimensional geometry of proteins to identify potential drug candidates.
Recent advancements in geometric deep learning, which effectively integrate and process 3D geometric data, have significantly propelled the field forward.
arXiv Detail & Related papers (2023-06-20T14:21:58Z)
- On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training [72.8087629914444]
We study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset.
With the size of the pre-training dataset fixed, the best downstream performance comes with a balance on the intra-/inter-class diversity.
arXiv Detail & Related papers (2023-05-20T16:23:50Z)
- Differentiable Agent-based Epidemiology [71.81552021144589]
We introduce GradABM: a scalable, differentiable design for agent-based modeling that is amenable to gradient-based learning with automatic differentiation.
GradABM can quickly simulate million-size populations in few seconds on commodity hardware, integrate with deep neural networks and ingest heterogeneous data sources.
arXiv Detail & Related papers (2022-07-20T07:32:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.