S-MolSearch: 3D Semi-supervised Contrastive Learning for Bioactive Molecule Search
- URL: http://arxiv.org/abs/2409.07462v2
- Date: Thu, 21 Nov 2024 08:32:27 GMT
- Title: S-MolSearch: 3D Semi-supervised Contrastive Learning for Bioactive Molecule Search
- Authors: Gengmo Zhou, Zhen Wang, Feng Yu, Guolin Ke, Zhewei Wei, Zhifeng Gao,
- Abstract summary: We propose S-MolSearch, the first framework to leverage molecular 3D information and affinity information in semi-supervised contrastive learning.
It efficiently processes both labeled and unlabeled data, training molecular structural encoders while generating soft labels for the unlabeled data.
Empirically, S-MolSearch demonstrates superior performance on widely-used benchmarks LIT-PCBA and DUD-E.
- Score: 30.071862398889774
- License:
- Abstract: Virtual Screening is an essential technique in the early phases of drug discovery, aimed at identifying promising drug candidates from vast molecular libraries. Recently, ligand-based virtual screening has garnered significant attention due to its efficacy in conducting extensive database screenings without relying on specific protein-binding site information. Obtaining binding affinity data for complexes is highly expensive, resulting in a limited amount of available data that covers a relatively small chemical space. Moreover, these datasets contain a significant amount of inconsistent noise. It is challenging to identify an inductive bias that consistently maintains the integrity of molecular activity during data augmentation. To tackle these challenges, we propose S-MolSearch, the first framework to our knowledge, that leverages molecular 3D information and affinity information in semi-supervised contrastive learning for ligand-based virtual screening. Drawing on the principles of inverse optimal transport, S-MolSearch efficiently processes both labeled and unlabeled data, training molecular structural encoders while generating soft labels for the unlabeled data. This design allows S-MolSearch to adaptively utilize unlabeled data within the learning process. Empirically, S-MolSearch demonstrates superior performance on widely-used benchmarks LIT-PCBA and DUD-E. It surpasses both structure-based and ligand-based virtual screening methods for AUROC, BEDROC and EF.
Related papers
- MoleculeCLA: Rethinking Molecular Benchmark via Computational Ligand-Target Binding Analysis [18.940529282539842]
We construct a large-scale and precise molecular representation dataset of approximately 140,000 small molecules.
Our dataset offers significant physicochemical interpretability to guide model development and design.
We believe this dataset will serve as a more accurate and reliable benchmark for molecular representation learning.
arXiv Detail & Related papers (2024-06-13T02:50:23Z) - Data-Efficient Molecular Generation with Hierarchical Textual Inversion [48.816943690420224]
We introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method.
HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution.
Compared to the conventional textual inversion method in the image domain using a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution.
arXiv Detail & Related papers (2024-05-05T08:35:23Z) - DrugCLIP: Contrastive Protein-Molecule Representation Learning for
Virtual Screening [16.31607535765497]
DrugCLIP is a novel contrastive learning framework for virtual screening.
It can align representations of binding protein pockets and molecules from a large quantity of pairwise data without explicit binding-affinity scores.
It significantly outperforms traditional docking and supervised learning methods on diverse virtual screening benchmarks.
arXiv Detail & Related papers (2023-10-10T07:08:35Z) - ChemVise: Maximizing Out-of-Distribution Chemical Detection with the
Novel Application of Zero-Shot Learning [60.02503434201552]
This research proposes learning approximations of complex exposures from training sets of simple ones.
We demonstrate this approach to synthetic sensor responses surprisingly improves the detection of out-of-distribution obscured chemical analytes.
arXiv Detail & Related papers (2023-02-09T20:19:57Z) - DEL-Dock: Molecular Docking-Enabled Modeling of DNA-Encoded Libraries [1.290382979353427]
We introduce a new paradigm, DEL-Dock, that combines ligand-based descriptors with 3-D spatial information from docked protein-ligand complexes.
We show that our model is capable of effectively denoising DEL count data to predict molecule enrichment scores.
arXiv Detail & Related papers (2022-11-30T22:00:24Z) - SSM-DTA: Breaking the Barriers of Data Scarcity in Drug-Target Affinity
Prediction [127.43571146741984]
Drug-Target Affinity (DTA) is of vital importance in early-stage drug discovery.
wet experiments remain the most reliable method, but they are time-consuming and resource-intensive.
Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue.
We present the SSM-DTA framework, which incorporates three simple yet highly effective strategies.
arXiv Detail & Related papers (2022-06-20T14:53:25Z) - Surrogate- and invariance-boosted contrastive learning for data-scarce
applications in science [2.959890389883449]
We introduce surrogate- and invariance-boosted contrastive learning (SIB-CL), a deep learning framework which incorporates three inexpensive'' and easily obtainable auxiliary information sources to overcome data scarcity.
We demonstrate SIB-CL's effectiveness and generality on various scientific problems, e.g., predicting the density-of-states of 2D photonic crystals and solving the 3D time-independent Schrodinger equation.
arXiv Detail & Related papers (2021-10-15T23:08:24Z) - Federated Learning of Molecular Properties in a Heterogeneous Setting [79.00211946597845]
We introduce federated heterogeneous molecular learning to address these challenges.
Federated learning allows end-users to build a global model collaboratively while preserving the training data distributed over isolated clients.
FedChem should enable a new type of collaboration for improving AI in chemistry that mitigates concerns about valuable chemical data.
arXiv Detail & Related papers (2021-09-15T12:49:13Z) - Deep Learning for Virtual Screening: Five Reasons to Use ROC Cost
Functions [80.12620331438052]
deep learning has become an important tool for rapid screening of billions of molecules in silico for potential hits containing desired chemical features.
Despite its importance, substantial challenges persist in training these models, such as severe class imbalance, high decision thresholds, and lack of ground truth labels in some datasets.
We argue in favor of directly optimizing the receiver operating characteristic (ROC) in such cases, due to its robustness to class imbalance.
arXiv Detail & Related papers (2020-06-25T08:46:37Z) - MolTrans: Molecular Interaction Transformer for Drug Target Interaction
Prediction [68.5766865583049]
Drug target interaction (DTI) prediction is a foundational task for in silico drug discovery.
Recent years have witnessed promising progress for deep learning in DTI predictions.
We propose a Molecular Interaction Transformer (TransMol) to address these limitations.
arXiv Detail & Related papers (2020-04-23T18:56:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.