Can Pre-trained Models Really Learn Better Molecular Representations for
AI-aided Drug Discovery?
- URL: http://arxiv.org/abs/2209.07423v1
- Date: Sun, 21 Aug 2022 10:05:25 GMT
- Title: Can Pre-trained Models Really Learn Better Molecular Representations for
AI-aided Drug Discovery?
- Authors: Ziqiao Zhang, Yatao Bian, Ailin Xie, Pengju Han, Long-Kai Huang,
Shuigeng Zhou
- Abstract summary: We propose a method named Representation-Property Relationship Analysis (RePRA) to evaluate the quality of representations extracted by the pre-trained model.
Two scores are designed to measure the generalized ACs and SH detected by RePRA.
In experiments, representations of molecules from 10 target tasks generated by 7 pre-trained models are analyzed.
- Score: 22.921555120408907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised pre-training is gaining increasingly more popularity in
AI-aided drug discovery, leading to more and more pre-trained models with the
promise that they can extract better feature representations for molecules.
Yet, the quality of the learned representations has not been fully explored. In
this work, inspired by the two phenomena of Activity Cliffs (ACs) and Scaffold
Hopping (SH) in traditional Quantitative Structure-Activity Relationship (QSAR)
analysis, we propose a method named Representation-Property Relationship
Analysis (RePRA) to evaluate the quality of the representations extracted by
the pre-trained model and visualize the relationship between the
representations and properties. The concepts of ACs and SH are generalized from
the structure-activity context to the representation-property context, and the
underlying principles of RePRA are analyzed theoretically. Two scores are
designed to measure the generalized ACs and SH detected by RePRA, and therefore
the quality of representations can be evaluated. In experiments,
representations of molecules from 10 target tasks generated by 7 pre-trained
models are analyzed. The results indicate that the state-of-the-art pre-trained
models can overcome some shortcomings of canonical Extended-Connectivity
FingerPrints (ECFP), while the correlation between the basis of the
representation space and specific molecular substructures is not explicit.
Thus, some representations could be even worse than the canonical fingerprints.
Our method enables researchers to evaluate the quality of molecular
representations generated by their proposed self-supervised pre-trained models.
Our findings can also guide the community to develop better pre-training
techniques that regularize the occurrence of ACs and SH.
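The core idea behind RePRA, flagging pairs of molecules whose representations are close but whose properties differ sharply (generalized ACs) and pairs whose representations are far apart despite nearly identical properties (generalized SH), can be sketched with plain pairwise distances. The NumPy snippet below is a minimal illustration under that reading of the abstract: the cosine distance, the quantile-based thresholds, and the function name `detect_acs_sh` are assumptions made for illustration, not the paper's actual scores or thresholds.

```python
import numpy as np


def pairwise_cosine_distance(X):
    """Cosine distance between all rows of X (n_samples x n_dims)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T


def detect_acs_sh(reps, props, q=0.1):
    """Rate of pairs that behave like generalized Activity Cliffs (ACs)
    and Scaffold Hopping (SH) in representation space.

    reps  : (n, d) array of molecular representations (hypothetical input).
    props : (n,) array of the corresponding property/activity values.
    q     : illustrative quantile cut-off, NOT the paper's scoring scheme.
    """
    d_rep = pairwise_cosine_distance(reps)             # representation distance
    d_prop = np.abs(props[:, None] - props[None, :])   # property difference

    iu = np.triu_indices(len(props), k=1)              # unique pairs only
    d_rep, d_prop = d_rep[iu], d_prop[iu]

    rep_lo, rep_hi = np.quantile(d_rep, [q, 1 - q])
    prop_lo, prop_hi = np.quantile(d_prop, [q, 1 - q])

    acs = np.mean((d_rep <= rep_lo) & (d_prop >= prop_hi))  # close reps, very different property
    sh = np.mean((d_rep >= rep_hi) & (d_prop <= prop_lo))   # distant reps, similar property
    return acs, sh


# Toy usage with random "representations" and properties.
rng = np.random.default_rng(0)
reps = rng.normal(size=(100, 64))
props = rng.normal(size=100)
ac_rate, sh_rate = detect_acs_sh(reps, props)
print(f"generalized-AC rate: {ac_rate:.3f}, generalized-SH rate: {sh_rate:.3f}")
```

In this reading, a lower generalized-AC rate suggests the representation space places property-divergent molecules farther apart, which is the kind of behavior one would want a pre-trained encoder to improve over a fixed fingerprint such as ECFP.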
Related papers
- Analyzing Generative Models by Manifold Entropic Metrics [8.477943884416023]
We introduce a novel set of tractable information-theoretic evaluation metrics.
We compare various normalizing flow architectures and $\beta$-VAEs on the EMNIST dataset.
The most interesting finding of our experiments is a ranking of model architectures and training procedures in terms of their inductive bias to converge to aligned and disentangled representations during training.
arXiv Detail & Related papers (2024-10-25T09:35:00Z) - Revealing Multimodal Contrastive Representation Learning through Latent
Partial Causal Models [85.67870425656368]
We introduce a unified causal model specifically designed for multimodal data.
We show that multimodal contrastive representation learning excels at identifying latent coupled variables.
Experiments demonstrate the robustness of our findings, even when the assumptions are violated.
arXiv Detail & Related papers (2024-02-09T07:18:06Z) - Co-modeling the Sequential and Graphical Routes for Peptide
Representation Learning [67.66393016797181]
We propose a peptide co-modeling method, RepCon, to enhance the mutual information of representations from decoupled sequential and graphical end-to-end models.
RepCon learns to enhance the consistency of representations between positive sample pairs and to repel representations between negative pairs.
Our results demonstrate the superiority of the co-modeling approach over independent modeling, as well as the superiority of RepCon over other methods under the co-modeling framework.
arXiv Detail & Related papers (2023-10-04T16:58:25Z) - Learning disentangled representations for explainable chest X-ray
classification using Dirichlet VAEs [68.73427163074015]
This study explores the use of the Dirichlet Variational Autoencoder (DirVAE) for learning disentangled latent representations of chest X-ray (CXR) images.
The predictive capacity of multi-modal latent representations learned by DirVAE models is investigated through implementation of an auxiliary multi-label classification task.
arXiv Detail & Related papers (2023-02-06T18:10:08Z) - BARTSmiles: Generative Masked Language Models for Molecular
Representations [10.012900591467938]
We train BARTSmiles, a BART-like model with an order of magnitude more compute than previous self-supervised molecular representations.
In-depth evaluations show that BARTSmiles consistently outperforms other self-supervised representations across classification, regression, and generation tasks.
arXiv Detail & Related papers (2022-11-29T16:30:53Z) - From Distillation to Hard Negative Sampling: Making Sparse Neural IR
Models More Effective [15.542082655342476]
We build on SPLADE -- a sparse expansion-based retriever -- and show to what extent it can benefit from the same training improvements as dense models.
We study the link between effectiveness and efficiency, on in-domain and zero-shot settings.
arXiv Detail & Related papers (2022-05-10T08:08:43Z) - Explain, Edit, and Understand: Rethinking User Study Design for
Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - Layer-wise Analysis of a Self-supervised Speech Representation Model [26.727775920272205]
Self-supervised learning approaches have been successful for pre-training speech representation models.
However, little has been studied about the type or extent of information encoded in the pre-trained representations themselves.
arXiv Detail & Related papers (2021-07-10T02:13:25Z) - Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [54.94763543386523]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors.
We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method.
Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z) - Rethinking Generalization of Neural Models: A Named Entity Recognition
Case Study [81.11161697133095]
We take the NER task as a testbed to analyze the generalization behavior of existing models from different perspectives.
Experiments with in-depth analyses diagnose the bottleneck of existing neural NER models.
As a by-product of this paper, we have open-sourced a project that involves a comprehensive summary of recent NER papers.
arXiv Detail & Related papers (2020-01-12T04:33:53Z)