GPT-MolBERTa: GPT Molecular Features Language Model for molecular
property prediction
- URL: http://arxiv.org/abs/2310.03030v2
- Date: Tue, 10 Oct 2023 17:30:04 GMT
- Title: GPT-MolBERTa: GPT Molecular Features Language Model for molecular
property prediction
- Authors: Suryanarayanan Balaji and Rishikesh Magar and Yayati Jadhav and Amir
Barati Farimani
- Abstract summary: We present GPT-MolBERTa, a self-supervised large language model (LLM) which uses detailed textual descriptions of molecules to predict their properties.
A text based description of 326000 molecules were collected using ChatGPT and used to train LLM to learn the representation of molecules.
Experiments show that GPT-MolBERTa performs well on various molecule property benchmarks, and approaching state of the art performance in regression tasks.
- Score: 6.349503549199403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the emergence of Transformer architectures and their powerful
understanding of textual data, a new horizon has opened up to predict the
molecular properties based on text description. While SMILES are the most
common form of representation, they are lacking robustness, rich information
and canonicity, which limit their effectiveness in becoming generalizable
representations. Here, we present GPT-MolBERTa, a self-supervised large
language model (LLM) which uses detailed textual descriptions of molecules to
predict their properties. A text based description of 326000 molecules were
collected using ChatGPT and used to train LLM to learn the representation of
molecules. To predict the properties for the downstream tasks, both BERT and
RoBERTa models were used in the finetuning stage. Experiments show that
GPT-MolBERTa performs well on various molecule property benchmarks, and
approaching state of the art performance in regression tasks. Additionally,
further analysis of the attention mechanisms show that GPT-MolBERTa is able to
pick up important information from the input textual data, displaying the
interpretability of the model.
Related papers
- Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [50.756644656847165]
We introduce a multi-constraint molecular generation large language model, TSMMG, akin to a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers'
We experimentally show that TSMMG remarkably performs in generating molecules meeting complex, natural language-described property requirements.
arXiv Detail & Related papers (2024-03-20T02:15:55Z) - Extracting Molecular Properties from Natural Language with Multimodal
Contrastive Learning [1.3717673827807508]
We study how molecular property information can be transferred from natural language to graph representations.
We implement neural relevance scoring strategies to improve text retrieval, introduce a novel chemically-valid molecular graph augmentation strategy.
We achieve a +4.26% AUROC gain versus models pre-trained on the graph modality alone, and a +1.54% gain compared to recently proposed molecular graph/text contrastively trained MoMu model.
arXiv Detail & Related papers (2023-07-22T10:32:58Z) - Can Large Language Models Empower Molecular Property Prediction? [16.5246941211725]
Molecular property prediction has gained significant attention due to its transformative potential in scientific disciplines.
Recently, the rapid development of Large Language Models (LLMs) has revolutionized the field of NLP.
In this work, we advance towards this objective through two perspectives: zero/few-shot molecular classification, and using the new explanations generated by LLMs as representations of molecules.
arXiv Detail & Related papers (2023-07-14T16:06:42Z) - GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule
Zero-Shot Learning [71.89623260998934]
This study investigates the feasibility of employing natural language instructions to accomplish molecule-related tasks in a zero-shot setting.
Existing molecule-text models perform poorly in this setting due to inadequate treatment of instructions and limited capacity for graphs.
We propose GIMLET, which unifies language models for both graph and text data.
arXiv Detail & Related papers (2023-05-28T18:27:59Z) - MolXPT: Wrapping Molecules with Text for Generative Pre-training [141.0924452870112]
MolXPT is a unified language model of text and molecules pre-trained on SMILES wrapped by text.
MolXPT outperforms strong baselines of molecular property prediction on MoleculeNet.
arXiv Detail & Related papers (2023-05-18T03:58:19Z) - MolCPT: Molecule Continuous Prompt Tuning to Generalize Molecular
Representation Learning [77.31492888819935]
We propose a novel paradigm of "pre-train, prompt, fine-tune" for molecular representation learning, named molecule continuous prompt tuning (MolCPT)
MolCPT defines a motif prompting function that uses the pre-trained model to project the standalone input into an expressive prompt.
Experiments on several benchmark datasets show that MolCPT efficiently generalizes pre-trained GNNs for molecular property prediction.
arXiv Detail & Related papers (2022-12-20T19:32:30Z) - Graph neural networks for the prediction of molecular structure-property
relationships [59.11160990637615]
Graph neural networks (GNNs) are a novel machine learning method that directly work on the molecular graph.
GNNs allow to learn properties in an end-to-end fashion, thereby avoiding the need for informative descriptors.
We describe the fundamentals of GNNs and demonstrate the application of GNNs via two examples for molecular property prediction.
arXiv Detail & Related papers (2022-07-25T11:30:44Z) - KPGT: Knowledge-Guided Pre-training of Graph Transformer for Molecular
Property Prediction [13.55018269009361]
We introduce Knowledge-guided Pre-training of Graph Transformer (KPGT), a novel self-supervised learning framework for molecular graph representation learning.
KPGT can offer superior performance over current state-of-the-art methods on several molecular property prediction tasks.
arXiv Detail & Related papers (2022-06-02T08:22:14Z) - Do Large Scale Molecular Language Representations Capture Important
Structural Information? [31.76876206167457]
We present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer.
Experiments show that the learned molecular representation performs competitively, when compared to graph-based and fingerprint-based supervised learning baselines.
arXiv Detail & Related papers (2021-06-17T14:33:55Z) - Self-Supervised Graph Transformer on Large-Scale Molecular Data [73.3448373618865]
We propose a novel framework, GROVER, for molecular representation learning.
GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data.
We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
arXiv Detail & Related papers (2020-06-18T08:37:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.