GPT-MolBERTa: GPT Molecular Features Language Model for molecular
property prediction
- URL: http://arxiv.org/abs/2310.03030v2
- Date: Tue, 10 Oct 2023 17:30:04 GMT
- Title: GPT-MolBERTa: GPT Molecular Features Language Model for molecular
property prediction
- Authors: Suryanarayanan Balaji and Rishikesh Magar and Yayati Jadhav and Amir
Barati Farimani
- Abstract summary: We present GPT-MolBERTa, a self-supervised large language model (LLM) that uses detailed textual descriptions of molecules to predict their properties.
Text-based descriptions of 326,000 molecules were collected using ChatGPT and used to train the LLM to learn molecular representations.
Experiments show that GPT-MolBERTa performs well on various molecular property benchmarks and approaches state-of-the-art performance in regression tasks.
- Score: 6.349503549199403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the emergence of Transformer architectures and their powerful
understanding of textual data, a new horizon has opened up for predicting
molecular properties from text descriptions. While SMILES strings are the most
common form of representation, they lack robustness, rich information, and
canonicity, which limits their effectiveness as generalizable representations.
Here, we present GPT-MolBERTa, a self-supervised large language model (LLM)
that uses detailed textual descriptions of molecules to predict their
properties. Text-based descriptions of 326,000 molecules were collected using
ChatGPT and used to train the LLM to learn molecular representations. To
predict properties in the downstream tasks, both BERT and RoBERTa models were
used in the fine-tuning stage. Experiments show that GPT-MolBERTa performs well
on various molecular property benchmarks and approaches state-of-the-art
performance in regression tasks. Additionally, analysis of the attention
mechanisms shows that GPT-MolBERTa picks up important information from the
input textual data, demonstrating the interpretability of the model.
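To make the pipeline above concrete, here is a minimal sketch, assuming the Hugging Face transformers library, of fine-tuning a RoBERTa encoder with a regression head on ChatGPT-generated molecule descriptions. It is not the authors' released code; the CSV file, column names ("description", "target"), checkpoint, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code) of a GPT-MolBERTa-style
# fine-tuning step: a RoBERTa encoder with a single-output regression head is
# trained on free-text molecule descriptions.
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (RobertaForSequenceClassification, RobertaTokenizerFast,
                          Trainer, TrainingArguments)

class MoleculeTextDataset(Dataset):
    """Pairs a textual molecule description with a numeric property label."""
    def __init__(self, texts, labels, tokenizer, max_len=512):
        self.enc = tokenizer(list(texts), truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(labels, dtype=torch.float)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]  # float label -> MSE regression loss
        return item

# Hypothetical CSV: "description" holds ChatGPT-generated text, "target" a property value
df = pd.read_csv("molecule_descriptions.csv")
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
train_ds = MoleculeTextDataset(df["description"], df["target"].values, tokenizer)

# num_labels=1 with float labels makes the classification head act as a regressor
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1, problem_type="regression")

args = TrainingArguments(output_dir="gpt_molberta_ft", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```

For classification benchmarks, the same setup would use integer labels and set num_labels to the number of classes instead of the regression head shown here.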
Related papers
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms (a minimal sketch of this kind of masking appears after this list).
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z) - MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction [14.353313239109337]
MolTRES is a novel chemical language representation learning framework.
It incorporates generator-discriminator training, allowing the model to learn from more challenging examples.
Our model outperforms existing state-of-the-art models on popular molecular property prediction tasks.
arXiv Detail & Related papers (2024-07-09T01:14:28Z) - Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce a multi-constraint molecular generation large language model, TSMMG, akin to a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers'.
We experimentally show that TSMMG performs remarkably well at generating molecules that meet complex, natural-language-described property requirements.
arXiv Detail & Related papers (2024-03-20T02:15:55Z) - Can Large Language Models Empower Molecular Property Prediction? [16.5246941211725]
Molecular property prediction has gained significant attention due to its transformative potential in scientific disciplines.
Recently, the rapid development of Large Language Models (LLMs) has revolutionized the field of NLP.
In this work, we advance towards this objective through two perspectives: zero/few-shot molecular classification, and using the new explanations generated by LLMs as representations of molecules.
arXiv Detail & Related papers (2023-07-14T16:06:42Z) - GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule
Zero-Shot Learning [71.89623260998934]
This study investigates the feasibility of employing natural language instructions to accomplish molecule-related tasks in a zero-shot setting.
Existing molecule-text models perform poorly in this setting due to inadequate treatment of instructions and limited capacity for graphs.
We propose GIMLET, which unifies language models for both graph and text data.
arXiv Detail & Related papers (2023-05-28T18:27:59Z) - MolXPT: Wrapping Molecules with Text for Generative Pre-training [141.0924452870112]
MolXPT is a unified language model of text and molecules pre-trained on SMILES wrapped by text.
MolXPT outperforms strong baselines of molecular property prediction on MoleculeNet.
arXiv Detail & Related papers (2023-05-18T03:58:19Z) - MolCPT: Molecule Continuous Prompt Tuning to Generalize Molecular
Representation Learning [77.31492888819935]
We propose a novel paradigm of "pre-train, prompt, fine-tune" for molecular representation learning, named molecule continuous prompt tuning (MolCPT).
MolCPT defines a motif prompting function that uses the pre-trained model to project the standalone input into an expressive prompt.
Experiments on several benchmark datasets show that MolCPT efficiently generalizes pre-trained GNNs for molecular property prediction.
arXiv Detail & Related papers (2022-12-20T19:32:30Z) - KPGT: Knowledge-Guided Pre-training of Graph Transformer for Molecular
Property Prediction [13.55018269009361]
We introduce Knowledge-guided Pre-training of Graph Transformer (KPGT), a novel self-supervised learning framework for molecular graph representation learning.
KPGT can offer superior performance over current state-of-the-art methods on several molecular property prediction tasks.
arXiv Detail & Related papers (2022-06-02T08:22:14Z) - Do Large Scale Molecular Language Representations Capture Important
Structural Information? [31.76876206167457]
We present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer.
Experiments show that the learned molecular representation performs competitively, when compared to graph-based and fingerprint-based supervised learning baselines.
arXiv Detail & Related papers (2021-06-17T14:33:55Z) - Self-Supervised Graph Transformer on Large-Scale Molecular Data [73.3448373618865]
We propose a novel framework, GROVER, for molecular representation learning.
GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data.
We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
arXiv Detail & Related papers (2020-06-18T08:37:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.