MolXPT: Wrapping Molecules with Text for Generative Pre-training
- URL: http://arxiv.org/abs/2305.10688v2
- Date: Fri, 26 May 2023 04:35:46 GMT
- Title: MolXPT: Wrapping Molecules with Text for Generative Pre-training
- Authors: Zequn Liu, Wei Zhang, Yingce Xia, Lijun Wu, Shufang Xie, Tao Qin, Ming Zhang and Tie-Yan Liu
- Abstract summary: MolXPT is a unified language model of text and molecules pre-trained on SMILES wrapped by text.
MolXPT outperforms strong baselines for molecular property prediction on MoleculeNet.
- Score: 141.0924452870112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative pre-trained Transformer (GPT) has demonstrated great success
in natural language processing, and related techniques have been adapted to
molecular modeling. Considering that text is the most important record of
scientific discovery, in this paper we propose MolXPT, a unified language
model of text and molecules pre-trained on SMILES (a sequence representation of
molecules) wrapped by text. Briefly, we detect the molecule names in each
sequence and replace them with the corresponding SMILES. In this way, the SMILES
can leverage information from the surrounding text, and vice versa. The
wrapped sequences, together with text sequences from PubMed and SMILES sequences
from PubChem, are fed into a language model for pre-training. Experimental results
demonstrate that MolXPT outperforms strong baselines for molecular property
prediction on MoleculeNet, performs comparably to the best model in
text-molecule translation while using less than half of its parameters, and
enables zero-shot molecular generation without finetuning.
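To make the wrapping step concrete, below is a minimal Python sketch of the idea: molecule names found in text are swapped for their SMILES strings, with boundary tokens marking where a molecule begins and ends. The fixed dictionary and the <som>/<eom> token names are illustrative stand-ins; a real pipeline would use a chemical named-entity recognizer plus a database (e.g., PubChem) lookup.

```python
import re

# Toy name-to-SMILES table; a real pipeline would use a chemical NER
# model plus a database lookup instead of a hard-coded dictionary.
NAME_TO_SMILES = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
}

def wrap_text_with_smiles(text: str) -> str:
    """Replace recognized molecule names with their SMILES, delimited by
    hypothetical <som>/<eom> boundary tokens so the language model can
    tell where text ends and a molecule begins."""
    pattern = re.compile("|".join(re.escape(n) for n in NAME_TO_SMILES),
                         re.IGNORECASE)
    return pattern.sub(
        lambda m: f"<som> {NAME_TO_SMILES[m.group(0).lower()]} <eom>", text)

print(wrap_text_with_smiles("Aspirin irreversibly inhibits COX-1."))
# -> <som> CC(=O)Oc1ccccc1C(=O)O <eom> irreversibly inhibits COX-1.
```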
Related papers
- LDMol: Text-Conditioned Molecule Diffusion Model Leveraging Chemically Informative Latent Space [55.5427001668863]
We present a novel latent diffusion model dubbed LDMol, which enables natural text-conditioned molecule generation.
Specifically, LDMol is composed of three building blocks: a molecule encoder that produces a chemically informative feature space, a natural-language-conditioned latent diffusion model using a Diffusion Transformer (DiT), and an autoregressive decoder that maps latents back to molecules.
arXiv Detail & Related papers (2024-05-28T04:59:13Z) - Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [50.756644656847165]
We introduce TSMMG, a multi-constraint molecular generation large language model that, akin to a student, learns from a set of 'teacher' models and tools.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers'.
We show experimentally that TSMMG performs remarkably well at generating molecules that meet complex property requirements described in natural language.
arXiv Detail & Related papers (2024-03-20T02:15:55Z) - Text-Guided Molecule Generation with Diffusion Language Model [23.170313481324598]
We propose the Text-Guided Molecule Generation with Diffusion Language Model (TGM-DLM).
TGM-DLM updates token embeddings within the SMILES string collectively and iteratively, using a two-phase diffusion generation process.
We demonstrate that TGM-DLM outperforms MolT5-Base, an autoregressive model, without the need for additional data resources.
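As a rough illustration of what updating token embeddings "collectively and iteratively" means, here is a generic embedding-space diffusion loop in PyTorch. The DummyDenoiser and the single-loop schedule are placeholder assumptions, not TGM-DLM's actual two-phase process or network.

```python
import torch
import torch.nn as nn

class DummyDenoiser(nn.Module):
    """Placeholder for the Transformer denoiser a diffusion LM would use."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Linear(2 * dim + 1, dim)

    def forward(self, x, t, cond):
        # Broadcast the timestep and text condition to every token
        # position, then predict a cleaner embedding for each token.
        t_feat = torch.full((x.size(0), 1), float(t))
        cond_feat = cond.expand(x.size(0), -1)
        return self.net(torch.cat([x, cond_feat, t_feat], dim=-1))

def denoise(denoiser, text_cond, seq_len=64, dim=256, steps=50):
    x = torch.randn(seq_len, dim)      # all SMILES positions start as noise
    for t in reversed(range(steps)):   # every step refines ALL tokens jointly
        x = denoiser(x, t, text_cond)
    return x                           # map back to SMILES tokens afterwards

emb = denoise(DummyDenoiser(), text_cond=torch.randn(1, 256))
```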
arXiv Detail & Related papers (2024-02-20T14:29:02Z) - GPT-MolBERTa: GPT Molecular Features Language Model for molecular property prediction [6.349503549199403]
We present GPT-MolBERTa, a self-supervised large language model (LLM) which uses detailed textual descriptions of molecules to predict their properties.
Text-based descriptions of 326,000 molecules were collected using ChatGPT and used to train the LLM to learn molecular representations.
Experiments show that GPT-MolBERTa performs well on various molecular property benchmarks and approaches state-of-the-art performance on regression tasks.
arXiv Detail & Related papers (2023-09-20T17:21:43Z) - Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose MolReGPT, an in-context few-shot molecule learning paradigm for molecule-caption translation.
We evaluate the effectiveness of MolReGPT on molecule-caption translation, including molecule understanding and text-based molecule generation.
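The in-context paradigm can be pictured as prompt construction: retrieve a few similar caption/SMILES pairs and prepend them to the query. The retrieval step and prompt wording below are illustrative assumptions, not MolReGPT's exact templates.

```python
def build_fewshot_prompt(query_caption, examples):
    """Assemble an in-context prompt from retrieved caption/SMILES pairs.
    How the examples are retrieved (e.g., by caption or molecule
    similarity) is the crux of the method and is omitted here."""
    lines = ["Translate each molecule description into a SMILES string.", ""]
    for caption, smiles in examples:
        lines += [f"Description: {caption}", f"SMILES: {smiles}", ""]
    lines += [f"Description: {query_caption}", "SMILES:"]
    return "\n".join(lines)

prompt = build_fewshot_prompt(
    "An aromatic carboxylic acid with an acetoxy substituent.",
    [("A simple aromatic carboxylic acid.", "OC(=O)c1ccccc1")],
)
```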
arXiv Detail & Related papers (2023-06-11T08:16:25Z) - Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing [107.49804059269212]
We present a multi-modal molecule structure-text model, MoleculeSTM, by jointly learning molecules' chemical structures and textual descriptions.
In experiments, MoleculeSTM obtains the state-of-the-art generalization ability to novel biochemical concepts.
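Joint structure-text learning of this kind is typically driven by a CLIP-style contrastive objective. The sketch below shows that generic form under the assumption of paired molecule/text embeddings, not MoleculeSTM's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(mol_emb, text_emb, temperature=0.07):
    """CLIP-style loss aligning the i-th molecule with the i-th text:
    a generic sketch of joint structure-text pre-training."""
    mol = F.normalize(mol_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = mol @ txt.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(len(logits))           # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```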
arXiv Detail & Related papers (2022-12-21T06:18:31Z) - A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z) - MolScribe: Robust Molecular Structure Recognition with Image-To-Graph Generation [28.93523736883784]
MolScribe is an image-to-graph model that explicitly predicts atoms and bonds, along with their geometric layouts, to construct the molecular structure.
MolScribe significantly outperforms previous models, achieving 76-93% accuracy on public benchmarks.
arXiv Detail & Related papers (2022-05-28T03:03:45Z) - Translation between Molecules and Natural Language [43.518805086280466]
We present a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings.
MolT5 allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation.
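For a sense of how such a pretrained translator is used, here is a typical Hugging Face seq2seq call. The checkpoint name is assumed from the authors' public release and should be verified; any T5-style caption-to-SMILES checkpoint would slot in the same way.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Checkpoint name assumed from the authors' release; verify before use.
name = "laituan245/molt5-base-caption2smiles"
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

caption = "The molecule is a simple aromatic carboxylic acid."
inputs = tokenizer(caption, return_tensors="pt")
out = model.generate(**inputs, num_beams=5, max_length=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # a SMILES string
```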
arXiv Detail & Related papers (2022-04-25T17:48:09Z)