Translation between Molecules and Natural Language
- URL: http://arxiv.org/abs/2204.11817v2
- Date: Tue, 26 Apr 2022 17:59:09 GMT
- Title: Translation between Molecules and Natural Language
- Authors: Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Heng Ji
- Abstract summary: We present a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings.
$\textbf{MolT5}$ allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation.
- Score: 43.518805086280466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Joint representations between images and text have been deeply investigated
in the literature. In computer vision, the benefits of incorporating natural
language have become clear for enabling semantic-level control of images. In
this work, we present $\textbf{MolT5}$, a self-supervised learning framework for
pretraining models on a vast amount of unlabeled natural language text and
molecule strings. $\textbf{MolT5}$ allows for new, useful, and challenging
analogs of traditional vision-language tasks, such as molecule captioning and
text-based de novo molecule generation (altogether: translation between
molecules and language), which we explore for the first time. Furthermore,
since $\textbf{MolT5}$ pretrains models on single-modal data, it helps overcome
the chemistry domain shortcoming of data scarcity. Additionally, we consider
several metrics, including a new cross-modal embedding-based metric, to
evaluate the tasks of molecule captioning and text-based molecule generation.
By interfacing molecules with natural language, we enable a higher semantic
level of control over molecule discovery and understanding--a critical task for
scientific domains such as drug discovery and material design. Our results show
that $\textbf{MolT5}$-based models are able to generate outputs, both molecule
and text, which in many cases are high quality and match the input modality. On
molecule generation, our best model achieves 30% exact matching test accuracy
(i.e., it generates the correct structure for about one-third of the captions
in our held-out test set).
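As a concrete illustration of the text-based de novo generation task, the sketch below queries a MolT5 caption-to-SMILES checkpoint through the HuggingFace transformers API and scores the output by canonical-SMILES exact match, the criterion behind the 30% figure above. This is a minimal sketch, assuming the authors' publicly released checkpoint id "laituan245/molt5-large-caption2smiles" and an illustrative caption/reference pair; the paper's exact decoding settings may differ.

```python
# A minimal sketch, not the paper's evaluation harness. The checkpoint id
# below is the authors' public HuggingFace release (an assumption; check
# the MolT5 repository for the exact names of the released models).
from transformers import T5ForConditionalGeneration, T5Tokenizer
from rdkit import Chem

MODEL_ID = "laituan245/molt5-large-caption2smiles"
tokenizer = T5Tokenizer.from_pretrained(MODEL_ID, model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained(MODEL_ID)

# Hypothetical caption; real ChEBI-20 captions are longer and more specific.
caption = "The molecule is the simplest six-membered aromatic hydrocarbon."
inputs = tokenizer(caption, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_length=256)
pred_smiles = tokenizer.decode(outputs[0], skip_special_tokens=True)

def canonicalize(smiles: str):
    """Return the RDKit canonical SMILES, or None for an invalid string."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

# Exact match compares canonical forms, so different spellings of the
# same structure (e.g. "C1=CC=CC=C1" vs "c1ccccc1") still count.
reference = "c1ccccc1"  # benzene
match = canonicalize(pred_smiles) is not None and \
        canonicalize(pred_smiles) == canonicalize(reference)
print(pred_smiles, match)
```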
Related papers
- UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation [35.51027934845928] (2024-08-01)
We introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture.
A Vector Quantization-driven tokenizer transforms molecules into sequences of molecule tokens with causal dependency (see the first sketch after this list).
UniMoT emerges as a multi-modal generalist capable of performing both molecule-to-text and text-to-molecule tasks.
- Data-Efficient Molecular Generation with Hierarchical Textual Inversion [48.816943690420224] (2024-05-05)
We introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method.
HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution.
Unlike the conventional textual inversion method in the image domain, which uses a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution.
- Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373] (2024-03-20)
We introduce TSMMG, a multi-constraint molecular generation large language model that plays the role of a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from a set of 'teachers'.
Experiments show that TSMMG performs remarkably well at generating molecules that meet complex property requirements described in natural language.
- GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text [25.979382232281786] (2023-08-14)
We introduce GIT-Mol, a multi-modal large language model that integrates the Graph, Image, and Text information.
We achieve a 5%-10% accuracy increase in property prediction and a 20.2% boost in molecule generation validity.
- Interactive Molecular Discovery with Natural Language [69.89287960545903] (2023-06-21)
We propose conversational molecular design, a novel task that adopts natural language for describing and editing target molecules.
To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information.
- MolXPT: Wrapping Molecules with Text for Generative Pre-training [141.0924452870112] (2023-05-18)
MolXPT is a unified language model of text and molecules, pre-trained on SMILES wrapped by text (see the second sketch after this list).
MolXPT outperforms strong baselines of molecular property prediction on MoleculeNet.
- A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507] (2022-09-12)
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
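Two of the entries above describe mechanisms concrete enough to unpack. First, for UniMoT, a minimal sketch of the generic vector-quantization step its tokenizer description implies: continuous per-atom embeddings (produced by some molecule encoder, omitted here) are snapped to the nearest entries of a learned codebook, and the resulting indices serve as discrete molecule tokens. The codebook size, embedding width, and random inputs are illustrative assumptions, not UniMoT's actual configuration.

```python
import torch

# Illustrative vector-quantization tokenizer, not UniMoT's actual design.
# 4096 learnable code vectors of width 256 are assumed for the sketch.
codebook = torch.randn(4096, 256)

def quantize(atom_embeddings: torch.Tensor) -> torch.Tensor:
    """Map (num_atoms, 256) embeddings to (num_atoms,) discrete token ids."""
    # L2 distance between every embedding and every codebook vector.
    dists = torch.cdist(atom_embeddings, codebook)  # (num_atoms, 4096)
    return dists.argmin(dim=-1)                     # nearest-code indices

# Stand-in for encoder output on an 18-atom molecule.
tokens = quantize(torch.randn(18, 256))
print(tokens.shape)  # torch.Size([18])
```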
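Second, for MolXPT, a sketch of the "SMILES wrapped by text" idea: molecule names detected in a sentence are replaced by their SMILES strings bracketed with marker tokens, yielding a single mixed sequence for generative pre-training. The name lookup table and the <mol>/</mol> markers are hypothetical stand-ins; MolXPT's actual entity detection and special tokens may differ.

```python
import re

# Hypothetical marker tokens; MolXPT's actual special tokens may differ.
MOL_START, MOL_END = "<mol>", "</mol>"

# Toy name->SMILES lookup standing in for a real molecule-entity linker.
NAME_TO_SMILES = {"aspirin": "CC(=O)Oc1ccccc1C(=O)O", "benzene": "c1ccccc1"}

def wrap_molecules(text: str) -> str:
    """Replace known molecule names with marker-bracketed SMILES strings."""
    def substitute(match: re.Match) -> str:
        smiles = NAME_TO_SMILES[match.group(0).lower()]
        return f"{MOL_START} {smiles} {MOL_END}"
    pattern = re.compile("|".join(NAME_TO_SMILES), re.IGNORECASE)
    return pattern.sub(substitute, text)

print(wrap_molecules("Aspirin inhibits COX enzymes."))
# <mol> CC(=O)Oc1ccccc1C(=O)O </mol> inhibits COX enzymes.
```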
This list is automatically generated from the titles and abstracts of the papers on this site.