L+M-24: Building a Dataset for Language + Molecules @ ACL 2024
- URL: http://arxiv.org/abs/2403.00791v2
- Date: Thu, 4 Jul 2024 17:21:48 GMT
- Title: L+M-24: Building a Dataset for Language + Molecules @ ACL 2024
- Authors: Carl Edwards, Qingyun Wang, Lawrence Zhao, Heng Ji
- Abstract summary: We detail the $\textit{L+M-24}$ dataset created for the Language + Molecules Workshop shared task at ACL 2024.
In particular, $\textit{L+M-24}$ is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.
- Score: 46.478275217556586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. At this point, datasets have been released which are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the $\textit{L+M-24}$ dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, $\textit{L+M-24}$ is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.
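Datasets of this kind pair each molecule with a natural-language description. As a minimal sketch (the field names `smiles` and `description` are illustrative assumptions, not the dataset's actual schema), a molecule-text pair can be modeled as:

```python
from dataclasses import dataclass

@dataclass
class MoleculeTextPair:
    """One training example: a molecule string paired with a description."""
    smiles: str        # molecule encoded as a SMILES string
    description: str   # natural-language description of the molecule

# A toy example in the style of molecule-captioning datasets.
pair = MoleculeTextPair(
    smiles="CC(=O)Oc1ccccc1C(=O)O",  # aspirin
    description="An analgesic that inhibits prostaglandin synthesis.",
)
print(pair.smiles)
```

Compositionality and abstraction then show up in the text side: one description can combine several functional claims that no single property-prediction label captures.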
Related papers
- MolTextNet: A Two-Million Molecule-Text Dataset for Multimodal Molecular Learning [15.083985098119202]
MolTextNet is a dataset of 2.5 million high-quality molecule-text pairs.
We create structured descriptions for 2.5 million molecules from ChEMBL35, with text over 10 times longer than prior datasets.
arXiv Detail & Related papers (2025-05-15T19:50:11Z)
- Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language [7.458295743918249]
This paper introduces LA$^3$, a Language-based Automatic Augmentation framework that leverages large language models to augment existing datasets.
We demonstrate the effectiveness of LA$^3$ by creating an enhanced dataset, LaChEBI-20, in which we rewrite the annotations of molecules from an established dataset.
We train LaMolT5 based on a benchmark architecture to learn the mapping between molecular representations and augmented annotations.
arXiv Detail & Related papers (2025-02-10T16:29:21Z)
- M$^{3}$-20M: A Large-Scale Multi-Modal Molecule Dataset for AI-driven Drug Design and Discovery [23.60901496004578]
M$^3$-20M contains 71 times more molecules than the largest existing dataset.
This dataset integrates one-dimensional SMILES, two-dimensional molecular graphs, three-dimensional molecular structures, physicochemical properties, and textual descriptions.
arXiv Detail & Related papers (2024-12-08T03:43:07Z)
- G2T-LLM: Graph-to-Tree Text Encoding for Molecule Generation with Fine-Tuned Large Language Models [15.32011692129901]
We introduce G2T-LLM, a novel approach that uses graph-to-tree text encoding to transform graph-based molecular structures into a hierarchical text format optimized for large language models (LLMs).
This encoding converts complex molecular graphs into tree-structured formats, such as JSON and XML, which LLMs are particularly adept at processing due to their extensive pre-training on these types of data.
Through supervised fine-tuning, G2T-LLM generates valid and coherent chemical structures, addressing common challenges like invalid outputs seen in traditional graph-based methods.
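The core idea of graph-to-tree encoding can be sketched with standard-library Python: pick a root atom, take a DFS spanning tree of the molecular graph, and serialize it as nested JSON. This is an illustrative toy (it drops ring-closing back edges and is not the paper's actual encoding):

```python
import json

def graph_to_tree(atoms, bonds, root=0):
    """Encode a molecular graph as a nested tree via a DFS spanning tree.

    atoms: list of element symbols indexed by atom id.
    bonds: dict mapping atom id -> list of neighbour atom ids.
    Ring-closing (back) edges are simply dropped in this sketch.
    """
    visited = set()

    def build(i):
        visited.add(i)
        return {
            "atom": atoms[i],
            "children": [build(j) for j in bonds.get(i, []) if j not in visited],
        }

    return build(root)

# Ethanol's heavy atoms, C-C-O, as a tiny graph.
atoms = ["C", "C", "O"]
bonds = {0: [1], 1: [0, 2], 2: [1]}
tree = graph_to_tree(atoms, bonds)
print(json.dumps(tree, indent=2))
```

The resulting nested-object text is the kind of hierarchical format an LLM has seen abundantly during pre-training, which is the motivation the abstract gives.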
arXiv Detail & Related papers (2024-10-03T04:25:21Z) - Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce a multi-constraint molecular generation large language model, TSMMG, akin to a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers'.
We experimentally show that TSMMG performs remarkably well in generating molecules that meet complex, natural-language-described property requirements.
arXiv Detail & Related papers (2024-03-20T02:15:55Z) - Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z) - GIT-Mol: A Multi-modal Large Language Model for Molecular Science with
Graph, Image, and Text [25.979382232281786]
We introduce GIT-Mol, a multi-modal large language model that integrates the Graph, Image, and Text information.
We achieve a 5%-10% accuracy increase in property prediction and a 20.2% boost in molecule generation validity.
arXiv Detail & Related papers (2023-08-14T03:12:29Z) - Interactive Molecular Discovery with Natural Language [69.89287960545903]
We propose conversational molecular design, a novel task that adopts natural language for describing and editing target molecules.
To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information.
arXiv Detail & Related papers (2023-06-21T02:05:48Z) - Multi-modal Molecule Structure-text Model for Text-based Retrieval and
Editing [107.49804059269212]
We present a multi-modal molecule structure-text model, MoleculeSTM, by jointly learning molecules' chemical structures and textual descriptions.
In experiments, MoleculeSTM obtains state-of-the-art generalization to novel biochemical concepts.
arXiv Detail & Related papers (2022-12-21T06:18:31Z) - Structured information extraction from complex scientific text with
fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 prompt-completion pairs.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
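The fine-tuning data for such a sequence-to-sequence extractor is just text-in, structured-text-out. As a sketch (the sentence, entity types, and relation name below are invented for illustration, not drawn from the paper), one training record might look like:

```python
import json

# Hypothetical prompt/completion pair in the style of seq2seq joint
# entity and relation extraction: the model reads raw text and emits
# the entities and relations serialized as JSON.
example = {
    "prompt": (
        "Extract materials and their properties: "
        "'LiFePO4 shows a capacity of 160 mAh/g.'"
    ),
    "completion": json.dumps({
        "entities": [
            {"text": "LiFePO4", "type": "material"},
            {"text": "160 mAh/g", "type": "property_value"},
        ],
        "relations": [
            {"head": "LiFePO4", "type": "has_capacity", "tail": "160 mAh/g"},
        ],
    }),
}
print(example["completion"])
```

Because the target is plain serialized text, no task-specific architecture is needed, which is what makes the recipe simple and flexible.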
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
- Translation between Molecules and Natural Language [43.518805086280466]
We present a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings.
$\textbf{MolT5}$ allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation.
arXiv Detail & Related papers (2022-04-25T17:48:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.