Related papers: Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language

Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language

URL: http://arxiv.org/abs/2502.06634v1
Date: Mon, 10 Feb 2025 16:29:21 GMT
Title: Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language
Authors: Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin,
Abstract summary: This paper introduces LA$3$, a Language-based Automatic Augmentation framework that leverages large language models to augment existing datasets.<n>We demonstrate the effectiveness of LA$3$ by creating an enhanced dataset, LaChEBI-20, where we rewrite the annotations of molecules from an established dataset.<n>We train LaMolT5 based on a benchmark architecture to learn the mapping between molecular representations and augmented annotations.
Score: 7.458295743918249
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in AI for biological research focus on integrating molecular data with natural language to accelerate drug discovery. However, the scarcity of high-quality annotations limits progress in this area. This paper introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework that leverages large language models to augment existing datasets, thereby improving AI training. We demonstrate the effectiveness of LA$^3$ by creating an enhanced dataset, LaChEBI-20, where we systematically rewrite the annotations of molecules from an established dataset. These rewritten annotations preserve essential molecular information while providing more varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 based on a benchmark architecture to learn the mapping between molecular representations and augmented annotations. Experimental results on text-based *de novo* molecule generation and molecule captioning demonstrate that LaMolT5 outperforms state-of-the-art models. Notably, incorporating LA$^3$ leads to improvements of up to 301% over the benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$ notable applications in *image*, *text* and *graph* tasks, affirming its versatility and utility.

Related papers

Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based underlineem Molecular underlineem Language underlineem Model, which randomly masking SMILES subsequences corresponding to specific molecular atoms. This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
Instruction-Based Molecular Graph Generation with Unified Text-Graph Diffusion Model [22.368332915420606]
Unified Text-Graph Diffusion Model (UTGDiff) is a framework to generate molecular graphs from instructions. UTGDiff features a unified text-graph transformer as the denoising network, derived from pre-trained language models. Our experimental results demonstrate that UTGDiff consistently outperforms sequence-based baselines in tasks involving instruction-based molecule generation and editing.
arXiv Detail & Related papers (2024-08-19T11:09:15Z)
Crossing New Frontiers: Knowledge-Augmented Large Language Model Prompting for Zero-Shot Text-Based De Novo Molecule Design [0.0]
Our study explores the use of knowledge-augmented prompting of large language models (LLMs) for the zero-shot text-conditional de novo molecular generation task. Our framework proves effective, outperforming state-of-the-art (SOTA) baseline models on benchmark datasets.
arXiv Detail & Related papers (2024-08-18T11:37:19Z)
HIGHT: Hierarchical Graph Tokenization for Molecule-Language Alignment [41.75926736949724]
We propose a graph tokenizer that encodes the hierarchy of atom, motif, and molecular levels of informative tokens to improve the molecular perception of large language models (LLMs)<n>Experiments on 14 real-world benchmarks verify the effectiveness of HIGHT in reducing hallucination by 40%, and significant improvements in various molecule-language downstream tasks.
arXiv Detail & Related papers (2024-06-20T06:37:35Z)
LDMol: A Text-to-Molecule Diffusion Model with Structurally Informative Latent Space Surpasses AR Models [55.5427001668863]
We present a novel latent diffusion model dubbed LDMol for text-conditioned molecule generation.<n> Experiments show that LDMol outperforms the existing autoregressive baselines on the text-to-molecule generation benchmark.<n>We show that LDMol can be applied to downstream tasks such as molecule-to-text retrieval and text-guided molecule editing.
arXiv Detail & Related papers (2024-05-28T04:59:13Z)
Data-Efficient Molecular Generation with Hierarchical Textual Inversion [48.816943690420224]
We introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method. HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution. Compared to the conventional textual inversion method in the image domain using a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution.
arXiv Detail & Related papers (2024-05-05T08:35:23Z)
Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey [75.47055414002571]
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. We provide an analysis of recent advancements achieved through cross modeling of biomolecules and natural language.
arXiv Detail & Related papers (2024-03-03T14:59:47Z)
GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text [25.979382232281786]
We introduce GIT-Mol, a multi-modal large language model that integrates the Graph, Image, and Text information. We achieve a 5%-10% accuracy increase in properties prediction and a 20.2% boost in molecule generation validity.
arXiv Detail & Related papers (2023-08-14T03:12:29Z)
Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations [68.32093648671496]
We introduce GODE, which accounts for the dual-level structure inherent in molecules. Molecules possess an intrinsic graph structure and simultaneously function as nodes within a broader molecular knowledge graph. By pre-training two GNNs on different graph structures, GODE effectively fuses molecular structures with their corresponding knowledge graph substructures.
arXiv Detail & Related papers (2023-06-02T15:49:45Z)
GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning [71.89623260998934]
This study investigates the feasibility of employing natural language instructions to accomplish molecule-related tasks in a zero-shot setting. Existing molecule-text models perform poorly in this setting due to inadequate treatment of instructions and limited capacity for graphs. We propose GIMLET, which unifies language models for both graph and text data.
arXiv Detail & Related papers (2023-05-28T18:27:59Z)
Translation between Molecules and Natural Language [43.518805086280466]
We present a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. $textbfMolT5$ allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation.
arXiv Detail & Related papers (2022-04-25T17:48:09Z)
Self-Supervised Graph Transformer on Large-Scale Molecular Data [73.3448373618865]
We propose a novel framework, GROVER, for molecular representation learning. GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data. We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
arXiv Detail & Related papers (2020-06-18T08:37:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.