Training Text-to-Molecule Models with Context-Aware Tokenization
- URL: http://arxiv.org/abs/2509.04476v2
- Date: Wed, 17 Sep 2025 10:53:53 GMT
- Title: Training Text-to-Molecule Models with Context-Aware Tokenization
- Authors: Seojin Kim, Hyeontae Song, Jaehyun Nam, Jinwoo Shin
- Abstract summary: We propose a novel text-to-molecule model, coined Context-Aware Molecular T5 (CAMT5). Inspired by the significance of substructure-level contexts in understanding molecule structures, we introduce substructure-level tokenization for text-to-molecule models. We develop an importance-based training strategy that prioritizes key substructures, enabling CAMT5 to better capture the molecular semantics.
- Score: 48.35188892892129
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, text-to-molecule models have shown great potential across various chemical applications, e.g., drug discovery. These models adapt language models to molecular data by representing molecules as sequences of atoms. However, they rely on atom-level tokenizations, which primarily focus on modeling local connectivity, thereby limiting the ability of models to capture the global structural context within molecules. To tackle this issue, we propose a novel text-to-molecule model, coined Context-Aware Molecular T5 (CAMT5). Inspired by the significance of substructure-level contexts in understanding molecule structures, e.g., ring systems, we introduce substructure-level tokenization for text-to-molecule models. Building on our tokenization scheme, we develop an importance-based training strategy that prioritizes key substructures, enabling CAMT5 to better capture the molecular semantics. Extensive experiments verify the superiority of CAMT5 in various text-to-molecule generation tasks. Intriguingly, we find that CAMT5 outperforms the state-of-the-art methods using only 2% of training tokens. In addition, we propose a simple yet effective ensemble strategy that aggregates the outputs of text-to-molecule models to further boost the generation performance. Code is available at https://github.com/Songhyeontae/CAMT5.git.
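The abstract does not detail the tokenizer itself, so the following is only a hedged sketch of the general idea of substructure-level tokenization: decompose a molecule into chemically meaningful fragments and treat each fragment, rather than each atom, as a token. The use of RDKit's BRICS decomposition here is an assumption for illustration; CAMT5's actual fragmentation scheme may differ.

```python
# Illustrative sketch of substructure-level tokenization using RDKit's
# BRICS rules. This is NOT CAMT5's tokenizer; it only shows how a molecule
# can be split into substructure tokens instead of atom-level tokens.
from rdkit import Chem
from rdkit.Chem import BRICS

def substructure_tokens(smiles: str) -> list[str]:
    """Return substructure-level tokens (BRICS fragments) for a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    # BRICSDecompose yields fragment SMILES with dummy atoms marking broken
    # bonds, so ring systems survive as single tokens.
    return sorted(BRICS.BRICSDecompose(mol))

# Aspirin: the aromatic ring and functional groups become a handful of
# substructure tokens instead of a long atom-level token sequence.
print(substructure_tokens("CC(=O)Oc1ccccc1C(=O)O"))
```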
Related papers
- Entropy-Guided Dynamic Tokens for Graph-LLM Alignment in Molecular Understanding [13.814119721533508]
Molecular understanding is central to advancing areas such as scientific discovery. Existing graph-LLM bridges often adapt the Q-Former-style connector with fixed-length static tokens. We introduce EDT-Former, an Entropy-guided Dynamic Token Transformer that generates tokens aligned with informative molecular patches.
arXiv Detail & Related papers (2026-02-02T19:56:21Z)
- Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models [43.37148291436855]
We present a two-step framework, PEIT, to improve large language models for molecule-related tasks. In the first step, we use textual descriptions, SMILES, and biochemical properties as multimodal inputs to pre-train a model called PEIT-GEN. In the second step, we fine-tune existing open-source LLMs with the synthesized data; the resulting PEIT-LLM can handle molecule captioning, text-based molecule generation, molecular property prediction, and our newly proposed multi-constraint molecule generation tasks.
arXiv Detail & Related papers (2024-12-24T01:48:07Z)
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
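As a concrete illustration of the masking objective in the entry above, here is a hedged sketch: split a SMILES string into atom-level symbols with a common regex and replace one random contiguous subsequence with a mask token. The paper masks subsequences corresponding to functional groups specifically; the simplified regex and the `<mask>` symbol below are assumptions for illustration.

```python
# Hedged sketch of random SMILES-subsequence masking. The paper targets
# functional-group subsequences; this version masks an arbitrary span.
import random
import re

# A common (simplified) regex for atom-level SMILES tokenization.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|@@|@|%\d{2}|[BCNOPSFIbcnops]|\d|[()=#+\-/\\.]"
)

def mask_random_span(smiles: str, span: int = 3, mask: str = "<mask>") -> str:
    tokens = SMILES_TOKEN.findall(smiles)
    if len(tokens) <= span:
        return mask
    start = random.randrange(len(tokens) - span)
    return "".join(tokens[:start]) + mask + "".join(tokens[start + span:])

print(mask_random_span("CC(=O)Oc1ccccc1C(=O)O"))  # e.g. CC(=O)O<mask>cccc1C(=O)O
```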
- Tokenization for Molecular Foundation Models [0.0]
We systematically evaluate 34 tokenizers, including 19 chemistry-specific ones, and reveal significant gaps in their coverage of the SMILES molecular representation. We propose two new tokenizers -- Smirk and Smirk-GPE -- with full coverage of the OpenSMILES specification.
arXiv Detail & Related papers (2024-09-19T02:36:04Z)
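The coverage gaps reported above can be probed with a simple round-trip test. The sketch below assumes a Hugging Face tokenizer and uses `bert-base-cased` purely as a stand-in; it is not the paper's evaluation harness, and the lossless-join check is WordPiece-specific.

```python
# Hedged sketch of a tokenizer-coverage check: a SMILES string counts as
# covered if tokenizing it introduces no unknown tokens and the tokens
# concatenate back to the original string.
from transformers import AutoTokenizer

def smiles_coverage(tokenizer, smiles_list: list[str]) -> float:
    covered = 0
    for smi in smiles_list:
        tokens = tokenizer.tokenize(smi)
        # "##" continuation markers are WordPiece-specific.
        lossless = "".join(tokens).replace("##", "") == smi
        no_unk = tokenizer.unk_token not in tokens
        covered += lossless and no_unk
    return covered / len(smiles_list)

tok = AutoTokenizer.from_pretrained("bert-base-cased")  # stand-in, not Smirk
print(smiles_coverage(tok, ["CC(=O)Oc1ccccc1C(=O)O", "[Na+].[Cl-]"]))
```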
- Crossing New Frontiers: Knowledge-Augmented Large Language Model Prompting for Zero-Shot Text-Based De Novo Molecule Design [0.0]
Our study explores the use of knowledge-augmented prompting of large language models (LLMs) for the zero-shot text-conditional de novo molecular generation task.
Our framework proves effective, outperforming state-of-the-art (SOTA) baseline models on benchmark datasets.
arXiv Detail & Related papers (2024-08-18T11:37:19Z)
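The entry above describes knowledge-augmented prompting; a minimal sketch of what such a prompt might look like follows. The template wording, the example inputs, and the `build_prompt` helper are hypothetical placeholders, not the paper's actual prompts or API.

```python
# Hedged sketch of a knowledge-augmented zero-shot prompt for text-based
# de novo molecule design. Template and helper names are hypothetical.
def build_prompt(description: str, knowledge: str) -> str:
    return (
        "You are an expert medicinal chemist.\n"
        f"Relevant domain knowledge: {knowledge}\n"
        f"Design a novel molecule that satisfies: {description}\n"
        "Answer with a single valid SMILES string only."
    )

prompt = build_prompt(
    description="a blood-brain-barrier-permeable kinase inhibitor scaffold",
    knowledge="BBB permeability favors low polar surface area and MW < 450.",
)
print(prompt)  # send to an LLM of choice via its completion API
```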
- UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation [35.277927005912275]
We introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture. A Vector Quantization-driven tokenizer transforms molecules into sequences of molecule tokens with causal dependency. UniMoT emerges as a multi-modal generalist capable of performing both molecule-to-text and text-to-molecule tasks.
arXiv Detail & Related papers (2024-08-01T18:31:31Z)
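To make the vector-quantization step in the entry above concrete, here is a hedged numpy sketch: each continuous feature vector is mapped to the index of its nearest codebook entry, yielding a discrete token sequence. The codebook size, embedding dimension, and random features are illustrative assumptions; UniMoT's encoder and causal ordering are not shown.

```python
# Hedged sketch of the quantization step behind a VQ-driven tokenizer:
# continuous embeddings -> indices of nearest codebook entries.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))   # 256 discrete "molecule tokens" (assumed size)
embeddings = rng.normal(size=(10, 64))  # 10 feature vectors for one molecule (stand-in)

# Euclidean distance from every embedding to every codebook entry.
dists = np.linalg.norm(embeddings[:, None, :] - codebook[None, :, :], axis=-1)
token_ids = dists.argmin(axis=1)        # the discrete token sequence
print(token_ids)
```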
- Atomas: Hierarchical Alignment on Molecule-Text for Unified Molecule Understanding and Generation [42.08917809689811]
Cross-modal representation learning has emerged as a promising direction for enhancing the quality of molecular representation. We propose Atomas, a hierarchical molecular representation learning framework that jointly learns representations from SMILES strings and text. Atomas achieves superior performance across 12 tasks on 11 datasets, outperforming 11 baseline models.
arXiv Detail & Related papers (2024-04-23T12:35:44Z)
- Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce TSMMG, a multi-constraint molecular generation large language model that, akin to a student, learns from a set of 'teacher' models. To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers'. We experimentally show that TSMMG performs remarkably well in generating molecules that meet complex, natural-language-described property requirements.
arXiv Detail & Related papers (2024-03-20T02:15:55Z)
- AdaMR: Adaptable Molecular Representation for Unified Pre-training Strategy [11.710702202071573]
We propose a new large-scale uniform pre-training strategy for small-molecule drugs, called Molecular Adjustable Representation (AdaMR).
AdaMR utilizes a granularity-adjustable molecular encoding strategy, which is accomplished through a pre-training job termed molecular canonicalization.
We fine-tuned our proposed pre-trained model on six molecular property prediction tasks and two generative tasks, achieving state-of-the-art (SOTA) results on five out of eight tasks.
arXiv Detail & Related papers (2023-12-28T10:53:17Z)
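The canonicalization pre-training job in the entry above lends itself to a simple illustration: pair a randomized SMILES (input) with the canonical form (target). Using RDKit and `doRandom=True` here is an assumption for the sketch; the paper's exact data pipeline is not specified in this summary.

```python
# Hedged sketch of a canonicalization training pair in the spirit of
# AdaMR's pre-training job: randomized SMILES in, canonical SMILES out.
from rdkit import Chem

def canonicalization_pair(smiles: str) -> tuple[str, str]:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    noisy = Chem.MolToSmiles(mol, canonical=False, doRandom=True)  # random atom order
    target = Chem.MolToSmiles(mol, canonical=True)                 # canonical form
    return noisy, target

print(canonicalization_pair("CC(=O)Oc1ccccc1C(=O)O"))
```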
- MolXPT: Wrapping Molecules with Text for Generative Pre-training [141.0924452870112]
MolXPT is a unified language model of text and molecules pre-trained on SMILES wrapped by text.
MolXPT outperforms strong baselines of molecular property prediction on MoleculeNet.
arXiv Detail & Related papers (2023-05-18T03:58:19Z)
- Reinforced Molecular Optimization with Neighborhood-Controlled Grammars [63.84003497770347]
We propose MNCE-RL, a graph convolutional policy network for molecular optimization.
We extend the original neighborhood-controlled embedding grammars to make them applicable to molecular graph generation.
We show that our approach achieves state-of-the-art performance in a diverse range of molecular optimization tasks.
arXiv Detail & Related papers (2020-11-14T05:42:15Z)
- Self-Supervised Graph Transformer on Large-Scale Molecular Data [73.3448373618865]
We propose a novel framework, GROVER, for molecular representation learning.
GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data.
We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
arXiv Detail & Related papers (2020-06-18T08:37:04Z)