Fugu-MT Paper Translation (Abstract): Training Text-to-Molecule Models with Context-Aware Tokenization

Paper summary: Training Text-to-Molecule Models with Context-Aware Tokenization

arxiv url: http://arxiv.org/abs/2509.04476v1
Date: Sat, 30 Aug 2025 07:59:02 GMT
Status: Translation complete
System update date: 2025-09-08 14:27:25.313598
Title: Training Text-to-Molecule Models with Context-Aware Tokenization
Authors: Seojin Kim, Hyeontae Song, Jaehyun Nam, Jinwoo Shin
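Abstract summary: We propose a novel text-to-molecule model called Context-Aware Molecular T5 (CAMT5). Inspired by the importance of substructure-level context in understanding molecular structures, we introduce substructure-level tokenization for text-to-molecule models. We develop an importance-based training strategy that prioritizes key substructures so that CAMT5 can better capture molecular semantics.
Reference score (independently computed attention score): 48.35188892892129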
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, text-to-molecule models have shown great potential across various chemical applications, e.g., drug discovery. These models adapt language models to molecular data by representing molecules as sequences of atoms. However, they rely on atom-level tokenizations, which primarily focus on modeling local connectivity, thereby limiting the ability of models to capture the global structural context within molecules. To tackle this issue, we propose a novel text-to-molecule model, coined Context-Aware Molecular T5 (CAMT5). Inspired by the significance of the substructure-level contexts in understanding molecule structures, e.g., ring systems, we introduce substructure-level tokenization for text-to-molecule models. Building on our tokenization scheme, we develop an importance-based training strategy that prioritizes key substructures, enabling CAMT5 to better capture the molecular semantics. Extensive experiments verify the superiority of CAMT5 in various text-to-molecule generation tasks. Intriguingly, we find that CAMT5 outperforms the state-of-the-art methods using only 2% of training tokens. In addition, we propose a simple yet effective ensemble strategy that aggregates the outputs of text-to-molecule models to further boost the generation performance. Code is available at https://github.com/Songhyeontae/CAMT5.git.
