Fugu-MT Paper Translation (Abstract): Molecular Representations for Large Language Models

Paper summary: Molecular Representations for Large Language Models

arxiv url: http://arxiv.org/abs/2605.01822v1
Date: Sun, 03 May 2026 11:08:46 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.950911
Title: Molecular Representations for Large Language Models
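Authors: Nicholas T. Runcie, Fergus Imrie, Charlotte M. Deane
Abstract summary: We introduce MolJSON, a new molecular representation for large language models, and compare it with five common chemical formats. We observed substantial variation across representations in the ability of LLMs to interpret and generate molecular graphs.
Reference score (independently calculated attention score): 11.054781534160242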
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are increasingly being used to support scientific discovery. In chemistry, tasks such as reaction prediction and structure elucidation require reasoning about the structures of molecules. As such, LLM-based systems for chemistry must interact reliably with molecular structures. Most previous studies of LLMs in chemistry have used SMILES strings or IUPAC names as molecular representations; however, the suitability of these formats has not been systematically assessed. In this work, we introduce MolJSON, a novel molecular representation for LLMs, and systematically compare it with five common chemical formats. We evaluated each representation with GPT-5-nano, GPT-5-mini, GPT-5, and Claude Haiku 4.5 using a set of 78,045 questions spanning translation, shortest path, and constrained generation reasoning tasks. We observed substantial variation across representations in the ability of LLMs to interpret and generate molecular graphs, with MolJSON consistently outperforming existing formats. On translation tasks, GPT-5 achieved 71.0% accuracy when converting IUPAC names to MolJSON, compared with 43.7% when converting the same inputs to SMILES. For constrained generation, GPT-5 reached 95.3% accuracy generating MolJSON, compared with 76.3% for IUPAC and 64.0% for SMILES. As an input format for shortest-path reasoning, GPT-5 successfully answered 98.5% of questions with MolJSON, compared with 92.2% for SMILES and 82.7% for IUPAC, whilst also using fewer reasoning tokens. We observed systematic errors associated with atom count and ring complexity for SMILES strings and IUPAC names, whereas MolJSON was more robust to these failure modes. Our results show that the choice of molecular representation has a material impact on LLM performance, and that explicit molecular graph schemas, such as MolJSON, are a promising direction for LLM-based systems in chemistry.
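Illustrative sketch: the abstract does not give the MolJSON schema, so the example below uses a hypothetical explicit-graph JSON layout to show what an "explicit molecular graph schema" and a shortest-path question could look like. The field names (`atoms`, `bonds`, `id`, `element`, `order`) and the helper functions are assumptions made for illustration, not the authors' format or method. The shortest path is taken here to mean the minimum number of bonds between two atoms, which is one plausible reading of the task.

```python
import json
from collections import deque

# Hypothetical explicit-graph encoding of ethanol (heavy atoms only).
# The field names are illustrative assumptions, NOT the MolJSON schema.
ETHANOL_JSON = """
{
  "atoms": [
    {"id": 0, "element": "C"},
    {"id": 1, "element": "C"},
    {"id": 2, "element": "O"}
  ],
  "bonds": [
    {"atoms": [0, 1], "order": 1},
    {"atoms": [1, 2], "order": 1}
  ]
}
"""


def build_adjacency(mol):
    """Map each atom id to the list of atom ids it is bonded to."""
    adjacency = {atom["id"]: [] for atom in mol["atoms"]}
    for bond in mol["bonds"]:
        a, b = bond["atoms"]
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


def shortest_path_length(mol, start, goal):
    """Return the number of bonds on a shortest path, or None if unreachable."""
    adjacency = build_adjacency(mol)
    distances = {start: 0}
    queue = deque([start])
    # Breadth-first search: the first time the goal is dequeued, its
    # recorded distance is the minimum bond count from the start atom.
    while queue:
        current = queue.popleft()
        if current == goal:
            return distances[current]
        for neighbour in adjacency[current]:
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    return None


mol = json.loads(ETHANOL_JSON)
print(shortest_path_length(mol, 0, 2))  # methyl carbon to oxygen: 2 bonds
```

With atoms and bonds listed explicitly, the graph can be read directly from the data. In a SMILES string such as `CCO`, connectivity is implied by character order, branch parentheses, and ring-closure digits.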
