Fugu-MT Paper Translation (Abstract): From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

Paper overview: From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

arxiv url: http://arxiv.org/abs/2606.03660v2
Date: Wed, 03 Jun 2026 14:05:21 GMT
Status: Translation complete
System update date: 2026-06-04 17:40:41.63729
Title: From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models
Authors: Hongyu Guo, Hao Li, He Cao, Gongbo Zhang, Li Yuan,
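Abstract summary: ChemCoTBench-V2 is a rule-verifiable diagnostic benchmark for evaluating structured, verifiable chemical reasoning traces. It spans molecular understanding, molecule editing, molecular optimization, and reaction prediction, with 5,620 evaluation samples across 18 reporting tasks.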
Reference score (independently computed attention level): 37.34302729762671
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing process-level evaluators are hard to scale because LLM judges and human step-level process annotation are costly, inconsistent, and vulnerable to hallucination. We introduce ChemCoTBench-V2, a rule-verifiable diagnostic benchmark for low-cost, auditable evaluation of structured, verifier-addressable chemical reasoning traces. It spans molecular understanding, molecule editing, molecular optimization, and reaction prediction, with 5,620 evaluation samples across 18 reporting tasks. Models must expose key intermediate steps in expert-designed templates, and those steps are checked with deterministic chemistry rules and, for closed-answer tasks, reference traces rather than another LLM judge. Open-ended molecular optimization is evaluated with oracle-verifiable state constraints rather than strict trace matching. The benchmark reports three separate signals: final-answer correctness, template adherence, and step-wise verifier correctness over expert-refined intermediate commitments. Experiments on frontier models reveal a persistent gap between final-answer success and structured-reasoning-state consistency: models often follow the requested format while failing chemical-step checks, or answer correctly with weak supporting reasoning. ChemCoTBench-V2 enables fine-grained model comparison and identifies the concrete step at which the trace first violates the verifier.
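The three reported signals (final-answer correctness, template adherence, and step-wise verifier correctness) can be illustrated with a toy sketch. This is not the benchmark's implementation: the task (degrees of unsaturation for a C/H-only molecular formula), the template field names, and the `evaluate` and `Report` names are all hypothetical and chosen only to show how deterministic rules might score a trace and locate its first failing step.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical template: the fields a trace is asked to expose, in order.
TEMPLATE_FIELDS = ["carbon_count", "hydrogen_count", "unsaturation", "answer"]


def parse_formula(formula: str) -> Dict[str, int]:
    """Parse a simple molecular formula such as 'C6H6' into element counts."""
    counts: Dict[str, int] = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


@dataclass
class Report:
    final_answer_correct: bool        # signal 1: final answer matches the reference
    template_adherent: bool           # signal 2: every template field is present
    step_accuracy: float              # signal 3: fraction of steps passing the rules
    first_failed_step: Optional[str]  # earliest step that violates a rule, if any


def evaluate(formula: str, trace: dict, reference_answer: int) -> Report:
    counts = parse_formula(formula)
    c, h = counts.get("C", 0), counts.get("H", 0)
    # Deterministic rules for each intermediate step (toy rules, C/H-only molecules).
    checks = {
        "carbon_count": lambda v: v == c,
        "hydrogen_count": lambda v: v == h,
        "unsaturation": lambda v: v == (2 * c + 2 - h) // 2,
    }
    passed = 0
    first_failed = None
    for name, check in checks.items():
        if name in trace and check(trace[name]):
            passed += 1
        elif first_failed is None:
            first_failed = name
    return Report(
        final_answer_correct=trace.get("answer") == reference_answer,
        template_adherent=all(field in trace for field in TEMPLATE_FIELDS),
        step_accuracy=passed / len(checks),
        first_failed_step=first_failed,
    )


# Benzene (C6H6): the trace reaches the right answer but miscounts hydrogens.
trace = {"carbon_count": 6, "hydrogen_count": 8, "unsaturation": 4, "answer": 4}
report = evaluate("C6H6", trace, reference_answer=4)
print(report)
```

In this toy case the trace follows the template and gives the correct final answer, yet one intermediate step fails its rule check. That is the kind of gap between final-answer success and reasoning-state consistency the abstract describes.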
