Fugu-MT Paper Translation (Abstract): Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

Paper overview: Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

arxiv url: http://arxiv.org/abs/2605.07251v1
Date: Fri, 08 May 2026 05:19:59 GMT
Status: Translation complete
System update date: 2026-05-11 19:43:38.812915
Title: Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
Title (reference translation): Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
Authors: Yuyang Wu, Yue Huang, Shuaike Shen, Xujian Wang, Shuhao Zhang, Qiyao Xue, Weichen Liu, Runtian Gao, Jian Ma, Xiangliang Zhang, Olexandr Isayev
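Abstract summary: Large language models (LLMs) are becoming increasingly capable as tool-using agents. ChemCost is a benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes. Experiments with frontier, open-weight, and chemistry-specialized LLM agents show that tool access is necessary but insufficient for solving the task.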
Reference score (independently computed attention score): 31.064444347894565
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have become increasingly capable as tool-using agents, with benchmarks spanning diverse general agentic tasks. Yet rigorous evaluation of scientific tool use remains limited. In chemistry, recent agents can plan syntheses and invoke domain-specific tools, but evaluations often rely on curated demonstrations, expert assessment, or LLM-as-judge scoring rather than exact, judge-free ground truth. We address this gap with chemical procurement cost estimation, a practical task in which an agent must ground chemical identities, retrieve supplier quotes, select valid purchasable packs, normalize quantities, and compute cost from a reaction description. We introduce ChemCost, a benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes, supporting scalar scoring and stage-level diagnosis of grounding, retrieval, procurement, and arithmetic failures. To evaluate robustness, we further construct controlled noise-injected views that perturb chemical aliases, quantity expressions, missing fields, and input formatting. Experiments with frontier, open-weight, and chemistry-specialized LLM agents show that tool access is necessary but insufficient for solving the task. The strongest agents reach only 50.6% accuracy within 25% relative error on clean inputs and degrade substantially with realistic noise. Stage-level analysis further shows that failures arise from brittle parsing, ineffective evidence integration, invalid pack selection, and non-convergent tool use.
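Abstract (reference translation): Large language models (LLMs) have become increasingly capable as tool-using agents, with benchmarks spanning diverse general agentic tasks. However, rigorous evaluation of scientific tool use remains limited. In chemistry, recent agents can plan syntheses and invoke domain-specific tools, but evaluations often rely on curated demonstrations, expert assessment, or LLM-as-judge scoring rather than exact, judge-free ground truth. We address this gap with chemical procurement cost estimation, in which an agent must identify chemicals, retrieve supplier quotes, select valid purchasable packs, normalize quantities, and compute the cost from a reaction description. We introduce ChemCost, a benchmark of 1,427 evaluable reactions based on a frozen pricing snapshot containing 2,261 chemicals and 230,775 supplier quotes, which supports scalar scoring and stage-level diagnosis of grounding, retrieval, procurement, and arithmetic failures. To evaluate robustness, we further construct controlled noise-injected views that perturb chemical aliases, quantity expressions, missing fields, and input formatting. Experiments with frontier, open-weight, and chemistry-specialized LLM agents show that tool access is necessary but insufficient for solving the task. The strongest agents reach only 50.6% accuracy within 25% relative error on clean inputs and degrade markedly under realistic noise. Stage-level analysis shows that failures arise from brittle parsing, ineffective evidence integration, invalid pack selection, and non-convergent tool use.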
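
The abstract describes two computations that a small sketch may make more concrete: pricing a reaction from purchasable packs, and scoring a prediction as correct when it falls within 25% relative error. The Python below is only an illustration under stated assumptions, not the benchmark's actual procedure. The `Quote` type, the function names, the toy catalog, and the rule of buying whole packs of a single pack size per chemical are all hypothetical; the abstract does not specify how ChemCost selects packs or handles units.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """Hypothetical supplier quote: one pack of `pack_grams` costs `price_usd`."""
    pack_grams: float
    price_usd: float


def cheapest_single_pack_cost(required_grams: float, quotes: list[Quote]) -> float:
    """Cost of covering `required_grams` with whole packs of one quote.

    Assumption (not from the paper): only one pack size is bought per
    chemical and partial packs cannot be purchased, so the pack count
    is rounded up.
    """
    if required_grams <= 0:
        return 0.0
    if not quotes:
        raise ValueError("no quotes available for this chemical")
    return min(
        math.ceil(required_grams / q.pack_grams) * q.price_usd for q in quotes
    )


def reaction_cost(requirements: dict[str, float],
                  catalog: dict[str, list[Quote]]) -> float:
    """Sum the per-chemical procurement cost over all inputs of a reaction."""
    return sum(
        cheapest_single_pack_cost(grams, catalog[name])
        for name, grams in requirements.items()
    )


def within_relative_error(predicted: float, truth: float, tol: float = 0.25) -> bool:
    """True if |predicted - truth| <= tol * |truth|."""
    return abs(predicted - truth) <= tol * abs(truth)


def accuracy_within(preds: list[float], truths: list[float], tol: float = 0.25) -> float:
    """Fraction of predictions that fall within the relative-error tolerance."""
    hits = sum(within_relative_error(p, t, tol) for p, t in zip(preds, truths))
    return hits / len(truths)


# Toy data, invented for illustration only.
catalog = {
    "reagent_a": [Quote(5.0, 20.0), Quote(25.0, 70.0)],
    "reagent_b": [Quote(1.0, 8.0)],
}
requirements = {"reagent_a": 12.0, "reagent_b": 2.5}

truth = reaction_cost(requirements, catalog)
print(truth)
print(accuracy_within([100.0, 110.0], [truth, truth]))
```

In this toy case, three 5 g packs of `reagent_a` are cheaper than one 25 g pack, which shows why pack selection is a separate step from arithmetic: the per-gram price alone does not determine the purchase cost.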
