Fugu-MT Paper Translation (Abstract): SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

Paper summary: SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

arxiv url: http://arxiv.org/abs/2605.18630v1
Date: Mon, 18 May 2026 16:34:45 GMT
Status: Translation complete
System update date: 2026-05-19 17:57:50.094221
Title: SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
Title (reference translation): SCICONVBENCH: A Benchmark of LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
Authors: Nithin Somasekharan, Youssef Hassan, Shiyao Lin, Gihan Panapitiya, Patrick Emami, Anurag Acharya, Sameera Horawalavithana, Shaowu Pan
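Abstract summary: Large language models (LLMs) are increasingly deployed as scientific AI assistants. This paper introduces SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation. The benchmark pairs a structured task ontology with a rubric-based evaluation framework.
Reference score (independently computed attention score): 3.9311288356229057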
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.
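Abstract (reference translation): Large language models (LLMs) are increasingly deployed as scientific AI assistants. However, existing evaluations assume that the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. SCICONVBENCH is a benchmark for multi-turn clarification in scientific task formulation across four domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests that contain internally contradictory information (inconsistency resolution). The benchmark combines a structured task ontology with a rubric-based evaluation framework to measure LLM performance systematically across three dimensions. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. In addition, frontier LLMs frequently make silent assumptions and implicit specification repairs. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data are available at https://github.com/csml-rpi/SciConvBench.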
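
The rubric-based evaluation described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `RubricItem` type, the yes/no scoring, and the rule that a case counts as resolved only when every rubric check passes are hypothetical and are not taken from the paper or its repository. Only the three dimension names come from the abstract. The toy data is invented and does not reproduce any reported result.

```python
from dataclasses import dataclass

# The three evaluation dimensions named in the abstract.
DIMENSIONS = (
    "clarification_behavior",
    "conversational_grounding",
    "final_specification_fidelity",
)


@dataclass(frozen=True)
class RubricItem:
    """One hypothetical yes/no rubric check for a single conversation."""
    dimension: str
    satisfied: bool


def dimension_scores(items):
    """Fraction of satisfied rubric items per dimension (None if no items)."""
    scores = {}
    for dim in DIMENSIONS:
        checks = [item.satisfied for item in items if item.dimension == dim]
        scores[dim] = sum(checks) / len(checks) if checks else None
    return scores


def is_resolved(items):
    """Assumed rule: a case is resolved only if every rubric check passes."""
    return len(items) > 0 and all(item.satisfied for item in items)


def resolution_rate(cases):
    """Percentage of cases resolved under the assumed rule above."""
    if not cases:
        return 0.0
    return 100.0 * sum(is_resolved(case) for case in cases) / len(cases)


# Invented toy data: one fully satisfied case, one with an ungrounded repair.
case_a = [RubricItem(dim, True) for dim in DIMENSIONS]
case_b = [
    RubricItem("clarification_behavior", True),
    RubricItem("conversational_grounding", False),
    RubricItem("final_specification_fidelity", True),
]

if __name__ == "__main__":
    print(dimension_scores(case_b))
    print(resolution_rate([case_a, case_b]))
```

Separating per-dimension scores from the overall resolved/unresolved decision reflects the abstract's point that a final specification can look correct while relying on silent assumptions that were never grounded in the conversation.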

Related papers