Fugu-MT Paper Translation (Abstract): SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents

Paper overview: SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents

arxiv url: http://arxiv.org/abs/2603.29139v1
Date: Tue, 31 Mar 2026 01:41:28 GMT
Status: Translation complete
Last updated in system: 2026-04-01 15:25:02.990742
Title: SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents
Title (reference translation): SciVisAgentBench: An Evaluation Benchmark for Scientific Data Analysis and Visualization Agents
Authors: Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Nathaniel Gorski, Jianxin Sun, Guoxi Liu, Helgi I. Ingolfsson, David Lenz, Hanqi Guo, Hongfeng Yu, Teja Leburu, Michael Molash, Bei Wang, Tom Peterka, Chaoli Wang, Shusen Liu
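Abstract summary: SciVisAgentBench is a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. The benchmark is grounded in a structured taxonomy spanning four dimensions: application domain, data type, complexity level, and visualization operation. It currently comprises 108 expert-crafted cases covering diverse SciVis scenarios.
Reference score (attention measure computed independently by this site): 12.966844873205048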
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large language models (LLMs) have enabled agentic systems that translate natural language intent into executable scientific visualization (SciVis) tasks. Despite rapid progress, the community lacks a principled and reproducible benchmark for evaluating these emerging SciVis agents in realistic, multi-step analysis settings. We present SciVisAgentBench, a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. Our benchmark is grounded in a structured taxonomy spanning four dimensions: application domain, data type, complexity level, and visualization operation. It currently comprises 108 expert-crafted cases covering diverse SciVis scenarios. To enable reliable assessment, we introduce a multimodal outcome-centric evaluation pipeline that combines LLM-based judging with deterministic evaluators, including image-based metrics, code checkers, rule-based verifiers, and case-specific evaluators. We also conduct a validity study with 12 SciVis experts to examine the agreement between human and LLM judges. Using this framework, we evaluate representative SciVis agents and general-purpose coding agents to establish initial baselines and reveal capability gaps. SciVisAgentBench is designed as a living benchmark to support systematic comparison, diagnose failure modes, and drive progress in agentic SciVis. The benchmark is available at https://scivisagentbench.github.io/.
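Abstract (reference translation): Recent advances in large language models (LLMs) have made possible agentic systems that convert natural-language intent into executable scientific visualization (SciVis) tasks. Despite rapid progress, the community lacks a principled, reproducible benchmark for evaluating these emerging SciVis agents in realistic, multi-step analysis settings. SciVisAgentBench is a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. The benchmark is based on a structured taxonomy spanning four dimensions: application domain, data type, complexity level, and visualization operation. It currently consists of 108 expert-crafted cases covering a variety of SciVis scenarios. To enable reliable assessment, we introduce a multimodal, outcome-centric evaluation pipeline that combines LLM-based judging with deterministic evaluators, including image-based metrics, code checkers, rule-based verifiers, and case-specific evaluators. We also conducted a validity study with 12 SciVis experts to examine agreement between human and LLM judges. Using this framework, we evaluate representative SciVis agents and general-purpose coding agents to establish initial baselines and reveal capability gaps. SciVisAgentBench is designed as a living benchmark to support systematic comparison, diagnose failure modes, and drive progress in agentic SciVis. The benchmark is available at https://scivisagentbench.github.io/.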
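
Illustrative sketch (not from the paper): the abstract describes cases tagged along four taxonomy dimensions and an evaluation pipeline that combines LLM-based judging with deterministic evaluators. The Python below shows one way such a combination could be wired together. Every name in it (`Case`, `rule_based_verifier`, `stub_llm_judge`, `combine_scores`), the example taxonomy values, and the weighted-mean aggregation are assumptions made for illustration; the benchmark's actual schema and scoring rules may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical names throughout; the benchmark's real schema and scoring may differ.


@dataclass(frozen=True)
class Case:
    """One benchmark case, tagged along the four taxonomy dimensions."""
    case_id: str
    application_domain: str
    data_type: str
    complexity_level: str
    visualization_operation: str


# An evaluator maps an agent's output (here a plain dict) to a score in [0, 1].
Evaluator = Callable[[dict], float]


def rule_based_verifier(output: dict) -> float:
    """Deterministic check: did the agent report a PNG image path?"""
    return 1.0 if output.get("image_path", "").endswith(".png") else 0.0


def stub_llm_judge(output: dict) -> float:
    """Stand-in for an LLM judge; simply reads a precomputed rubric score."""
    return float(output.get("judge_score", 0.0))


def combine_scores(output: dict,
                   evaluators: Dict[str, Evaluator],
                   weights: Dict[str, float]) -> float:
    """Weighted mean of evaluator scores (the weighting scheme is an assumption)."""
    total_weight = sum(weights[name] for name in evaluators)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * fn(output) for name, fn in evaluators.items())
    return weighted / total_weight


# Placeholder taxonomy values, chosen only to make the example concrete.
case = Case("demo-001", "medical imaging", "volume", "basic", "isosurface extraction")
agent_output = {"image_path": "render.png", "judge_score": 0.5}
score = combine_scores(
    agent_output,
    {"rules": rule_based_verifier, "llm": stub_llm_judge},
    {"rules": 1.0, "llm": 1.0},
)
print(case.case_id, score)
```

Keeping each evaluator behind the same callable signature means a deterministic checker and an LLM judge can be swapped or reweighted without touching the aggregation step.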


Related papers