Fugu-MT Paper Translation (Abstract): VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Paper abstract: VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

arxiv url: http://arxiv.org/abs/2603.15030v1
Date: Mon, 16 Mar 2026 09:31:44 GMT
Status: Translation complete
Last updated in system: 2026-03-17 18:28:57.985764
Title: VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining
Title (reference translation): VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining
Authors: Xuanyu Zhu, Yuhao Dong, Rundong Wang, Yang Shi, Zhipeng Wu, Yinlun Peng, YiFan Zhang, Yihang Lou, Yuanxing Zhang, Ziwei Liu, Yan Bai, Yuan Zhou
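Abstract summary: We introduce VisualToolChain-Bench (VTC-Bench) to evaluate tool-use proficiency in Multimodal Large Language Models (MLLMs). To align with realistic computer vision pipelines, the framework features 32 diverse OpenCV-based visual operations. For precise evaluation, it provides 680 curated problems structured across a nine-category cognitive hierarchy.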
Reference score (independently computed attention score): 49.78591189918702
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to utilizing external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remains a persistent bottleneck. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, falling short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with ground-truth execution trajectories. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models' visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and generalize to unseen operations, with the leading model Gemini-3.0-Pro only achieving 51% on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.
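Abstract (reference translation): In recent years, Multimodal Large Language Models (MLLMs) have been extended beyond standard visual question answering by using external tools for advanced visual tasks. Despite this progress, accurately executing and effectively composing diverse tools for complex tasks remains a persistent bottleneck. Existing benchmarks, limited by sparse tool-sets and simple tool-use trajectories, cannot capture complex and diverse tool interactions and fall short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate the tool-use proficiency of MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with a ground-truth execution trajectory. Extensive experiments on 19 leading MLLMs reveal critical limitations in the visual agentic capabilities of current models. In particular, models struggle to adapt to diverse tool-sets and to generalize to unseen operations, with the leading model, Gemini-3.0-Pro, achieving only 51% on the benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.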

Related papers