Fugu-MT Paper Translation (Abstract): ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

Paper summary: ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

arxiv url: http://arxiv.org/abs/2510.04514v1
Date: Mon, 06 Oct 2025 06:05:36 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.701876
Title: ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering
Title (reference translation): ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering
Authors: Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso
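Abstract summary: This paper introduces ChartAgent, a novel agentic framework that performs visual reasoning directly within the chart's spatial domain. The work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
Reference score (independently computed attention score): 23.455587605758396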
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, those requiring precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
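Abstract (reference translation): Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance drops markedly on unannotated charts, which require precise visual interpretation rather than reliance on textual shortcuts. We therefore introduce ChartAgent, a novel agentic framework that performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute overall and 17.31% on unannotated, numerically intensive queries. The analyses further show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying levels of visual and reasoning complexity, and (c) serves as a plug-and-play framework that improves performance across diverse underlying LLMs. The work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.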
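
The abstract names axis localization and bar isolation as visual actions for reading values from unannotated charts. As a rough illustration of why those two steps are enough to recover a number, the sketch below maps a pixel row to a data value by linear interpolation between two localized axis ticks. It is not ChartAgent's tool library: `AxisTick`, `pixel_to_value`, and the sample coordinates are hypothetical names and values chosen for this example, and a linear (non-logarithmic) axis is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisTick:
    """A localized y-axis tick (hypothetical structure for illustration)."""
    pixel_y: float  # vertical pixel position of the tick in the chart image
    value: float    # numeric label read at that tick


def pixel_to_value(pixel_y: float, tick_a: AxisTick, tick_b: AxisTick) -> float:
    """Interpolate a data value for a pixel row, assuming a linear axis."""
    pixel_span = tick_b.pixel_y - tick_a.pixel_y
    if pixel_span == 0:
        raise ValueError("ticks must sit at different pixel rows")
    value_span = tick_b.value - tick_a.value
    return tick_a.value + (pixel_y - tick_a.pixel_y) * value_span / pixel_span


# Assumed sample coordinates: image rows grow downward, so larger
# values on the axis sit at smaller pixel_y.
zero_tick = AxisTick(pixel_y=400.0, value=0.0)
hundred_tick = AxisTick(pixel_y=100.0, value=100.0)

# Top edge of an isolated bar, e.g. taken from a cropped bar region.
bar_top_y = 250.0
estimated_value = pixel_to_value(bar_top_y, zero_tick, hundred_tick)
print(estimated_value)
```

With these assumed coordinates the bar top lies halfway between the two ticks, so the estimate falls midway between their labels. In practice the tick positions and the bar edge would come from vision tools, and their pixel error carries straight through to the estimated value.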

Related papers