Fugu-MT Paper Translation (Abstract): M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering

Paper abstract: M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering

arxiv url: http://arxiv.org/abs/2603.08369v1
Date: Mon, 09 Mar 2026 13:32:25 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:16.090808
Title: M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering
Title (reference translation): M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering
Authors: Peijin Xie, Zhen Xu, Bingquan Liu, Baoxun Wang
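Abstract summary: M3-ACE is a multi-agentic context engineering framework designed to rectify visual perception in multimodal math reasoning. The proposed method establishes a new state-of-the-art result of 89.1 on the MathVision benchmark and achieves consistent improvements on other related datasets.
Reference score (independently computed attention score): 10.491266031106774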
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Multimodal large language models have recently shown promising progress in visual mathematical reasoning. However, their performance is often limited by a critical yet underexplored bottleneck: inaccurate visual perception. Through systematic analysis, we find that most failures originate from incorrect or incomplete visual evidence extraction rather than from deficiencies in reasoning capability. Moreover, models tend to remain overly confident in their initial perceptions, making standard strategies such as prompt engineering, multi-round self-reflection, or posterior guidance insufficient to reliably correct errors. To address this limitation, we propose M3-ACE, a multi-agentic context engineering framework designed to rectify visual perception in multimodal math reasoning. Instead of directly aggregating final answers, our approach decouples perception and reasoning by dynamically maintaining a shared context centered on visual evidence lists. Multiple agents collaboratively contribute complementary observations, enabling the system to expose inconsistencies and recover missing perceptual information. To support stable multi-turn collaboration, we further introduce two lightweight tools: a Summary Tool that organizes evidence from different agents into consistent, complementary, and conflicting components, and a Refine Tool that filters unreliable samples and guides iterative correction. Extensive experiments demonstrate that M3-ACE substantially improves visual mathematical reasoning performance across multiple benchmarks. Our method establishes a new state-of-the-art result of 89.1 on the MathVision benchmark and achieves consistent improvements on other related datasets, including MathVista and MathVerse. These results highlight the importance of perception-centric multi-agent collaboration for advancing multimodal reasoning systems.
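Abstract (reference translation): Multimodal large language models have recently made promising progress in visual mathematical reasoning. Their performance, however, is often limited by a critical but underexplored bottleneck: inaccurate visual perception. A systematic analysis shows that most failures stem from incorrect or incomplete extraction of visual evidence, not from deficiencies in reasoning capability. In addition, models tend to stay overconfident in their initial perceptions, so standard strategies such as prompt engineering, multi-round self-reflection, or posterior guidance are insufficient to correct errors reliably. To address this limitation, the authors propose M3-ACE, a multi-agentic context engineering framework that aims to rectify visual perception in multimodal math reasoning. Rather than aggregating final answers directly, it decouples perception from reasoning by dynamically maintaining a shared context centered on visual evidence lists. Multiple agents contribute complementary observations, which lets the system expose inconsistencies and recover missing perceptual information. To support stable multi-turn collaboration, two lightweight tools are introduced: a Summary Tool, which organizes evidence from different agents into consistent, complementary, and conflicting components, and a Refine Tool, which filters unreliable samples and guides iterative correction. Extensive experiments show that M3-ACE substantially improves visual mathematical reasoning performance on multiple benchmarks. The method establishes a new state-of-the-art result of 89.1 on the MathVision benchmark and achieves consistent improvements on other related datasets, including MathVista and MathVerse. These results highlight the importance of perception-centric multi-agent collaboration for multimodal reasoning systems.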
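The abstract describes the Summary Tool only at a high level, so the following is a minimal, hypothetical Python sketch of what organizing evidence into consistent, complementary, and conflicting components could look like. Everything here is an assumption made for illustration, not the authors' implementation: the names `EvidenceSummary` and `summarize_evidence`, the representation of evidence as string key/value claims, and the grouping rule (a claim is "consistent" if two or more agents report the same value, "complementary" if only one agent reports it, and "conflicting" if agents report different values).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EvidenceSummary:
    """Hypothetical container for the three evidence groups named in the abstract."""
    consistent: dict[str, str] = field(default_factory=dict)
    complementary: dict[str, str] = field(default_factory=dict)
    # For conflicts, keep every agent's reported value so a later step can resolve them.
    conflicting: dict[str, dict[str, str]] = field(default_factory=dict)


def summarize_evidence(evidence_by_agent: dict[str, dict[str, str]]) -> EvidenceSummary:
    """Group per-agent visual evidence (assumed format: claim -> value) into three buckets."""
    # Invert the input: claim -> {agent name: reported value}.
    reports: dict[str, dict[str, str]] = defaultdict(dict)
    for agent, evidence in evidence_by_agent.items():
        for claim, value in evidence.items():
            reports[claim][agent] = value

    summary = EvidenceSummary()
    for claim, by_agent in reports.items():
        distinct_values = set(by_agent.values())
        if len(distinct_values) > 1:
            # Agents disagree on this claim.
            summary.conflicting[claim] = dict(by_agent)
        elif len(by_agent) > 1:
            # Two or more agents report the same value.
            summary.consistent[claim] = next(iter(distinct_values))
        else:
            # Only one agent observed this claim.
            summary.complementary[claim] = next(iter(distinct_values))
    return summary


# Usage with made-up observations from two hypothetical agents.
example = summarize_evidence({
    "agent_a": {"angle ABC": "90 degrees", "length AB": "5"},
    "agent_b": {"angle ABC": "90 degrees", "length AB": "6", "radius": "3"},
})
print(example.consistent)
print(example.complementary)
print(example.conflicting)
```

In this sketch, the conflicting bucket is what a later correction step would need to revisit, while the consistent and complementary buckets could be carried forward in the shared context. How the actual framework represents evidence or resolves conflicts is not specified in the abstract.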


Related papers