Fugu-MT Paper Translation (Abstract): MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

Paper summary: MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

arxiv url: http://arxiv.org/abs/2510.14958v1
Date: Thu, 16 Oct 2025 17:58:58 GMT
Status: Translation complete
Last updated in system: 2025-10-17 21:15:14.994811
Title: MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning
Title (reference translation): MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning
Authors: Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang, Aojun Zhou, Changyao Tian, Xinyu Fu, Yuxuan Hu, Zimu Lu, Linjiang Huang, Si Liu, Rui Liu, Hongsheng Li
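Abstract summary: This paper proposes a comprehensive framework for endowing large multimodal models with intrinsic VCoT capabilities for mathematics. The model, BAGEL-Canvas, achieves an 86% relative improvement over strong LMM baselines. The work provides a complete toolkit (framework, datasets, and benchmark) for unlocking complex, human-like visual-aided reasoning in LMMs.
Reference score (independently computed attention score): 58.776297011268845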
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit (framework, datasets, and benchmark) to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: https://mathcanvas.github.io/
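Abstract (reference translation): Large Language Models (LLMs) excel at textual reasoning, but they struggle with mathematical domains such as geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems. BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench and shows excellent generalization to other public math benchmarks. Our work provides a complete toolkit (framework, datasets, and benchmark) for unlocking complex, human-like visual-aided reasoning in LMMs. Project Page: https://mathcanvas.github.io/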


Related papers