Fugu-MT Paper Translation (Abstract): Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning

Paper summary: Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning

arxiv url: http://arxiv.org/abs/2506.09736v1
Date: Wed, 11 Jun 2025 13:39:46 GMT
Status: Translation complete
Last updated in system: 2025-06-13 06:35:03.015455
Title: Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning
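Title (reference translation): Vision Matters: Simple visual perturbations boost multimodal mathematical reasoning
Authors: Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Linghe Kong, Lichao Sun, Weiran Huang
Abstract summary: We show that language-only models, when given image captions, can match or exceed the performance of MLLMs that consume raw visual inputs. We therefore propose a simple visual perturbation framework that improves perceptual robustness without requiring algorithmic modifications. This work highlights the importance of visual perturbation in multimodal mathematical reasoning.
Reference score (independently computed attention score): 20.632248864242968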
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we interestingly find that language-only models, when provided with image captions, can achieve comparable or even better performance than MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate accurate visual descriptions but fail to effectively integrate them during reasoning. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications or additional training data. Our approach introduces three targeted perturbations: distractor concatenation, dominance-preserving mixup, and random rotation, that can be easily integrated into existing post-training pipelines including SFT, DPO, and GRPO. Through extensive experiments across multiple datasets, we demonstrate consistent improvements in mathematical reasoning performance, with gains comparable to those achieved through algorithmic changes. Additionally, we achieve competitive performance among open-source 7B RL-tuned models by training Qwen2.5-VL-7B with visual perturbation. Through comprehensive ablation studies, we analyze the effectiveness of different perturbation strategies, revealing that each perturbation type contributes uniquely to different aspects of visual reasoning. Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning: better reasoning begins with better seeing. Our code is available at https://github.com/YutingLi0606/Vision-Matters.
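Abstract (reference translation): Despite the rapid progress of multimodal large language models (MLLMs), the importance of visual processing has been largely overlooked. In a simple but revealing experiment, language-only models given image captions achieve performance comparable to or better than MLLMs that consume raw visual inputs. This suggests that current MLLMs can generate accurate visual descriptions but fail to integrate them effectively during reasoning. We therefore propose a simple visual perturbation framework that improves perceptual robustness without algorithmic changes or additional training data. The method introduces three targeted perturbations (distractor concatenation, dominance-preserving mixup, and random rotation) that integrate easily into existing post-training pipelines such as SFT, DPO, and GRPO. Extensive experiments across multiple datasets show consistent improvements in mathematical reasoning performance, with gains comparable to those achieved through algorithmic changes. Training Qwen2.5-VL-7B with visual perturbation also yields competitive performance among open-source 7B RL-tuned models. Comprehensive ablation studies analyze the effectiveness of the different perturbation strategies and show that each perturbation type contributes uniquely to a different aspect of visual reasoning. The findings highlight the critical role of visual perturbation in multimodal mathematical reasoning. The code is available at https://github.com/YutingLi0606/Vision-Matters.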
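
The three perturbations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). It assumes grayscale images stored as nested Python lists, and the function names, the mixing-weight range of 0.7 to 0.9, and the restriction of rotation to multiples of 90 degrees are all hypothetical choices made only to keep the example dependency-free.

```python
import random

# Assumption: a grayscale image is a list of rows of pixel intensities in [0, 1].
Image = list[list[float]]


def distractor_concat(main: Image, distractor: Image) -> Image:
    """Place an unrelated distractor image to the right of the main image."""
    if len(main) != len(distractor):
        raise ValueError("images must have the same height")
    return [m_row + d_row for m_row, d_row in zip(main, distractor)]


def dominance_preserving_mixup(
    main: Image,
    distractor: Image,
    rng: random.Random,
    lam_range: tuple[float, float] = (0.7, 0.9),  # hypothetical range
) -> Image:
    """Blend two same-sized images so the main image keeps the larger weight."""
    low, high = lam_range
    if low <= 0.5:
        raise ValueError("the main image must keep a weight above 0.5")
    if len(main) != len(distractor) or len(main[0]) != len(distractor[0]):
        raise ValueError("images must have the same shape")
    lam = rng.uniform(low, high)
    return [
        [lam * m + (1.0 - lam) * d for m, d in zip(m_row, d_row)]
        for m_row, d_row in zip(main, distractor)
    ]


def random_rotation(main: Image, rng: random.Random) -> Image:
    """Rotate clockwise by a random multiple of 90 degrees (a simplification)."""
    quarter_turns = rng.randrange(4)
    out = [list(row) for row in main]
    for _ in range(quarter_turns):
        # Reversing the row order and transposing rotates 90 degrees clockwise.
        out = [list(row) for row in zip(*out[::-1])]
    return out
```

In a post-training pipeline such as SFT, DPO, or GRPO, a transform like one of these would be applied to each input image before it reaches the model, while the question and the target answer stay unchanged.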

Related papers