Fugu-MT Paper Translation (Abstract): MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models

Paper abstract: MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2508.06009v1
Date: Fri, 08 Aug 2025 04:39:16 GMT
Status: Translation complete
System update date: 2025-08-11 20:39:06.077428
Title: MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models
Title (reference translation): MathReal: We Keep It Real! A Real-Scene Benchmark for Evaluating Mathematical Reasoning in Multimodal Large Language Models
Authors: Jun Feng, Zixin Wang, Zhentao Zhang, Yue Guo, Zhihan Zhou, Xiuyi Chen, Zhenyang Li, Dawei Yin
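Abstract summary: MathReal is a dataset of 2,000 mathematical questions with images captured by handheld mobile devices in real scenarios. MathReal spans five core knowledge and ability categories, includes three question types, and is divided into three difficulty levels. It evaluates the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios.
Reference score (independently calculated attention score): 29.7077721906364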
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs, without incorporating the images provided by real-world Kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is an image, containing the question text and visual element. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further delineated into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, which encompass three question types and are divided into three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experimentation, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on this, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities, and outlining directions for future improvements. Data and code: https://github.com/junfeng0288/MathReal.
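Abstract (reference translation): Multimodal Large Language Models (MLLMs) have shown remarkable ability in visual mathematical reasoning on a variety of existing benchmarks. However, these benchmarks are based mainly on clean or processed multimodal inputs and do not incorporate images provided by real-world kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a carefully curated dataset of 2,000 mathematical questions with images captured by handheld mobile devices in real scenarios. Each question is an image containing the question text and visual elements. We classify the real images into three categories (image quality degradation, perspective variation, and irrelevant content interference), which are further divided into 14 subcategories. In addition, MathReal spans five core knowledge and ability categories, includes three question types, and is divided into three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that allow their performance to be analyzed systematically. Through extensive experiments, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on this, we thoroughly analyze their performance and error patterns, provide insights into their recognition, comprehension, and reasoning capabilities, and outline directions for future improvement. Data and code: https://github.com/junfeng0288/MathReal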

Related papers