Fugu-MT Paper Translation (Abstract): GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts

Paper overview: GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts

arxiv url: http://arxiv.org/abs/2509.25160v1
Date: Mon, 29 Sep 2025 17:57:05 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.145235
Title: GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts
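Title (reference translation): GSM8K-V: Solving Grade School Math Word Problems in Visual Contexts with Vision Language Models
Authors: Fan Yuan, Yuchen Yan, Yifan Jiang, Haoran Zhao, Tao Feng, Jinyan Chen, Yanwei Lou, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Abstract summary: GSM8K-V is a purely visual, multi-image mathematical reasoning benchmark. It is built by mapping samples from the widely used text-based GSM8K into visual form. We evaluate a wide range of open-source and closed-source models on GSM8K-V.
Reference score (independently computed attention measure): 59.508903852810796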
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision language models (VLMs) achieve unified modeling of images and text, enabling them to accomplish complex real-world tasks through perception, planning, and reasoning. Among these tasks, reasoning is particularly representative, with mathematical reasoning serving as a prominent example. It highlights the high-level capability of VLMs to comprehend mathematical information in images and to perform sophisticated reasoning. Recently, numerous visual mathematical reasoning benchmarks have been proposed, but they are often restricted to geometry, lack coverage of math word problems, and rarely assess reasoning across multiple images. To address these gaps, we introduce GSM8K-V, a purely visual multi-image mathematical reasoning benchmark. GSM8K-V is built by systematically mapping each sample from the widely used text-based GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Results show that although existing VLMs have nearly saturated performance on text-based GSM8K, there remains substantial room for improvement on GSM8K-V. For example, the best-performing model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the limitations of current models as well as potential directions for improvement. GSM8K-V offers a new perspective on visual mathematical reasoning and establishes a benchmark to guide the development of more robust and generalizable VLMs.
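Abstract (reference translation): Vision language models (VLMs) achieve unified modeling of images and text and can carry out complex real-world tasks through perception, planning, and reasoning. Among these tasks, reasoning is particularly representative, and mathematical reasoning is a prominent example. It highlights the high-level ability of VLMs to understand mathematical information in images and to perform sophisticated reasoning. Many visual mathematical reasoning benchmarks have been proposed recently, but they are often limited to geometry, lack coverage of math word problems, and rarely evaluate reasoning across multiple images. To address these gaps, GSM8K-V is presented as a purely visual, multi-image mathematical reasoning benchmark. GSM8K-V is built by mapping samples from the widely used text-based GSM8K into visual form. By combining a carefully designed image-generation pipeline with meticulous human annotation, we curate 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. The results show that although existing VLMs are nearly saturated on text-based GSM8K, room for improvement remains on GSM8K-V. For example, the best-performing model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the limitations of current models and potential directions for improvement. GSM8K-V offers a new perspective on visual mathematical reasoning and establishes a benchmark to guide the development of more robust and generalizable VLMs.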
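The gap reported in the abstract can be checked with simple arithmetic. The sketch below is illustrative only and is not taken from the paper: the helper names `gap_points` and `implied_correct` are hypothetical, and the conversion from accuracy to a count of correct answers assumes the GSM8K-V score was computed over all 1,319 samples, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract.
NUM_SAMPLES = 1319   # curated GSM8K-V samples
ACC_TEXT = 95.22     # Gemini-2.5-Pro accuracy on text-based GSM8K (%)
ACC_VISUAL = 46.93   # Gemini-2.5-Pro accuracy on GSM8K-V (%)


def gap_points(acc_a: float, acc_b: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return round(acc_a - acc_b, 2)


def implied_correct(accuracy_pct: float, n: int) -> int:
    """Nearest whole number of correct answers implied by an accuracy.

    Assumption: the accuracy was measured over exactly n items.
    """
    return round(accuracy_pct / 100 * n)


gap = gap_points(ACC_TEXT, ACC_VISUAL)
# Drop relative to the text-based score, as a percentage of that score.
relative_drop = round(100 * gap / ACC_TEXT, 1)

print(f"absolute gap: {gap} percentage points")
print(f"relative drop: {relative_drop}% of the text-based accuracy")
print(f"implied correct on GSM8K-V: {implied_correct(ACC_VISUAL, NUM_SAMPLES)} of {NUM_SAMPLES}")
```

Under these assumptions, the best-performing model loses roughly half of its text-based accuracy when the same problems are presented as images.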

Related papers