Fugu-MT Paper Translation (Abstract): JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

Paper overview: JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

arxiv url: http://arxiv.org/abs/2512.14620v1
Date: Tue, 16 Dec 2025 17:33:00 GMT
Status: Translation complete
Last updated in system: 2025-12-17 16:49:26.81668
Title: JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction
Title (reference translation): JMMMU-Pro: An Image-based Japanese Multi-discipline Multimodal Understanding Benchmark Built via Vibe Benchmark Construction
Authors: Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa
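Abstract summary: This paper introduces JMMMU-Pro and proposes Vibe Benchmark Construction, a method in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions. The method builds a high-quality benchmark at low cost, covering a wide range of background and layout designs.
Reference score (independently computed attention score): 31.189322858209948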
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.
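Abstract (reference translation): This paper introduces JMMMU-Pro and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, the authors propose Vibe Benchmark Construction, in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and regenerate with adjusted prompts when necessary. Using Nano Banana Pro's highly realistic image generation and its ability to embed clean Japanese text, they build a high-quality benchmark at low cost that covers a wide range of backgrounds and layouts. Experiments show that all open-source LMMs struggle substantially with JMMMU-Pro, which positions it as an important benchmark for guiding future work in the open-source community. The authors believe JMMMU-Pro is a more rigorous tool for evaluating the Japanese capabilities of LMMs, and that Vibe Benchmark Construction also offers an efficient guideline for developing image-based VQA benchmarks.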
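The generate, verify, and regenerate loop described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `Candidate`, `build_item`, `generate`, `review`, and `max_rounds` are hypothetical, `generate` stands in for a call to an image generative model, and `review` stands in for a human check that returns either acceptance or an adjusted prompt. The abstract does not specify a retry limit; the cap here is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    """One candidate visual question (hypothetical structure)."""
    question_id: str
    prompt: str
    image: bytes  # question image and question text composed into one image


def build_item(
    question_id: str,
    prompt: str,
    generate: Callable[[str], bytes],
    review: Callable[[Candidate], Optional[str]],
    max_rounds: int = 3,  # assumed cap; not stated in the abstract
) -> Optional[Candidate]:
    """Generate a candidate, have it reviewed, and regenerate if rejected.

    `review` returns None to accept the candidate, or an adjusted prompt
    string to request another generation round.
    """
    for _ in range(max_rounds):
        candidate = Candidate(question_id, prompt, generate(prompt))
        adjusted_prompt = review(candidate)
        if adjusted_prompt is None:
            return candidate  # reviewer accepted this candidate
        prompt = adjusted_prompt  # retry with the reviewer's adjusted prompt
    return None  # no acceptable candidate within the round limit
```

Passing the generator and reviewer as callables keeps the loop independent of any particular image model or review interface, so either can be replaced with a stub when testing.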


Related papers