Fugu-MT Paper Translation (Abstract): MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI

Paper summary: MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI

arxiv url: http://arxiv.org/abs/2506.23563v1
Date: Mon, 30 Jun 2025 07:14:38 GMT
Status: Translation complete
Last updated in system: 2025-07-01 21:27:53.949875
Title: MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
Title (reference translation): MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
Authors: Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael K. Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, Wenhao Wu, Dacheng Tao
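Abstract summary: Reasoning plays a key role in advancing Multimodal Large Language Models (MLLMs). Existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities. We introduce MMReason, a new benchmark that precisely and comprehensively evaluates MLLM long-chain reasoning capability.
Reference score (attention score computed by this site): 59.196131618912005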
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluations. Third, we annotate the questions with detailed step-by-step solutions, and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research. Code will be available at https://github.com/HJYao00/MMReason.
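Abstract (reference translation): Reasoning plays an important role in advancing MLLMs (Multimodal Large Language Models) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities in three key respects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, and (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark that precisely and comprehensively evaluates MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (6 disciplines) and multiple difficulty levels (from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluation. Third, we annotate the questions with detailed step-by-step solutions and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular MLLMs and analyze their reasoning capabilities in depth. We hope MMReason will serve as a useful resource for advancing MLLM reasoning research. Code will be available at https://github.com/HJYao00/MMReason.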
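
The abstract names two mechanisms, multi-model voting to filter shortcut cases and a reference-based ternary score for intermediate steps, but does not give their rules. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: it assumes that a question counts as a shortcut case when at least `min_votes` models reach the reference answer without seeing the image, and that each reference step receives one of three labels worth 0, 0.5, or 1. The function names, the `min_votes` threshold, and the label values are all hypothetical.

```python
from typing import Sequence

# Hypothetical ternary labels; the abstract does not state the actual score values.
TERNARY_SCORES = {"incorrect": 0.0, "partial": 0.5, "correct": 1.0}


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(answer.lower().split())


def is_shortcut_case(blind_answers: Sequence[str], reference: str, min_votes: int = 2) -> bool:
    """Return True if at least `min_votes` answers match the reference answer.

    Assumption: `blind_answers` come from models queried without the image,
    so a match suggests guessing or memorization rather than reasoning.
    """
    votes = sum(normalize(a) == normalize(reference) for a in blind_answers)
    return votes >= min_votes


def score_steps(step_labels: Sequence[str]) -> float:
    """Average the ternary labels given to each reference step (0.0 if there are none)."""
    if not step_labels:
        return 0.0
    return sum(TERNARY_SCORES[label] for label in step_labels) / len(step_labels)


if __name__ == "__main__":
    # Two of three blind answers match the reference, so the question is flagged.
    print(is_shortcut_case(["12 cm", "7", "12 CM"], "12 cm"))
    # Four reference steps: two correct, one partial, one incorrect.
    print(score_steps(["correct", "partial", "incorrect", "correct"]))
```

In this sketch, flagged questions would be dropped before evaluation, and the step-level average would be reported alongside final-answer accuracy. In practice the labels would come from a judge comparing a model's reasoning with the annotated step-by-step solution; that judging step is omitted here.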

Related papers

MMLU-Reason: Benchmarking Multi-Task Multi-modal Language Understanding and Reasoning [40.55833679660528]
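We introduce MMLU-Reason, a new benchmark that rigorously evaluates multimodal reasoning with explicit thinking. MMLU-Reason includes a dataset of 1,083 questions spanning six distinct reasoning types with symbolic depth and multi-hop demands. Overall, MMLU-Reason provides a scalable foundation for evaluating, comparing, and improving next-generation multimodal reasoning systems.
Paper reference translation (metadata) (2025-05-22T09:41:55Z)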
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models [50.43793764203352]
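We introduce MDK12-Bench, a multi-discipline benchmark that evaluates the reasoning capabilities of MLLMs through real-world K-12 examinations. The benchmark comprises 140K reasoning instances spanning diverse difficulty levels from primary school to 12th grade. It features 6,827 instance-level knowledge point annotations based on a well-organized knowledge structure, detailed answer explanations, difficulty labels, and year-based partitions.
Paper reference translation (metadata) (2025-04-08T08:06:53Z)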
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [103.0226977561914]
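We propose a comprehensive framework for promoting step-by-step visual reasoning in large language models. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps. Third, we present LlamaV-o1, a new multimodal visual reasoning model trained with a multi-step curriculum learning approach.
Paper reference translation (metadata) (2025-01-10T18:59:51Z)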
A Survey on Benchmarks of Multimodal Large Language Models [65.87641718350639]
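This paper gives an overview of benchmarks and evaluations of Multimodal Large Language Models (MLLMs). The survey focuses on (1) perception and understanding, (2) cognition and reasoning, (3) specific domains, (4) key capabilities, and (5) other modalities. Our key argument is that evaluation should be regarded as a crucial discipline for better supporting the development of MLLMs.
Paper reference translation (metadata) (2024-08-16T09:52:02Z)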
Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models [46.26140720993383]
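Multi-LogiEval is a comprehensive evaluation dataset covering multi-step logical reasoning with various inference rules and depths. We evaluate a range of large language models, including GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, and Mistral.
Paper reference translation (metadata) (2024-06-24T23:02:56Z)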
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
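Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. This paper presents MR-Ben, a process-based benchmark that demands meta-reasoning skill. The meta-reasoning paradigm is especially suited to System-2 slow thinking.
Paper reference translation (metadata) (2024-06-20T03:50:23Z)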
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning [16.032320995230734]
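CMMU is a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese. CMMU consists of 3,603 questions in 7 subjects, covering knowledge from primary to high school. We propose an evaluation strategy called Positional Error Variance for multiple-choice questions.
Paper reference translation (metadata) (2024-01-25T08:22:10Z)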
InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
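Multi-modal Large Language Models (MLLMs) are attracting attention in the field of artificial intelligence. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. We evaluate a selection of representative MLLMs using this rigorously developed open-ended, multi-step, elaborate reasoning benchmark.
Paper reference translation (metadata) (2023-11-20T07:06:31Z)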

The related papers list is generated automatically from the titles and abstracts of papers on this site.
