Fugu-MT Paper Translation (Abstract): Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems

Paper overview: Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems

arxiv url: http://arxiv.org/abs/2509.15839v1
Date: Fri, 19 Sep 2025 10:18:48 GMT
Status: Translation complete
System update date: 2025-09-22 18:18:11.124695
Title: Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems
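Title (reference translation): A Comprehensive Benchmark for Multimodal LLMs on Chinese Multi-Subject Physics Problems
Authors: Zhongze Luo, Zhenshuai Yin, Yongxin Guo, Zhichao Wang, Jionghao Zhu, Xiaoying Tang
Abstract summary: We introduce Multi-Physics, a comprehensive benchmark for Chinese physics reasoning that includes 5 difficulty levels. We use a dual evaluation framework to evaluate 20 different MLLMs, analyzing both final-answer accuracy and step-by-step integrity.
Reference score (independently computed attention score): 15.023749693065406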
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While multimodal LLMs (MLLMs) demonstrate remarkable reasoning progress, their application in specialized scientific domains like physics reveals significant gaps in current evaluation benchmarks. Specifically, existing benchmarks often lack fine-grained subject coverage, neglect the step-by-step reasoning process, and are predominantly English-centric, failing to systematically evaluate the role of visual information. Therefore, we introduce Multi-Physics for Chinese physics reasoning, a comprehensive benchmark that includes 5 difficulty levels, featuring 1,412 image-associated, multiple-choice questions spanning 11 high-school physics subjects. We employ a dual evaluation framework to evaluate 20 different MLLMs, analyzing both final answer accuracy and the step-by-step integrity of their chain-of-thought. Furthermore, we systematically study the impact of difficulty level and visual information by comparing the model performance before and after changing the input mode. Our work provides not only a fine-grained resource for the community but also offers a robust methodology for dissecting the multimodal reasoning process of state-of-the-art MLLMs, and our dataset and code have been open-sourced: https://github.com/luozhongze/Multi-Physics.
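Abstract (reference translation): While multimodal LLMs (MLLMs) show remarkable progress in reasoning, applying them to specialized scientific domains such as physics exposes significant gaps in current evaluation benchmarks. Specifically, existing benchmarks often lack fine-grained subject coverage, neglect the step-by-step reasoning process, and are predominantly English-centric, so they fail to systematically evaluate the role of visual information. We therefore introduce Multi-Physics for Chinese physics reasoning, a comprehensive benchmark with 5 difficulty levels and 1,412 image-associated multiple-choice questions spanning 11 high-school physics subjects. We use a dual evaluation framework to evaluate 20 different MLLMs, analyzing both final-answer accuracy and step-by-step integrity. We also systematically study the effect of difficulty level and visual information by comparing model performance before and after changing the input mode. Our work provides the community with a fine-grained resource and a robust methodology for dissecting the multimodal reasoning process of state-of-the-art MLLMs.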
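
The comparison described in the abstract (final-answer accuracy per difficulty level, before and after changing the input mode) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record fields `difficulty`, `answer`, and `predicted`, the helper names, and the toy data are all hypothetical, and the step-by-step integrity analysis is not covered here.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction of correct answers} for records grouped by `key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        if record["predicted"] == record["answer"]:
            correct[record[key]] += 1
    return {group: correct[group] / totals[group] for group in totals}


def accuracy_gap(with_image, text_only, key="difficulty"):
    """Per-group accuracy difference: image+text input minus text-only input."""
    acc_image = accuracy_by_group(with_image, key)
    acc_text = accuracy_by_group(text_only, key)
    return {g: acc_image[g] - acc_text[g] for g in acc_image if g in acc_text}


# Toy records, invented purely for illustration (hypothetical schema).
with_image = [
    {"difficulty": 1, "answer": "A", "predicted": "A"},
    {"difficulty": 1, "answer": "B", "predicted": "B"},
    {"difficulty": 5, "answer": "C", "predicted": "C"},
    {"difficulty": 5, "answer": "D", "predicted": "A"},
]
text_only = [
    {"difficulty": 1, "answer": "A", "predicted": "A"},
    {"difficulty": 1, "answer": "B", "predicted": "C"},
    {"difficulty": 5, "answer": "C", "predicted": "B"},
    {"difficulty": 5, "answer": "D", "predicted": "A"},
]

gap = accuracy_gap(with_image, text_only)
print(gap)
```

A positive value for a difficulty level means the model answered more questions correctly when the image was supplied; the toy numbers above say nothing about the paper's actual results.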
