Fugu-MT Paper Translation (Abstract): LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

Paper overview: LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

arxiv url: http://arxiv.org/abs/2605.26781v1
Date: Tue, 26 May 2026 09:50:35 GMT
Status: Translation complete
System update date: 2026-05-27 17:51:41.800604
Title: LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
Title (reference translation): LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
Authors: Xiaohan Wang, Mingze Yin, Yilin Zhao, Gang Liu, Dian Li,
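Abstract summary: We introduce LiveK12Bench, a dynamic, holistic, multi-disciplinary benchmark designed to evaluate the reasoning abilities of LMMs in realistic examination scenarios. LiveK12Bench comprises 2K+ verified questions spanning Mathematics, Physics, Chemistry, and Biology. The framework proposes 1) an automated pipeline that continuously ingests and parses the latest examination papers to mitigate data leakage, and 2) a novel "Mock Exam" evaluation scheme that assesses the ability to complete end-to-end exams autonomously with accurate and efficient reasoning paths.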
Reference score (independently computed attention score): 46.42524695322652
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contamination, and often confined to restricted modalities, disciplines, and evaluation criteria. To address these issues, we introduce LiveK12Bench, a dynamic, holistic, multi-disciplinary benchmark designed to evaluate the reasoning abilities of LMMs in realistic examination scenarios. LiveK12Bench comprises 2K+ verified questions spanning Mathematics, Physics, Chemistry, and Biology, sourced from the latest real-world exam papers and designed to grow over time. Our framework features several core innovations: 1) an automated pipeline that continuously ingests and parses the latest examination papers to mitigate data leakage; and 2) a novel "Mock Exam" evaluation scheme, which assesses the ability to complete end-to-end exams autonomously with accurate and efficient reasoning paths. Extensive experiments on 12 LMMs reveal that advanced models suffer substantial performance degradation under exam-realistic constraints: GPT-5's score drops from 79 to 53 (out of 100) when process rigor and efficiency are jointly evaluated. Our findings expose critical vulnerabilities, such as sensitivity to complex visual layouts, highlighting the gap between idealized reasoning capabilities and true educational readiness. Both code and dataset are publicly available.
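Abstract (reference translation): Advanced Large Multimodal Models (LMMs) perform well on K-12 reasoning tasks and show great promise as intelligent tutors. Realizing that promise requires models that can navigate real-world examinations effectively, but most existing benchmarks do not capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contamination, and often limited to restricted modalities, disciplines, and evaluation criteria. To address these issues, we introduce LiveK12Bench, a dynamic, holistic, multi-disciplinary benchmark designed to evaluate the reasoning abilities of LMMs in realistic examination scenarios. LiveK12Bench comprises 2K+ verified questions spanning Mathematics, Physics, Chemistry, and Biology. The framework has several core innovations: 1) an automated pipeline that continuously ingests and parses the latest examination papers to mitigate data leakage; and 2) a novel "Mock Exam" evaluation scheme that assesses the ability to complete end-to-end exams autonomously with accurate and efficient reasoning paths. Extensive experiments on 12 LMMs show that advanced models degrade substantially under exam-realistic constraints: GPT-5's score drops from 79 to 53 (out of 100) when process rigor and efficiency are jointly evaluated. The findings expose critical vulnerabilities, such as sensitivity to complex visual layouts, and highlight the gap between idealized reasoning capabilities and true educational readiness. Both code and dataset are publicly available.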
