Fugu-MT Paper Translation (Abstract): Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs

Paper abstract: Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs

arxiv url: http://arxiv.org/abs/2603.20209v3
Date: Wed, 01 Apr 2026 09:28:38 GMT
Status: Translation complete
Last updated in system: 2026-04-06 02:36:12.886666
Title: Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs
Title (reference translation): Children's Intelligence Tests for MLLMs: KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs
Authors: Hengwei Ye, Yuanting Guan, Yuxuan Ge, Tianying Zhu, Zhenhan Guan, Yijia Zhong, Yijing Zhang, Han Zhang, Yingna Wu, Zheng Tian
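Abstract summary: Multimodal Large Language Models (MLLMs) are language models that combine the linguistic strengths of LLMs with the ability to process multimodal data. We introduce KidGym, a comprehensive 2D grid-based benchmark for evaluating five essential capabilities of MLLMs.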
Reference score (independently computed attention score): 7.299886183446607
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, enabling them to address a broader range of visual tasks. Because MLLMs aim at more general, human-like competence than language-only models, we take inspiration from the Wechsler Intelligence Scales - an established battery for evaluating children by decomposing intelligence into interpretable, testable abilities. We introduce KidGym, a comprehensive 2D grid-based benchmark for assessing five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory and Planning. The benchmark comprises 12 unique tasks, each targeting at least one core capability, specifically designed to gauge MLLMs' adaptability and developmental potential, mirroring the stages of children's cognitive growth. Additionally, our tasks encompass diverse scenarios and objects with randomly generated layouts, ensuring a more accurate and robust evaluation of MLLM capabilities. KidGym is designed to be fully user-customizable and extensible, allowing researchers to create new evaluation scenarios and adjust difficulty levels to accommodate the rapidly growing MLLM community. Through the evaluation of state-of-the-art MLLMs using KidGym, we identified significant insights into model capabilities and revealed several limitations of current models. We release our benchmark at: https://bobo-ye.github.io/KidGym/.
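Abstract (reference translation): Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, which lets them handle a wider range of visual tasks. Because MLLMs aim for more general, human-like competence than language-only models, we draw inspiration from the Wechsler Intelligence Scales. We introduce KidGym, a comprehensive 2D grid-based benchmark for evaluating five essential capabilities of MLLMs (Execution, Perception Reasoning, Learning, Memory, and Planning). The benchmark contains 12 unique tasks, each targeting at least one core capability, designed to gauge the adaptability and developmental potential of MLLMs and to mirror the stages of children's cognitive growth. In addition, our tasks cover diverse scenarios and objects with randomly generated layouts, giving a more accurate and robust evaluation of MLLM capabilities. KidGym is designed to be fully user-customizable and extensible, so that researchers can create new evaluation scenarios and adjust difficulty levels to keep pace with the rapidly growing MLLM community. By evaluating state-of-the-art MLLMs with KidGym, we obtained significant insights into model capabilities and revealed several limitations of current models. The benchmark is available at https://bobo-ye.github.io/KidGym/.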
