Fugu-MT Paper Translation (Abstract): Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Paper overview: Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

arxiv url: http://arxiv.org/abs/2606.05531v1
Date: Thu, 04 Jun 2026 00:21:22 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.446878
Title: Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models
Authors: Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari
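Abstract summary: BloomBench is a cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for vision-language models. The authors study state-of-the-art VLMs to diagnose their cognitive profiles. The study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning.
Reference score (attention level, independently calculated by this site): 4.827220845523129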
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.
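The abstract describes scoring models separately at each of the six Bloom's Taxonomy levels and comparing English with Arabic. A minimal sketch of that kind of aggregation follows. The record layout (`level`, `language`, `correct`), the language codes `en` and `ar`, and both function names are assumptions made for illustration; they are not taken from the BloomBench repository, and the toy records are made up rather than benchmark results.

```python
from collections import defaultdict

# The six Bloom's Taxonomy levels named in the abstract.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]


def accuracy_by_level(records):
    """Return {(level, language): accuracy} from per-item correctness flags.

    Each record is a dict with hypothetical keys: level, language, correct.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        key = (rec["level"], rec["language"])
        totals[key] += 1
        correct[key] += int(rec["correct"])
    return {key: correct[key] / totals[key] for key in totals}


def language_gap(acc, lang_a="en", lang_b="ar"):
    """Per-level accuracy of lang_a minus lang_b, for levels scored in both."""
    return {
        level: acc[(level, lang_a)] - acc[(level, lang_b)]
        for level in BLOOM_LEVELS
        if (level, lang_a) in acc and (level, lang_b) in acc
    }


# Toy, made-up records purely to exercise the functions (not benchmark results).
toy_records = [
    {"level": "Remember", "language": "en", "correct": True},
    {"level": "Remember", "language": "en", "correct": False},
    {"level": "Remember", "language": "ar", "correct": False},
    {"level": "Remember", "language": "ar", "correct": False},
    {"level": "Understand", "language": "en", "correct": True},
    {"level": "Understand", "language": "ar", "correct": True},
]

acc = accuracy_by_level(toy_records)
gap = language_gap(acc)
```

Keeping the level and the language as a joint key lets one table answer both questions the abstract raises: which cognitive levels are weak, and how far the two languages diverge at each level.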
