Fugu-MT Paper Translation (Abstract): VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage

Paper summary: VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage

arxiv url: http://arxiv.org/abs/2510.12750v1
Date: Tue, 14 Oct 2025 17:29:52 GMT
Status: Translation complete
System update date: 2025-10-15 19:02:32.420179
Title: VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage
Title (reference translation): VQArt-Bench: A Semantically Rich VQA Benchmark
Authors: A. Alfarano, L. Venturoli, D. Negueruela del Castillo,
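Abstract summary: VQArt-Bench is a large-scale Visual Question Answering benchmark for the cultural heritage domain. It is built with a novel multi-agent pipeline in which specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. An evaluation of 14 state-of-the-art MLLMs on this benchmark suggests that current models have significant limitations.
Reference score (independently calculated attention level): 0.0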
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface-level attributes, these questions fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a new, large-scale VQA benchmark for the cultural heritage domain. This benchmark is constructed using a novel multi-agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is structured along relevant visual understanding dimensions that probe a model's ability to interpret symbolic meaning, narratives, and complex visual relationships. Our evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open-source models.
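Abstract (reference translation): Multimodal Large Language Models (MLLMs) show significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains such as visual art analysis. Confined to simple syntactic structures and surface-level attributes, these questions do not capture the diversity and depth of human visual inquiry. This limitation gives models an incentive to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a large-scale VQA benchmark for the cultural heritage domain. The benchmark is built with a novel multi-agent pipeline in which specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is organized along relevant visual understanding dimensions that probe a model's ability to interpret symbolic meaning, narratives, and complex visual relationships. The evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open-source models.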

Related papers