VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage
- URL: http://arxiv.org/abs/2510.12750v1
- Date: Tue, 14 Oct 2025 17:29:52 GMT
- Title: VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage
- Authors: A. Alfarano, L. Venturoli, D. Negueruela del Castillo,
- Abstract summary: VQArt-Bench is a large-scale Visual Question Answering benchmark for the cultural heritage domain. It is constructed using a novel multi-agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. Our evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface-level attributes, these questions fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a new, large-scale VQA benchmark for the cultural heritage domain. This benchmark is constructed using a novel multi-agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is structured along relevant visual understanding dimensions that probe a model's ability to interpret symbolic meaning, narratives, and complex visual relationships. Our evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open-source models.
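The abstract describes a multi-agent pipeline in which specialized agents draft and validate questions along distinct semantic dimensions (symbolism, narrative, visual relationships). The sketch below is a minimal, hypothetical illustration of such a generator/validator loop; the `ask_llm` stub, the agent prompts, and the `CandidateQuestion` fields are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a multi-agent question-generation loop in the spirit
# of the pipeline described in the abstract. All names and prompts here are
# illustrative assumptions, not the VQArt-Bench implementation.
from dataclasses import dataclass


@dataclass
class CandidateQuestion:
    image_id: str
    dimension: str      # e.g. "symbolism", "narrative", "visual relations"
    question: str
    answer: str
    validated: bool = False


def ask_llm(prompt: str) -> str:
    """Placeholder for a call to any instruction-following (M)LLM backend."""
    raise NotImplementedError("plug in your model API here")


def generator_agent(image_desc: str, dimension: str) -> CandidateQuestion:
    # One specialized agent drafts a question targeting a single semantic dimension.
    q_and_a = ask_llm(
        f"Given this artwork description:\n{image_desc}\n"
        f"Write one {dimension} question and its short answer, separated by '|'."
    )
    question, answer = (part.strip() for part in q_and_a.split("|", 1))
    return CandidateQuestion("img-0001", dimension, question, answer)


def validator_agent(cand: CandidateQuestion, image_desc: str) -> CandidateQuestion:
    # A second agent checks that the answer is grounded in the description.
    verdict = ask_llm(
        f"Description:\n{image_desc}\nQ: {cand.question}\nA: {cand.answer}\n"
        "Answer 'yes' if the answer is correct and grounded, otherwise 'no'."
    )
    cand.validated = verdict.strip().lower().startswith("yes")
    return cand


def build_benchmark(image_desc: str, dimensions: list[str]) -> list[CandidateQuestion]:
    # Keep only validated questions; a real pipeline would add paraphrasing /
    # linguistic-diversity agents and human spot checks on top of this loop.
    kept = []
    for dim in dimensions:
        cand = validator_agent(generator_agent(image_desc, dim), image_desc)
        if cand.validated:
            kept.append(cand)
    return kept
```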
Related papers
- PENDULUM: A Benchmark for Assessing Sycophancy in Multimodal Large Language Models [43.767942065379366]
Sycophancy is a tendency of AI models to agree with user input at the expense of factual accuracy or in contradiction of visual evidence. We introduce a comprehensive evaluation benchmark, PENDULUM, comprising approximately 2,000 human-curated Visual Question Answering pairs. We observe substantial variability in model robustness and a pronounced susceptibility to sycophantic and hallucinatory behavior.
arXiv Detail & Related papers (2025-12-22T12:49:12Z) - MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes [1.0799568216202955]
A key frontier for Multimodal Large Language Models (MLLMs) is the ability to perform deep mathematical and spatial reasoning directly from images. MaRVL-QA is a new benchmark designed to quantitatively evaluate these core reasoning skills. MaRVL-QA reveals that even state-of-the-art MLLMs struggle significantly, often resorting to superficial heuristics instead of robust spatial reasoning.
arXiv Detail & Related papers (2025-08-24T01:24:56Z) - VisualQuest: A Diverse Image Dataset for Evaluating Visual Recognition in LLMs [12.64051404166593]
This paper introduces VisualQuest, a novel image dataset designed to assess the ability of large language models to interpret non-traditional, stylized imagery. Unlike conventional photographic benchmarks, VisualQuest challenges models with images that incorporate abstract, symbolic, and metaphorical elements.
arXiv Detail & Related papers (2025-03-25T01:23:11Z) - Vision-Language Models Struggle to Align Entities across Modalities [13.100184125419695]
Cross-modal entity linking is a fundamental skill needed for real-world applications such as multimodal code generation. Our benchmark, MATE, consists of 5.5k evaluation instances featuring visual scenes aligned with their textual representations. We evaluate state-of-the-art Vision-Language Models (VLMs) and humans on this task, and find that VLMs struggle significantly compared to humans.
arXiv Detail & Related papers (2025-03-05T19:36:43Z) - Understanding Museum Exhibits using Vision-Language Reasoning [52.35301212718003]
Museums serve as repositories of cultural heritage and historical artifacts from diverse epochs, civilizations, and regions. Domain-specific models are essential for interactive query resolution and gaining historical insights. We collect and curate a large-scale dataset of 65M images and 200M question-answer pairs for exhibits from all around the world.
arXiv Detail & Related papers (2024-12-02T10:54:31Z) - MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding. It aims to localize instances of interest across multiple images based on open-ended text prompts. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts [65.04791072532106]
We present LoCoVQA, a benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs).
LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts.
This test assesses how well VLMs can ignore irrelevant information when answering queries.
arXiv Detail & Related papers (2024-06-24T17:58:03Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - NPHardEval4V: Dynamic Evaluation of Large Vision-Language Models with Effects of Vision [64.83085920775316]
We introduce NPHardEval4V, a multimodal benchmark suite grounded in four classical NP-hard problems. Each task is presented through a combination of structured visual layouts and textual prompts, designed to assess the ability of LVLMs to perform reasoning under visual-linguistic constraints. Our results show that while these models perform reasonably well on perception-based inputs, they struggle with global optimization, abstraction, and constraint satisfaction.
arXiv Detail & Related papers (2024-03-04T07:10:31Z) - CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)