Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering
- URL: http://arxiv.org/abs/2511.22843v1
- Date: Fri, 28 Nov 2025 02:22:06 GMT
- Title: Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering
- Authors: Dosung Lee, Sangwon Jung, Boyoung Kim, Minyoung Kim, Sungyeon Kim, Junyoung Sung, Paul Hongsuck Seo
- Abstract summary: Multimodal Knowledge-Based Visual Question Answering suffers from "visual shortcuts". We introduce the Multi-Image MultImodal Retriever (MIMIR), which enriches document embeddings by augmenting them with images of related entities. Our experiments validate the limitations of existing benchmarks and demonstrate the effectiveness of MIMIR.
- Score: 29.782721931657544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks suffer from "visual shortcuts", as the query image typically matches the primary subject entity of the target document. We demonstrate that models can exploit these shortcuts, achieving comparable results using visual cues alone. To address this, we introduce the Relational Entity Text-Image kNowledge Augmented (RETINA) benchmark, automatically constructed with an LLM-driven pipeline and consisting of a 120k training set and a 2k human-curated test set. RETINA contains queries referencing secondary subjects (i.e., related entities) and pairs them with images of these related entities, removing the visual shortcut. When evaluated on RETINA, existing models show significantly degraded performance, confirming their reliance on the shortcut. Furthermore, we propose the Multi-Image MultImodal Retriever (MIMIR), which enriches document embeddings by augmenting them with images of multiple related entities, effectively handling RETINA, unlike prior work that uses only a single image per document. Our experiments validate the limitations of existing benchmarks and demonstrate the effectiveness of RETINA and MIMIR. Our project is available at: Project Page.
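The retriever described in the abstract fuses a document's text with images of multiple related entities into a single embedding, so that a query image of a secondary entity can still match the right document. Below is a minimal sketch of that multi-image fusion idea using an off-the-shelf CLIP encoder; the checkpoint name, the mean-pooling fusion rule, and cosine-similarity scoring are illustrative assumptions, not the actual MIMIR architecture.

```python
# Minimal sketch of multi-image document embedding for retrieval.
# Assumptions (not from the paper): a CLIP encoder from sentence-transformers
# and simple mean pooling to fuse text with several related-entity images.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # joint text/image embedding space

def embed_document(doc_text: str, entity_images: list) -> np.ndarray:
    """Fuse document text with images of its related entities."""
    text_vec = model.encode([doc_text], normalize_embeddings=True)      # (1, d)
    img_vecs = model.encode(entity_images, normalize_embeddings=True)   # (k, d)
    fused = np.vstack([text_vec, img_vecs]).mean(axis=0)                # mean pooling
    return fused / np.linalg.norm(fused)

def embed_query(question: str, query_image) -> np.ndarray:
    """Encode a query that pairs a question with a related-entity image."""
    q_text = model.encode([question], normalize_embeddings=True)
    q_img = model.encode([query_image], normalize_embeddings=True)
    fused = np.vstack([q_text, q_img]).mean(axis=0)
    return fused / np.linalg.norm(fused)

def retrieve(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5):
    """Rank documents by cosine similarity (rows of doc_matrix are unit vectors)."""
    scores = doc_matrix @ query_vec
    return np.argsort(-scores)[:k]

# Usage: docs = np.stack([embed_document(t, imgs) for t, imgs in corpus])
#        top_idx = retrieve(embed_query(question, image), docs)
```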
Related papers
- More Images, More Problems? A Controlled Analysis of VLM Failure Modes [80.64323947730905]
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. We introduce MIMIC, a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs.
arXiv Detail & Related papers (2026-01-12T18:45:13Z)
- Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding [49.26132236798123]
Vision Language Models (VLMs) have gradually become a primary approach in document understanding. We propose SLEUTH, a multi-agent framework that orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy.
arXiv Detail & Related papers (2025-11-28T03:09:40Z)
- QG-CoC: Question-Guided Chain-of-Captions for Large Multimodal Models [50.51641024244313]
We investigate how current prompting methods perceive fine-grained visual details and process visual information when dealing with multiple images. Inspired by the findings, we propose a new zero-shot prompting method, Question-Guided Chain-of-Captions (QG-CoC). We evaluate our method on various open-source and closed-source MLLMs on multi-image and single-image benchmarks.
arXiv Detail & Related papers (2025-11-05T05:49:48Z)
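The chain-of-captions idea above, as described in the abstract, first captions each image under the guidance of the question and then answers from the collected captions. A rough two-stage sketch follows; the `chat` helper stands in for whatever multimodal chat API is available, and the prompt wording is an assumption for illustration, not the paper's actual prompts.

```python
# Sketch of question-guided chain-of-captions prompting (two stages).
# `chat(text, images)` is a placeholder for any multimodal chat model call;
# the prompts below are illustrative, not the ones used in the paper.
from typing import Callable

def qg_coc_answer(question: str, images: list, chat: Callable[[str, list], str]) -> str:
    # Stage 1: caption each image, guided by the question, one image at a time.
    captions = []
    for i, img in enumerate(images):
        prompt = (
            f"Question: {question}\n"
            "Describe only the details of this image that are relevant "
            "to answering the question."
        )
        captions.append(f"Image {i + 1}: {chat(prompt, [img])}")

    # Stage 2: answer from the collected captions (text-only reasoning step).
    joined = "\n".join(captions)
    final_prompt = f"{joined}\n\nUsing the captions above, answer the question: {question}"
    return chat(final_prompt, [])
```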
- VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval [56.12310817934239]
Cross-modal embeddings behave as bags of concepts and underrepresent structured visual relationships such as pose and viewpoint. We propose Visualize-then-Retrieve (VisRet), a new paradigm for T2I retrieval that mitigates this limitation of cross-modal similarity alignment. VisRet substantially outperforms cross-modal similarity matching and baselines that recast T2I retrieval as text-to-text similarity matching.
arXiv Detail & Related papers (2025-05-26T17:59:33Z)
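As the name suggests, a visualize-then-retrieve approach first renders the text query into an image and then searches in image space instead of across modalities. The sketch below illustrates that pipeline; the diffusion checkpoint, the CLIP encoder, and the single-rendering setup are assumptions for illustration, not the VisRet system itself.

```python
# Sketch of a visualize-then-retrieve pipeline for text-to-image retrieval:
# render the text query into a synthetic image, then do image-to-image search.
# Model checkpoints and the single rendering per query are illustrative choices.
import numpy as np
import torch
from diffusers import StableDiffusionPipeline
from sentence_transformers import SentenceTransformer

t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
encoder = SentenceTransformer("clip-ViT-B-32")

def visualize_then_retrieve(query_text: str, corpus_image_vecs: np.ndarray, k: int = 5):
    """corpus_image_vecs: (n, d) unit-normalized CLIP embeddings of corpus images."""
    rendered = t2i(query_text).images[0]                  # text -> synthetic image
    q = encoder.encode([rendered], normalize_embeddings=True)[0]
    scores = corpus_image_vecs @ q                        # image-to-image similarity
    return np.argsort(-scores)[:k]
```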
- Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning [12.17399365931]
Existing one-pass MLLMs process entire document images without considering query relevance. Inspired by the human coarse-to-fine reading pattern, we introduce Doc-CoB, a simple-yet-effective mechanism that integrates human-style visual reasoning into MLLMs. Our method allows the model to autonomously select the set of regions most relevant to the query, and then focus attention on them for further understanding.
arXiv Detail & Related papers (2025-05-24T08:53:05Z)
- Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering [60.062194349648195]
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains.
arXiv Detail & Related papers (2025-05-22T09:52:57Z)
- ABC: Achieving Better Control of Multimodal Embeddings using VLMs [61.396457715710774]
Visual embedding models excel at zero-shot tasks like visual retrieval and classification. However, these models cannot be used for tasks that contain ambiguity or require user instruction. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone.
arXiv Detail & Related papers (2025-03-01T03:29:02Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
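Compressing visual tokens at several nested granularities, as the Matryoshka-style embedder above does, can be approximated by pooling a patch-token grid down to progressively smaller grids. The pooling rule and grid sizes below are assumptions for illustration, not the MME design.

```python
# Sketch: produce nested ("Matryoshka"-style) sets of visual tokens by pooling
# a ViT patch-token grid to progressively coarser resolutions.
# Grid sizes and average pooling are illustrative choices, not the MME design.
import torch
import torch.nn.functional as F

def nested_visual_tokens(patch_tokens: torch.Tensor, grid: int = 16,
                         levels: tuple = (16, 8, 4, 1)) -> dict:
    """patch_tokens: (grid*grid, d) tokens from a vision encoder.
    Returns {num_tokens: tensor of shape (g*g, d)} for each granularity g."""
    d = patch_tokens.shape[-1]
    # Reshape to (1, d, grid, grid) so the tokens can be pooled spatially.
    grid_tokens = patch_tokens.T.reshape(1, d, grid, grid)
    out = {}
    for g in levels:
        pooled = F.adaptive_avg_pool2d(grid_tokens, g)   # (1, d, g, g)
        out[g * g] = pooled.reshape(d, g * g).T          # (g*g, d)
    return out

# Usage: tokens = torch.randn(256, 768)         # e.g., a 16x16 ViT patch grid
#        nested = nested_visual_tokens(tokens)  # {256: ..., 64: ..., 16: ..., 1: ...}
```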
- VP-MEL: Visual Prompts Guided Multimodal Entity Linking [16.463229055333407]
Multimodal entity linking (MEL) is a task aimed at linking mentions within multimodal contexts to their corresponding entities in a knowledge base (KB). Existing MEL methods often rely on mention words as retrieval cues, which limits their ability to effectively utilize information from both images and text. We propose a framework named IIER, which enhances visual feature extraction using visual prompts and leverages the pretrained Detective-VLM model to capture latent information.
arXiv Detail & Related papers (2024-12-09T18:06:39Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
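Training a retriever that consumes a text-plus-image query end to end typically comes down to learning a fused query encoder with a contrastive objective against document embeddings. The small training step below illustrates that setup; the projection head, pre-computed backbone features, and in-batch InfoNCE loss are assumptions for illustration, not the ReViz architecture.

```python
# Sketch: contrastive training of a fused (text + image) query encoder against
# document embeddings. The projection head, pre-computed features, and in-batch
# InfoNCE objective are illustrative choices, not the ReViz model itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedQueryEncoder(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Fuse pre-computed text and image features into one query vector.
        self.proj = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        q = self.proj(torch.cat([text_feat, image_feat], dim=-1))
        return F.normalize(q, dim=-1)

def info_nce_step(encoder, optimizer, text_feat, image_feat, doc_feat, tau: float = 0.05):
    """One in-batch contrastive step: query i should match document i."""
    q = encoder(text_feat, image_feat)        # (B, d)
    d = F.normalize(doc_feat, dim=-1)         # (B, d)
    logits = q @ d.T / tau                    # (B, B) similarity matrix
    labels = torch.arange(q.size(0))
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with random stand-in features:
# enc = FusedQueryEncoder()
# opt = torch.optim.AdamW(enc.parameters(), lr=1e-4)
# loss = info_nce_step(enc, opt, torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```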