KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge
- URL: http://arxiv.org/abs/2508.14080v1
- Date: Tue, 12 Aug 2025 19:43:44 GMT
- Title: KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge
- Authors: Guanghao Jin, Jingpei Wu, Tianpei Guo, Yiyi Niu, Weidong Zhou, Guoyang Liu,
- Abstract summary: We propose KnowDR-REC, built upon real-world knowledge, requiring fine-grained multimodal reasoning across text and image. We evaluate 16 state-of-the-art multimodal models on KnowDR-REC, with experimental results showing that existing MLLMs still struggle with knowledge-driven visual grounding tasks.
- Score: 1.5833270109954136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Referring Expression Comprehension (REC) is a popular multimodal task that aims to accurately detect target objects within a single image based on a given textual expression. However, due to the limitations of earlier models, traditional REC benchmarks either rely solely on intra-image cues or lack sufficiently fine-grained instance annotations, making them inadequate for evaluating the reasoning capabilities of Multi-modal Large Language Models (MLLMs). To address this gap, we propose a new benchmark, KnowDR-REC, characterized by three key features: Firstly, it is built upon real-world knowledge, requiring fine-grained multimodal reasoning across text and image. Secondly, the dataset includes elaborately constructed negative samples via fine-grained expression editing, designed to evaluate a model's robustness and anti-hallucination ability. Lastly, we introduce three novel evaluation metrics to systematically explore the model's internal reasoning process. We evaluate 16 state-of-the-art multimodal models on KnowDR-REC, with experimental results showing that existing MLLMs still struggle with knowledge-driven visual grounding tasks. Furthermore, we observe a decoupling between textual understanding and visual grounding in MLLMs, where many models are significantly influenced by memorized shortcut correlations, which severely affect their behavior on our benchmark and hinder genuine multimodal reasoning. We anticipate that the proposed benchmark will inspire future research towards developing more robust, interpretable, and knowledge-intensive visual grounding frameworks, driving the development of more reliable and robust multimodal systems for complex real-world scenarios.
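The abstract does not spell out KnowDR-REC's scoring protocol or its three novel metrics, but REC systems are conventionally scored by the overlap (IoU) between the predicted box and the annotated target box, counting a prediction as correct above a threshold such as 0.5. The snippet below is a minimal sketch of that conventional metric only, not the paper's actual procedure; the box format and the 0.5 threshold are assumptions.

```python
# Minimal sketch of a standard REC-style scoring loop: a prediction counts as
# correct when the IoU between the predicted and ground-truth boxes is >= 0.5.
# This illustrates the conventional metric only; KnowDR-REC's three novel
# metrics are not specified in the abstract and are not reproduced here.

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of expressions whose predicted box matches the target at IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Example: two expressions, one correctly grounded.
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts   = [(12, 11, 52, 49), (60, 60, 90, 90)]
print(grounding_accuracy(preds, gts))  # 0.5
```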
Related papers
- Multimodal UNcommonsense: From Odd to Ordinary and Ordinary to Odd [1.0888485668490169]
We introduce Multimodal UNcommonsense (MUN), a benchmark designed to evaluate models' ability to handle scenarios that deviate from typical visual or contextual expectations. MUN pairs visual scenes with surprising or unlikely outcomes described in natural language, prompting models to either rationalize seemingly odd images using everyday logic or uncover unexpected interpretations in ordinary scenes. Experiments show an average improvement of 8.3% over baseline ICL methods, highlighting the effectiveness of R-ICL in low-frequency, atypical settings.
arXiv Detail & Related papers (2026-02-02T02:54:34Z) - PENDULUM: A Benchmark for Assessing Sycophancy in Multimodal Large Language Models [43.767942065379366]
Sycophancy is a tendency of AI models to agree with user input at the expense of factual accuracy or in contradiction of visual evidence. We introduce a comprehensive evaluation benchmark, PENDULUM, comprising approximately 2,000 human-curated Visual Question Answering pairs. We observe substantial variability in model robustness and a pronounced susceptibility to sycophantic and hallucinatory behavior.
arXiv Detail & Related papers (2025-12-22T12:49:12Z) - Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models [0.0]
Vision-Language Models (VLMs) are powerful generative tools but often produce factually inaccurate outputs. This work introduces a framework for knowledge-guided reasoning in VLMs, leveraging structured knowledge graphs for multi-hop verification. We evaluate the framework using hierarchical, triple-based, and bullet-point-based knowledge representations, analyzing their effectiveness in factual accuracy and logical inference (a minimal illustrative sketch of triple-based verification appears after this list).
arXiv Detail & Related papers (2025-11-25T17:34:32Z) - When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought [118.71264263478083]
We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. We include 546 multimodal problems, annotated with intermediate visual images and final answers.
arXiv Detail & Related papers (2025-11-04T18:00:51Z) - LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z) - CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image, and Speech with Path Balance [10.843417240658992]
Cross-modal multi-hop reasoning (CMR) is a valuable yet underexplored capability of multimodal large language models (MLLMs). We argue that existing benchmarks for evaluating this ability have critical shortcomings. We introduce a novel benchmark: Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance (CMR-SPB).
arXiv Detail & Related papers (2025-08-22T08:17:31Z) - From Semantics, Scene to Instance-awareness: Distilling Foundation Model for Open-vocabulary Situation Recognition [14.16399307533106]
Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR). We transfer knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities. We propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model.
arXiv Detail & Related papers (2025-07-19T16:29:02Z) - CoLLM: A Large Language Model for Composed Image Retrieval [76.29725148964368]
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. We present CoLLM, a one-stop framework that generates triplets on-the-fly from image-caption pairs. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts.
arXiv Detail & Related papers (2025-03-25T17:59:50Z) - VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity [34.29409506366145]
VERIFY is a benchmark designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. Each problem is accompanied by a human-annotated reasoning path, making it the first benchmark to provide in-depth evaluation of model decision-making processes. We propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns.
arXiv Detail & Related papers (2025-03-14T16:26:11Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become powerful tools for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We find that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT-4V and Gemini, on various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z) - Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z)
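The "Beyond Generation" entry above mentions triple-based knowledge representations used for multi-hop factual verification; the sketch below illustrates only that general idea. The abstract does not describe the actual graph construction or verification procedure, and the entities, relations, and function names here are invented for illustration.

```python
# Hypothetical sketch: checking a two-hop claim against a small set of
# (subject, relation, object) triples. Entities and relations are invented;
# the paper's actual pipeline is not described in the abstract.

KG = {
    ("Eiffel Tower", "located_in", "Paris"),
    ("Paris", "capital_of", "France"),
}

def has_fact(subject, relation, obj, kg=KG):
    """True if the triple is present in the knowledge graph."""
    return (subject, relation, obj) in kg

def verify_chain(hops, kg=KG):
    """Verify a multi-hop claim expressed as an ordered list of triples.
    The claim is supported only if every hop is backed by the graph."""
    return all(has_fact(*hop, kg=kg) for hop in hops)

# Claim: "The Eiffel Tower is in the capital of France", decomposed into two hops.
claim_hops = [
    ("Eiffel Tower", "located_in", "Paris"),
    ("Paris", "capital_of", "France"),
]
print(verify_chain(claim_hops))  # True
```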
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences arising from its use.