Related papers: RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks

RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks

URL: http://arxiv.org/abs/2509.23673v1
Date: Sun, 28 Sep 2025 06:26:11 GMT
Title: RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks
Authors: Amit Agarwal, Hitesh Laxmichand Patel, Srikant Panda, Hansa Meghwani, Jyotika Singh, Karan Dua, Paul Li, Tao Sheng, Sujith Ravi, Dan Roth,
Abstract summary: Region Index (RCI) is the first model-based score to quantify a dataset's reliance on global versus local visual information.<n>When applying RCI to 13 widely used multimodal benchmarks, we observed that most of them favor localized reasoning.
Score: 33.68425991692674
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive results on vision-language benchmarks, yet it remains unclear whether these benchmarks assess genuine global reasoning or allow success via localized visual cues. Existing evaluation methods do not explicitly measure this distinction, hindering effective dataset curation and real-world focused model development. We introduce Region Comprehension Index (RCI), the first model-based score to directly quantify a dataset's reliance on global versus local visual information. RCI systematically compares reference-model performance on image patches versus full images, revealing if tasks require holistic image understanding or can be solved with partial or localized visual cues. When applying RCI to 13 widely used multimodal benchmarks, we observed that most of them favor localized reasoning and exhibit significant spatial biases, indicating potential risks in real-world applications. RCI equips researchers & practitioners with an actionable tool for diagnosing & mitigating these biases, enabling the construction of datasets and benchmarks to foster the development of robust, enterprise-ready multimodal systems.

Related papers

RegionReasoner: Region-Grounded Multi-Round Visual Reasoning [69.75509909581133]
RegionReasoner is a reinforcement learning framework for visual reasoning.<n>It enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes.<n>RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment.
arXiv Detail & Related papers (2026-02-03T16:52:16Z)
KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge [1.5833270109954136]
We propose KnowDR-REC, built upon real-world knowledge, requiring fine-grained multimodal reasoning across text and image.<n>We evaluate 16 state-of-the-art multimodal models on KnowDR-REC, with experimental results showing that existing MLLMs still struggle with knowledge-driven visual grounding tasks.
arXiv Detail & Related papers (2025-08-12T19:43:44Z)
From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models [14.178064117544082]
Image geolocalization is important for applications in crisis response, digital forensics, and location-based intelligence.<n>Recent advances in large language models (LLMs) offer new opportunities for visual reasoning.<n>We introduce a benchmark called IMAGEO-Bench that systematically evaluates accuracy, distance error, geospatial bias, and reasoning process.
arXiv Detail & Related papers (2025-08-03T06:04:33Z)
Region-based Cluster Discrimination for Visual Representation Learning [30.79223671093668]
Region-Aware Cluster Discrimination (RICE) is a novel method that enhances region-level visual and OCR capabilities.<n>RICE consistently outperforms previous methods on tasks, including segmentation, dense detection, and visual perception.
arXiv Detail & Related papers (2025-07-26T17:47:09Z)
Datasets for Fairness in Language Models: An In-Depth Survey [13.944063655545898]
We offer a comprehensive analysis of the most widely used fairness datasets in language model research.<n>We propose a unified evaluation framework that surfaces consistent patterns of demographic disparities across benchmarks and scoring metrics.<n>Our findings highlight an urgent need for new benchmarks that capture a broader range of social contexts and fairness notions.
arXiv Detail & Related papers (2025-06-29T22:11:58Z)
RING#: PR-by-PE Global Localization with Roto-translation Equivariant Gram Learning [20.688641105430467]
Global localization is crucial in autonomous driving and robotics applications when GPS signals are unreliable. Most approaches achieve global localization by sequential place recognition (PR) and pose estimation (PE) We introduce a new paradigm, PR-by-PE localization, which bypasses the need for separate place recognition by directly deriving it from pose estimation. We propose RING#, an end-to-end PR-by-PE localization network that operates in the bird's-eye-view (BEV) space, compatible with both vision and LiDAR sensors.
arXiv Detail & Related papers (2024-08-30T18:42:53Z)
Benchmark Granularity and Model Robustness for Image-Text Retrieval [44.045767657945895]
We show how dataset granularity and query perturbations affect retrieval performance and robustness.<n>We show that richer captions consistently enhance retrieval, especially in text-to-image tasks.<n>Our results highlight variation in model robustness and a dataset-dependent relationship between caption granularity and sensitivity perturbation.
arXiv Detail & Related papers (2024-07-21T18:08:44Z)
Recognize Any Regions [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model.<n>Experiments in open-world object recognition show that our RegionSpot achieves significant performance gain over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z)
From Global to Local: Multi-scale Out-of-distribution Detection [129.37607313927458]
Out-of-distribution (OOD) detection aims to detect "unknown" data whose labels have not been seen during the in-distribution (ID) training process. Recent progress in representation learning gives rise to distance-based OOD detection. We propose Multi-scale OOD DEtection (MODE), a first framework leveraging both global visual information and local region details.
arXiv Detail & Related papers (2023-08-20T11:56:25Z)
GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models. We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench. GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
Adaptive Local-Component-aware Graph Convolutional Network for One-shot Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition. Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art.
arXiv Detail & Related papers (2022-09-21T02:33:07Z)
Representation-Centric Survey of Skeletal Action Recognition and the ANUBIS Benchmark [43.00059447663327]
3D skeleton-based human action recognition has emerged as a powerful alternative to traditional RGB and depth-based approaches.<n>Despite remarkable progress, current research remains fragmented across diverse input representations.<n>ANUBIS is a large-scale, challenging skeleton action dataset designed to address critical gaps in existing benchmarks.
arXiv Detail & Related papers (2022-05-04T14:03:43Z)
Local-Global Associative Frame Assemble in Video Re-ID [57.7470971197962]
Noisy and unrepresentative frames in automatically generated object bounding boxes from video sequences cause challenges in learning discriminative representations in video re-identification (Re-ID) Most existing methods tackle this problem by assessing the importance of video frames according to either their local part alignments or global appearance correlations separately. In this work, we explore jointly both local alignments and global correlations with further consideration of their mutual promotion/reinforcement.
arXiv Detail & Related papers (2021-10-22T19:07:39Z)
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models [41.45240621979654]
We introduce BEIR, a heterogeneous benchmark for information retrieval. We study the effectiveness of nine state-of-the-art retrieval models in a zero-shot evaluation setup. Dense-retrieval models are computationally more efficient but often underperform other approaches.
arXiv Detail & Related papers (2021-04-17T23:29:55Z)
Region Comparison Network for Interpretable Few-shot Image Classification [97.97902360117368]
Few-shot image classification has been proposed to effectively use only a limited number of labeled examples to train models for new classes. We propose a metric learning based method named Region Comparison Network (RCN), which is able to reveal how few-shot learning works. We also present a new way to generalize the interpretability from the level of tasks to categories.
arXiv Detail & Related papers (2020-09-08T07:29:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.