VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing
- URL: http://arxiv.org/abs/2602.07045v1
- Date: Wed, 04 Feb 2026 08:21:33 GMT
- Title: VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing
- Authors: Zhiming Luo, Di Wang, Haonan Guo, Jing Zhang, Bo Du
- Abstract summary: We propose a Vision Language ReaSoning Benchmark (VLRS-Bench) for complex remote sensing reasoning. VLRS-Bench comprises 2,000 question-answer pairs with an average length of 71 words, spanning 14 tasks and up to eight temporal phases.
- Score: 49.12477222994131
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), which is the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average length of 71 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Experimental results reveal significant bottlenecks in existing state-of-the-art MLLMs, providing critical insights for advancing multimodal reasoning within the remote sensing community.
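To make the benchmark's composition concrete, below is a minimal Python sketch of what a single VLRS-Bench item and a toy scorer might look like, using only the figures stated in the abstract (2,000 QA pairs, three dimensions, 14 tasks, up to eight temporal phases). The field names, file layout, and exact-match metric are assumptions for illustration, not the paper's actual schema or evaluation protocol.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VLRSSample:
    """Hypothetical record for one VLRS-Bench question-answer pair."""
    question: str                 # long-form question (~71 words on average)
    answer: str                   # reference answer
    dimension: str                # "Cognition", "Decision", or "Prediction"
    task: str                     # one of the 14 task types
    image_paths: List[str] = field(default_factory=list)  # 1-8 temporal phases

def exact_match_accuracy(samples: List[VLRSSample], predictions: List[str]) -> float:
    """Toy exact-match scorer; the paper's real metric may differ."""
    if not samples:
        return 0.0
    correct = sum(
        pred.strip().lower() == s.answer.strip().lower()
        for s, pred in zip(samples, predictions)
    )
    return correct / len(samples)
```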
Related papers
- Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings [5.30570508258782]
RE2-Bench is a benchmark of 1,101 reasoning problems, including 195 drawn from mature real-world projects. A comprehensive evaluation of six general-purpose and reasoning-oriented LLMs on two widely used code reasoning tasks using RE2-Bench reveals a significant performance drop from Easy to Hard problems.
arXiv Detail & Related papers (2025-12-16T21:12:53Z) - MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs [61.70050081221131]
MVU-Eval is the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. It mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics.
arXiv Detail & Related papers (2025-11-10T16:02:33Z) - Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective [53.594353527056775]
We propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a benchmark designed to evaluate Large Language Models' (LLMs) ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. We implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions.
arXiv Detail & Related papers (2025-10-09T20:29:00Z) - OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding [50.72259772580637]
We introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We find that both complex spatial reasoning demands and long-term memory retrieval requirements significantly degrade model performance along two separate axes.
arXiv Detail & Related papers (2025-07-10T17:56:07Z) - StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns [7.60350050736492]
Long-term memory is essential for large language models to achieve autonomous intelligence. Existing benchmarks face challenges in evaluating knowledge retention and dynamic sequential reasoning. We propose a novel benchmark framework based on interactive fiction games.
arXiv Detail & Related papers (2025-06-16T10:54:31Z) - SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z) - Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges [4.668749313973097]
This paper systematically evaluates Large Language Models (LLMs) and Large Reasoning Models (LRMs) across three levels of reasoning complexity. We curate 26 challenges where models answer either directly or via a Python code interpreter. LRMs show robust performance across tasks of varying difficulty, often matching or surpassing traditional first-principle-based methods.
arXiv Detail & Related papers (2025-05-16T18:32:35Z) - XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? [68.3805081483279]
We present XLRS-Bench, a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios. All evaluation samples are meticulously annotated manually, assisted by a novel semi-automatic captioner for ultra-high-resolution RS images. The results of both general and RS-focused MLLMs on XLRS-Bench indicate that further efforts are needed for real-world RS applications.
arXiv Detail & Related papers (2025-03-31T06:41:18Z) - A Vision Centric Remote Sensing Benchmark [21.48675282619887]
This study investigates the limitations of CLIP-based MLLMs in remote sensing tasks. We introduce the remote sensing multimodal visual patterns (RSMMVP) benchmark, designed to evaluate MLLMs on RS tasks by identifying CLIP-blind pairs. We analyze the performance of state-of-the-art MLLMs, revealing significant limitations in RS-specific representation learning.
arXiv Detail & Related papers (2025-03-20T03:03:46Z) - MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning? [21.638848019633595]
We propose MIR-Bench, the first many-shot in-context reasoning benchmark for pattern recognition. We study many novel problems in many-shot in-context reasoning and report several insightful findings.
arXiv Detail & Related papers (2025-02-14T06:05:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.