XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
- URL: http://arxiv.org/abs/2503.23771v1
- Date: Mon, 31 Mar 2025 06:41:18 GMT
- Title: XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
- Authors: Fengxiang Wang, Hongzhen Wang, Mingshuo Chen, Di Wang, Yulin Wang, Zonghao Guo, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, Zhiyuan Liu, Maosong Sun
- Abstract summary: We present XLRS-Bench, a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios. All evaluation samples are meticulously annotated manually, assisted by a novel semi-automatic captioner on ultra-high-resolution RS images. The results of both general and RS-focused MLLMs on XLRS-Bench indicate that further efforts are needed for real-world RS applications.
- Score: 68.3805081483279
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The astonishing breakthrough of multimodal large language models (MLLMs) has necessitated new benchmarks to quantitatively assess their capabilities, reveal their limitations, and indicate future research directions. However, this is challenging in the context of remote sensing (RS), since the imagery features ultra-high resolution and incorporates extremely complex semantic relationships. Existing benchmarks usually adopt notably smaller image sizes than real-world RS scenarios, suffer from limited annotation quality, and consider insufficient dimensions of evaluation. To address these issues, we present XLRS-Bench: a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios. XLRS-Bench boasts the largest average image size (8500$\times$8500) observed thus far, with all evaluation samples meticulously annotated manually, assisted by a novel semi-automatic captioner on ultra-high-resolution RS images. On top of XLRS-Bench, 16 sub-tasks are defined to evaluate MLLMs' 10 kinds of perceptual capabilities and 6 kinds of reasoning capabilities, with a primary emphasis on advanced cognitive processes that facilitate real-world decision-making and the capture of spatiotemporal changes. The results of both general and RS-focused MLLMs on XLRS-Bench indicate that further efforts are needed for real-world RS applications. We have open-sourced XLRS-Bench to support further research in developing more powerful MLLMs for remote sensing.
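Since an 8500$\times$8500 scene is far larger than the native input resolution of current MLLMs, evaluation pipelines generally have to tile or downsample the image first. The sketch below illustrates one plausible tiling-plus-scoring loop; the `model.answer` call, the sample schema, and the tile size are illustrative assumptions, not the released XLRS-Bench toolkit.

```python
# Hypothetical sketch (not the XLRS-Bench toolkit): tile an ultra-high-resolution
# RS scene into crops an MLLM can ingest, then score multiple-choice samples.
from PIL import Image

Image.MAX_IMAGE_PIXELS = None  # lift PIL's decompression-bomb guard for very large scenes

def tile_image(img, tile=1024, overlap=128):
    """Yield overlapping crops so objects straddling tile borders are not lost."""
    w, h = img.size
    step = tile - overlap
    for top in range(0, max(h - overlap, 1), step):
        for left in range(0, max(w - overlap, 1), step):
            yield img.crop((left, top, min(left + tile, w), min(top + tile, h)))

def evaluate(model, samples):
    """Accuracy over multiple-choice samples; the schema
    (image_path / question / options / answer) is an assumed one."""
    correct = 0
    for s in samples:
        crops = list(tile_image(Image.open(s["image_path"])))
        pred = model.answer(crops, s["question"], s["options"])  # placeholder API
        correct += int(pred == s["answer"])
    return correct / len(samples)
```

Overlapping tiles keep small objects near tile borders visible to the model; the exact prompting and aggregation protocol should be taken from the open-sourced XLRS-Bench code.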
Related papers
- A Vision Centric Remote Sensing Benchmark [21.48675282619887]
This study investigates the limitations of CLIP-based MLLMs in remote sensing tasks. We introduce a remote sensing multimodal visual patterns (RSMMVP) benchmark, designed to evaluate MLLMs on RS tasks by identifying CLIP-blind pairs (a minimal pair-mining sketch appears after this list). We analyze the performance of state-of-the-art MLLMs, revealing significant limitations in RS-specific representation learning.
arXiv Detail & Related papers (2025-03-20T03:03:46Z)
- RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts [17.76606110070648]
We propose RSUniVLM, a unified, end-to-end RS VLM for comprehensive vision understanding across multiple granularities.
RSUniVLM performs effectively in multi-image analysis, including change detection and change captioning.
We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both the RS and general domains.
arXiv Detail & Related papers (2024-12-07T15:11:21Z)
- mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA [78.45521005703958]
Multimodal Retrieval-Augmented Generation (mRAG) is naturally introduced to provide MLLMs with comprehensive and up-to-date knowledge.
We propose a novel framework called multimodal Retrieval-Reflection-Augmented Generation (mR$^2$AG), which achieves adaptive retrieval and useful information localization.
mR$^2$AG significantly outperforms state-of-the-art MLLMs on INFOSEEK and Encyclopedic-VQA.
arXiv Detail & Related papers (2024-11-22T16:15:50Z)
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
- Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models [57.280853324896306]
Multimodal large language models (MLLMs) struggle to recognize and interpret intricate details in high-resolution (HR) images effectively.
We introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K&8K images.
We propose Divide, Conquer and Combine (DC$^2$), a novel training-free framework for enhancing MLLM perception of HR images.
arXiv Detail & Related papers (2024-08-28T06:09:02Z)
- II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models [49.070801221350486]
Multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks.
We propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images.
arXiv Detail & Related papers (2024-06-09T17:25:47Z)
- Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution [49.902047563260496]
We make the first attempt to integrate the Vision State Space Model (Mamba) into remote sensing image (RSI) super-resolution.
To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR.
Our FMSR features a multi-level fusion architecture equipped with a Frequency Selection Module (FSM), a Vision State Space Module (VSSM), and a Hybrid Gate Module (HGM).
arXiv Detail & Related papers (2024-05-08T11:09:24Z)
- MileBench: Benchmarking MLLMs in Long Context [31.211260223575092]
We introduce MileBench, a benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs.
We systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios.
Results show that while the closed-source GPT-4o outperforms others, most open-source MLLMs struggle in long-context situations.
arXiv Detail & Related papers (2024-04-29T09:19:05Z)
- LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model [10.280417075859141]
We introduce LHRS-Bot, an MLLM tailored for RS image understanding through a novel vision-language alignment strategy and a curriculum learning method.
Comprehensive experiments demonstrate that LHRS-Bot exhibits a profound understanding of RS images and the ability to perform nuanced reasoning within the RS domain.
arXiv Detail & Related papers (2024-02-04T15:46:43Z)
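For the RSMMVP entry above, "CLIP-blind pairs" are image pairs that a CLIP image encoder maps to nearly identical embeddings even though they are visually different. The sketch below shows one plausible way to mine such pairs; the `embed` callable, the label-mismatch check, and the similarity threshold are assumptions, not the RSMMVP release.

```python
# Hypothetical sketch: mine "CLIP-blind" image pairs, i.e. pairs that a CLIP
# image encoder embeds almost identically despite having different content.
# `embed` is an assumed callable returning a unit-normalised embedding vector.
import numpy as np

def clip_blind_pairs(images, labels, embed, sim_threshold=0.95):
    feats = np.stack([embed(img) for img in images])  # (N, D), unit-normalised
    sims = feats @ feats.T                            # pairwise cosine similarities
    pairs = []
    for i in range(len(images)):
        for j in range(i + 1, len(images)):
            if sims[i, j] >= sim_threshold and labels[i] != labels[j]:
                pairs.append((i, j, float(sims[i, j])))  # CLIP treats them as alike
    return pairs
```

Questions built on such pairs probe exactly the cases where a CLIP-based MLLM cannot tell visually distinct RS scenes apart.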