GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution
- URL: http://arxiv.org/abs/2505.21375v2
- Date: Tue, 04 Nov 2025 15:32:06 GMT
- Title: GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution
- Authors: Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang
- Abstract summary: GeoLLaVA-8K is the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution. SuperRS-VQA and HighRS-VQA are the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks.
- Score: 66.85537534339238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA (avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies, Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.
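The abstract does not spell out how the two token strategies are implemented. The following is a minimal sketch of what background token pruning combined with anchored token selection could look like, assuming tokens are ranked by a precomputed objectness/saliency score; the function names and parameters are illustrative, not the authors' code.

```python
# Hypothetical sketch; the paper's actual scoring and anchoring rules may differ.
import torch

def prune_and_anchor(tokens, scores, keep_ratio=0.25, num_anchors=16):
    """tokens: (N, D) visual tokens; scores: (N,) objectness/saliency scores.

    Keeps the top `keep_ratio` fraction of tokens (pruning low-scoring
    background such as ocean or forest) and additionally retains
    `num_anchors` uniformly spaced tokens as coarse spatial anchors so the
    global layout is not lost.
    """
    n = tokens.size(0)
    k = max(1, int(n * keep_ratio))
    top_idx = scores.topk(k).indices                           # object-centric tokens
    anchor_idx = torch.linspace(0, n - 1, num_anchors).long()  # spatial anchors
    keep_idx = torch.unique(torch.cat([top_idx, anchor_idx]))
    return tokens[keep_idx], keep_idx

# Example: a 64x64 patch grid (4096 tokens) reduced to roughly 25% plus anchors.
tokens = torch.randn(4096, 1024)
scores = torch.rand(4096)
kept, idx = prune_and_anchor(tokens, scores)
print(kept.shape)
```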
Related papers
- Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding [78.26501371437013]
Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition. We find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. We propose a staged knowledge injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures; and (2) "pre-warming" on the same hard UHR image-text examples during SFT to stabilize and amplify subsequent tool-based RL.
arXiv Detail & Related papers (2026-02-15T16:40:33Z) - GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery [69.05066425853326]
"thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools.<n>This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny.<n>We propose GeoEyes, a training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom
arXiv Detail & Related papers (2026-02-15T15:50:55Z) - SERA-H: Beyond Native Sentinel Spatial Limits for High-Resolution Canopy Height Mapping [3.8902217877872034]
High-resolution mapping of canopy height is essential for forest management and biodiversity monitoring. We present SERA-H, an end-to-end model combining a super-resolution module and temporal attention encoding. Our model generates 2.5 m resolution height maps from freely available Sentinel-1 and Sentinel-2 time series data.
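As a rough illustration of the ingredients the SERA-H blurb names (temporal attention over a Sentinel time series plus a super-resolution head), here is a hedged PyTorch sketch; the architecture, channel counts, and 4x scale factor are assumptions, not the published model.

```python
# Minimal sketch (not the authors' code): attention-pool a Sentinel time
# series, then pixel-shuffle super-resolve from a 10 m to a 2.5 m grid.
import torch
import torch.nn as nn

class TemporalAttnSR(nn.Module):
    def __init__(self, in_ch=12, dim=64, scale=4):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, 3, padding=1)
        self.score = nn.Conv2d(dim, 1, 1)           # per-date attention logit
        self.head = nn.Sequential(
            nn.Conv2d(dim, dim * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),                 # 4x spatial upsampling
            nn.Conv2d(dim, 1, 3, padding=1),        # canopy height in metres
        )

    def forward(self, x):                           # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        f = self.embed(x.flatten(0, 1)).view(b, t, -1, h, w)
        a = torch.softmax(self.score(f.flatten(0, 1)).view(b, t, 1, h, w), dim=1)
        pooled = (a * f).sum(dim=1)                 # temporal attention pooling
        return self.head(pooled)

out = TemporalAttnSR()(torch.randn(2, 8, 12, 32, 32))
print(out.shape)  # (2, 1, 128, 128)
```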
arXiv Detail & Related papers (2025-12-19T23:23:14Z) - VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing [59.73939718087177]
VLM2GeoVec is a single-encoder vision-language model trained contrastively to embed interleaved inputs in a unified vector space. It unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing.
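The blurb says the encoder is trained contrastively; a standard way to do that is a symmetric InfoNCE objective, sketched below under the assumption of paired embeddings (the paper's exact loss may differ).

```python
# Generic symmetric InfoNCE over paired embeddings; illustrative only.
import torch
import torch.nn.functional as F

def info_nce(query_emb, target_emb, temperature=0.07):
    """query_emb/target_emb: (B, D) embeddings of paired interleaved inputs
    (e.g., image+region prompt vs. caption); row i of each is a positive pair."""
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature              # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    # Symmetric loss: match queries to targets and targets to queries.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```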
arXiv Detail & Related papers (2025-12-12T11:39:35Z) - Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search [44.758226499411904]
ZoomSearch is a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS.
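A 'where to look' search of this kind could plausibly be organized as a recursive quadrant descent; the sketch below illustrates that idea only, with `score_fn` standing in for whatever text-image relevance scorer the actual pipeline uses.

```python
# Hypothetical adaptive zoom search: split into quadrants, score each
# against the question, and descend into the best one.
from PIL import Image

def zoom_search(image, question, score_fn, depth=3, min_size=512):
    """score_fn(crop, question) -> float is an assumed relevance scorer
    (e.g., a CLIP-style similarity); it is not specified by the summary."""
    w, h = image.size
    for _ in range(depth):
        if min(w, h) <= min_size:
            break
        quads = [(0, 0, w // 2, h // 2), (w // 2, 0, w, h // 2),
                 (0, h // 2, w // 2, h), (w // 2, h // 2, w, h)]
        crops = [image.crop(q) for q in quads]
        best = max(range(4), key=lambda i: score_fn(crops[i], question))
        image = crops[best]                       # zoom into the best quadrant
        w, h = image.size
    return image                                  # focused crop for the VQA model
```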
arXiv Detail & Related papers (2025-11-25T16:25:54Z) - SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards [23.02076024811612]
DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text domain through stable reinforcement learning (RL). In this paper, we introduce SATORI (Spatially Anchored Task Optimization with ReInforcement learning), which decomposes VQA into three verifiable stages. Experiments demonstrate consistent performance improvements across seven VQA benchmarks, achieving up to a 15.7% improvement.
arXiv Detail & Related papers (2025-05-25T11:11:06Z) - 8-Calves Image dataset [0.8233028449337972]
We introduce the 8-Calves dataset, a challenging benchmark for multi-animal detection, tracking, and identification. It features a one-hour video of eight Holstein Friesian calves in a barn, with frequent occlusions, motion blur, and diverse poses. A semi-automated pipeline using a fine-tuned YOLOv8 detector and ByteTrack, followed by manual correction, provides over 537,000 bounding boxes with temporal identity labels.
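For readers who want to reproduce a similar detect-and-track pass, the ultralytics API supports ByteTrack directly; the weights file and video path below are placeholders rather than the dataset's actual assets.

```python
# Sketch of the detect-and-track pipeline the blurb describes, via ultralytics.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # placeholder; the paper fine-tunes its own weights
results = model.track("calves.mp4", tracker="bytetrack.yaml", persist=True)
for frame in results:
    boxes = frame.boxes.xyxy          # per-frame bounding boxes
    ids = frame.boxes.id              # temporal identity labels from ByteTrack
```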
arXiv Detail & Related papers (2025-03-17T23:47:52Z) - When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning [31.696397337675847]
Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images. We propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method outperforms existing high-resolution strategies on four datasets using the same data.
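One plausible reading of text-guided pruning over an image pyramid: score coarse tiles against the question embedding and only re-encode the most relevant tiles at finer pyramid levels. The sketch below assumes precomputed tile and text features; it is not claimed to be the paper's method.

```python
# Hedged sketch of text-guided tile selection on a coarse pyramid level.
import torch

def select_tiles(tile_feats, text_feat, keep=8):
    """tile_feats: (N, D) pooled features of coarse tiles; text_feat: (D,)."""
    sims = torch.nn.functional.cosine_similarity(
        tile_feats, text_feat.unsqueeze(0), dim=-1)   # text-guided relevance
    return sims.topk(keep).indices                    # tiles to re-encode finely

# Coarse level with 16x16 tiles; keep the 8 most question-relevant tiles.
idx = select_tiles(torch.randn(256, 768), torch.randn(768))
```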
arXiv Detail & Related papers (2025-03-10T17:51:16Z) - GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing [32.85223015863783]
GeoPixel is an end-to-end high-resolution RS-LMM that supports pixel-level grounding. It supports up to 4K HD resolution in any aspect ratio, ideal for high-precision RS image analysis. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks.
arXiv Detail & Related papers (2025-01-23T18:59:30Z) - RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model [59.37279559684668]
We introduce RS-vHeat, an efficient multi-modal remote sensing foundation model. Specifically, RS-vHeat applies the Heat Conduction Operator (HCO) with a complexity of $O(N^{1.5})$ and a global receptive field. Compared to attention-based remote sensing foundation models, we reduce memory usage by 84%, reduce FLOPs by 24%, and improve throughput by 2.7 times.
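The summary does not explain how an HCO works internally. One textbook way to apply heat conduction globally is to solve the 2-D heat equation in the DCT domain, where each cosine mode decays exponentially with its frequency; the sketch below illustrates that idea only and is not claimed to match RS-vHeat's operator or its $O(N^{1.5})$ formulation.

```python
# Closed-form heat diffusion of a feature map via the 2-D DCT; illustrative.
import numpy as np
from scipy.fft import dctn, idctn

def heat_conduct(feature_map, k=0.5, t=1.0):
    """feature_map: (H, W) channel slice; k: conductivity; t: diffusion time."""
    h, w = feature_map.shape
    u = dctn(feature_map, norm="ortho")
    fy = np.pi * np.arange(h)[:, None] / h
    fx = np.pi * np.arange(w)[None, :] / w
    decay = np.exp(-k * t * (fy**2 + fx**2))   # each cosine mode decays by frequency
    return idctn(u * decay, norm="ortho")      # globally mixed feature map

out = heat_conduct(np.random.rand(64, 64))
```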
arXiv Detail & Related papers (2024-11-27T01:43:38Z) - Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models [26.322856874796702]
Vision transformers (ViTs) struggle to capture fine-grained details from less prominent objects, charts, and embedded text.
We extend recent high-resolution and multi-crop techniques by not only preserving the native resolution, but also zooming in beyond it.
This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs.
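A minimal sketch of the multi-resolution zoom-in idea, assuming one global view plus four quadrant crops taken from a 2x upsampled image; the crop layout and patch size are illustrative assumptions, not Dragonfly's published recipe.

```python
# One global low-res view plus four zoomed quadrant views beyond native resolution.
from PIL import Image

def multi_resolution_crops(image, patch=336, zoom=2):
    crops = [image.resize((patch, patch))]            # global low-res view
    w, h = image.size
    zw, zh = w * zoom // 2, h * zoom // 2
    zoomed = image.resize((w * zoom, h * zoom))       # zoom beyond native resolution
    for x in (0, zw):
        for y in (0, zh):
            crops.append(zoomed.crop((x, y, x + zw, y + zh)).resize((patch, patch)))
    return crops                                      # 1 global + 4 zoomed views
```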
arXiv Detail & Related papers (2024-06-03T04:17:12Z) - VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis [48.06425266787859]
This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis. VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD) and an honest instruction dataset comprising both factual and deceptive questions (HnstD). In our experiments, VHM significantly outperforms various vision language models on common tasks of scene classification, visual question answering, and visual grounding.
arXiv Detail & Related papers (2024-03-29T14:50:43Z) - SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection [79.23689506129733]
We establish a new benchmark dataset and an open-source method for large-scale SAR object detection.
Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets.
To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created.
arXiv Detail & Related papers (2024-03-11T09:20:40Z) - A Novel Multi-scale Attention Feature Extraction Block for Aerial Remote
Sensing Image Classification [9.388978548253755]
We propose a novel plug-and-play multi-scale attention feature extraction block (MSAFEB) based on multi-scale convolution at two levels with skip connection.
The experimental study on two benchmark VHR aerial RS image datasets (AID and NWPU) demonstrates that our proposal achieves stable and consistent performance (minimum standard deviation of $0.002$) and competent overall classification performance (AID: 95.85%; NWPU: 94.09%).
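A generic PyTorch rendering of a multi-scale attention feature extraction block, assuming parallel 1/3/5 convolutions, squeeze-and-excite channel attention, and a residual skip; the paper's exact MSAFEB configuration may differ.

```python
# Illustrative multi-scale attention block; kernel choices are assumed.
import torch
import torch.nn as nn

class MSAFEB(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, k, padding=k // 2) for k in (1, 3, 5))
        self.attn = nn.Sequential(                   # squeeze-excite channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid())
        self.act = nn.ReLU()

    def forward(self, x):
        multi = sum(b(x) for b in self.branches)       # multi-scale fusion
        return self.act(x + multi * self.attn(multi))  # attention + skip connection

y = MSAFEB()(torch.randn(1, 64, 32, 32))
```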
arXiv Detail & Related papers (2023-08-27T11:49:46Z) - Recurrent Multi-scale Transformer for High-Resolution Salient Object
Detection [68.65338791283298]
Salient Object Detection (SOD) aims to identify and segment the most conspicuous objects in an image or video.
Traditional SOD methods are largely limited to low-resolution images, making them ill-suited to the development of High-Resolution SOD.
In this work, we first propose a new HRS10K dataset, which contains 10,500 high-quality annotated images at 2K-8K resolution.
arXiv Detail & Related papers (2023-08-07T17:49:04Z) - Scaling Data Generation in Vision-and-Language Navigation [116.95534559103788]
We propose an effective paradigm for generating large-scale data for learning.
We apply 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesize 4.9 million instruction-trajectory pairs.
Thanks to our large-scale dataset, the performance of an existing agent can be pushed to a new best of 80% single-run success rate on the R2R test split (+11% absolute over the previous SoTA) by simple imitation learning.
arXiv Detail & Related papers (2023-07-28T16:03:28Z) - An Attention-Fused Network for Semantic Segmentation of
Very-High-Resolution Remote Sensing Imagery [26.362854938949923]
We propose a novel convolutional neural network architecture, named the attention-fused network (AFNet).
We achieve state-of-the-art performance with an overall accuracy of 91.7% and a mean F1 score of 90.96% on the ISPRS Vaihingen 2D dataset and the ISPRS Potsdam 2D dataset.
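The blurb does not describe AFNet's fusion mechanism; a common attention-fused pattern is a learned per-pixel gate blending two feature streams, sketched below as a generic illustration rather than the paper's actual design.

```python
# Generic attention-based fusion of two feature streams (e.g., a low-level
# detail branch and a high-level semantic branch); illustrative only.
import torch
import torch.nn as nn

class AttentionFuse(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid())  # per-pixel fusion weights

    def forward(self, low, high):
        w = self.gate(torch.cat([low, high], dim=1))
        return w * low + (1 - w) * high              # attention-weighted blend

fused = AttentionFuse(64)(torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64))
```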
arXiv Detail & Related papers (2021-05-10T06:23:27Z) - Self-Supervised Representation Learning for RGB-D Salient Object
Detection [93.17479956795862]
We use self-supervised representation learning to design two pretext tasks: a cross-modal auto-encoder and depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets for pre-training, which makes the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
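As a hedged sketch of the cross-modal auto-encoder pretext task (RGB in, depth out; the symmetric direction would mirror it), with an assumed toy backbone and L1 reconstruction loss:

```python
# Toy cross-modal auto-encoder pretext: reconstruct depth from RGB, so
# pre-training needs no manual labels, only raw RGB-D pairs. Illustrative.
import torch
import torch.nn as nn

class CrossModalAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1))  # predict depth

    def forward(self, rgb):
        return self.decoder(self.encoder(rgb))

model = CrossModalAE()
loss = nn.functional.l1_loss(model(torch.randn(2, 3, 64, 64)),
                             torch.randn(2, 1, 64, 64))  # raw sensor depth as target
```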
arXiv Detail & Related papers (2021-01-29T09:16:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.