GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution
- URL: http://arxiv.org/abs/2505.21375v2
- Date: Tue, 04 Nov 2025 15:32:06 GMT
- Title: GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution
- Authors: Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang
- Abstract summary: GeoLLaVA-8K is the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution. SuperRS-VQA and HighRS-VQA are the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks.
- Score: 66.85537534339238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA (avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies, Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.
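The abstract does not spell out how the two token strategies are implemented. The following is a minimal sketch of what background token pruning combined with anchored token selection could look like, assuming tokens are ranked by a precomputed objectness/saliency score; the function names and parameters are illustrative, not the authors' code.

```python
# Hypothetical sketch; the paper's actual scoring and anchoring rules may differ.
import torch

def prune_and_anchor(tokens, scores, keep_ratio=0.25, num_anchors=16):
    """tokens: (N, D) visual tokens; scores: (N,) objectness/saliency scores.

    Keeps the top `keep_ratio` fraction of tokens (pruning low-scoring
    background such as ocean or forest) and additionally retains
    `num_anchors` uniformly spaced tokens as coarse spatial anchors so the
    global layout is not lost.
    """
    n = tokens.size(0)
    k = max(1, int(n * keep_ratio))
    top_idx = scores.topk(k).indices                           # object-centric tokens
    anchor_idx = torch.linspace(0, n - 1, num_anchors).long()  # spatial anchors
    keep_idx = torch.unique(torch.cat([top_idx, anchor_idx]))
    return tokens[keep_idx], keep_idx

# Example: a 64x64 patch grid (4096 tokens) reduced to roughly 25% plus anchors.
tokens = torch.randn(4096, 1024)
scores = torch.rand(4096)
kept, idx = prune_and_anchor(tokens, scores)
print(kept.shape)
```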
Related papers
- Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding [78.26501371437013]
Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition. We find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. We propose a staged knowledge injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures; and (2) "pre-warming" on the same hard UHR image-text examples during SFT to stabilize and amplify subsequent tool-based RL.
arXiv Detail & Related papers (2026-02-15T16:40:33Z) - GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery [69.05066425853326]
"thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools.<n>This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny.<n>We propose GeoEyes, a training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom
arXiv Detail & Related papers (2026-02-15T15:50:55Z) - SERA-H: Beyond Native Sentinel Spatial Limits for High-Resolution Canopy Height Mapping [3.8902217877872034]
High-resolution mapping of canopy height is essential for forest management and biodiversity monitoring. We present SERA-H, an end-to-end model combining a super-resolution module and temporal attention encoding. Our model generates 2.5 m resolution height maps from freely available Sentinel-1 and Sentinel-2 time series data.
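As a rough illustration of the ingredients the SERA-H blurb names (temporal attention over a Sentinel time series plus a super-resolution head), here is a hedged PyTorch sketch; the architecture, channel counts, and 4x scale factor are assumptions, not the published model.

```python
# Minimal sketch (not the authors' code): attention-pool a Sentinel time
# series, then pixel-shuffle super-resolve from a 10 m to a 2.5 m grid.
import torch
import torch.nn as nn

class TemporalAttnSR(nn.Module):
    def __init__(self, in_ch=12, dim=64, scale=4):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, 3, padding=1)
        self.score = nn.Conv2d(dim, 1, 1)           # per-date attention logit
        self.head = nn.Sequential(
            nn.Conv2d(dim, dim * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),                 # 4x spatial upsampling
            nn.Conv2d(dim, 1, 3, padding=1),        # canopy height in metres
        )

    def forward(self, x):                           # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        f = self.embed(x.flatten(0, 1)).view(b, t, -1, h, w)
        a = torch.softmax(self.score(f.flatten(0, 1)).view(b, t, 1, h, w), dim=1)
        pooled = (a * f).sum(dim=1)                 # temporal attention pooling
        return self.head(pooled)

out = TemporalAttnSR()(torch.randn(2, 8, 12, 32, 32))
print(out.shape)  # (2, 1, 128, 128)
```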
arXiv Detail & Related papers (2025-12-19T23:23:14Z) - VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing [59.73939718087177]
VLM2GeoVec is a single-encoder vision-language model trained contrastively to embed interleaved inputs in a unified vector space. It unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing.
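The blurb says the encoder is trained contrastively; a standard way to do that is a symmetric InfoNCE objective, sketched below under the assumption of paired embeddings (the paper's exact loss may differ).

```python
# Generic symmetric InfoNCE over paired embeddings; illustrative only.
import torch
import torch.nn.functional as F

def info_nce(query_emb, target_emb, temperature=0.07):
    """query_emb/target_emb: (B, D) embeddings of paired interleaved inputs
    (e.g., image+region prompt vs. caption); row i of each is a positive pair."""
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature              # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    # Symmetric loss: match queries to targets and targets to queries.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```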
arXiv Detail & Related papers (2025-12-12T11:39:35Z) - Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search [44.758226499411904]
ZoomSearch is a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS.
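A 'where to look' search of this kind could plausibly be organized as a recursive quadrant descent; the sketch below illustrates that idea only, with `score_fn` standing in for whatever text-image relevance scorer the actual pipeline uses.

```python
# Hypothetical adaptive zoom search: split into quadrants, score each
# against the question, and descend into the best one.
from PIL import Image

def zoom_search(image, question, score_fn, depth=3, min_size=512):
    """score_fn(crop, question) -> float is an assumed relevance scorer
    (e.g., a CLIP-style similarity); it is not specified by the summary."""
    w, h = image.size
    for _ in range(depth):
        if min(w, h) <= min_size:
            break
        quads = [(0, 0, w // 2, h // 2), (w // 2, 0, w, h // 2),
                 (0, h // 2, w // 2, h), (w // 2, h // 2, w, h)]
        crops = [image.crop(q) for q in quads]
        best = max(range(4), key=lambda i: score_fn(crops[i], question))
        image = crops[best]                       # zoom into the best quadrant
        w, h = image.size
    return image                                  # focused crop for the VQA model
```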
arXiv Detail & Related papers (2025-11-25T16:25:54Z) - SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards [23.02076024811612]
DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text domain through stable reinforcement learning (RL). In this paper, we introduce SATORI (Spatially Anchored Task Optimization with ReInforcement learning), which decomposes VQA into three verifiable stages. Experiments demonstrate consistent performance improvements across seven VQA benchmarks, achieving up to a 15.7% improvement.
arXiv Detail & Related papers (2025-05-25T11:11:06Z) - 8-Calves Image dataset [0.8233028449337972]
We introduce the 8-Calves dataset, a challenging benchmark for multi-animal detection, tracking, and identification. It features a one-hour video of eight Holstein Friesian calves in a barn, with frequent occlusions, motion blur, and diverse poses. A semi-automated pipeline using a fine-tuned YOLOv8 detector and ByteTrack, followed by manual correction, provides over 537,000 bounding boxes with temporal identity labels.
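For readers who want to reproduce a similar detect-and-track pass, the ultralytics API supports ByteTrack directly; the weights file and video path below are placeholders rather than the dataset's actual assets.

```python
# Sketch of the detect-and-track pipeline the blurb describes, via ultralytics.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # placeholder; the paper fine-tunes its own weights
results = model.track("calves.mp4", tracker="bytetrack.yaml", persist=True)
for frame in results:
    boxes = frame.boxes.xyxy          # per-frame bounding boxes
    ids = frame.boxes.id              # temporal identity labels from ByteTrack
```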
arXiv Detail & Related papers (2025-03-17T23:47:52Z) - When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning [31.696397337675847]
Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images. We propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method outperforms existing high-resolution strategies on four datasets using the same data.
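One plausible reading of text-guided pruning over an image pyramid: score coarse tiles against the question embedding and only re-encode the most relevant tiles at finer pyramid levels. The sketch below assumes precomputed tile and text features; it is not claimed to be the paper's method.

```python
# Hedged sketch of text-guided tile selection on a coarse pyramid level.
import torch

def select_tiles(tile_feats, text_feat, keep=8):
    """tile_feats: (N, D) pooled features of coarse tiles; text_feat: (D,)."""
    sims = torch.nn.functional.cosine_similarity(
        tile_feats, text_feat.unsqueeze(0), dim=-1)   # text-guided relevance
    return sims.topk(keep).indices                    # tiles to re-encode finely

# Coarse level with 16x16 tiles; keep the 8 most question-relevant tiles.
idx = select_tiles(torch.randn(256, 768), torch.randn(768))
```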
arXiv Detail & Related papers (2025-03-10T17:51:16Z) - GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing [32.85223015863783]
GeoPixel is an end-to-end high-resolution RS-LMM that supports pixel-level grounding. It supports up to 4K HD resolution in any aspect ratio, ideal for high-precision RS image analysis. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks.
arXiv Detail & Related papers (2025-01-23T18:59:30Z) - RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model [59.37279559684668]
We introduce RS-vHeat, an efficient multi-modal remote sensing foundation model. Specifically, RS-vHeat applies the Heat Conduction Operator (HCO) with a complexity of $O(N^{1.5})$ and a global receptive field. Compared to attention-based remote sensing foundation models, we reduce memory usage by 84%, reduce FLOPs by 24%, and improve throughput by 2.7 times.
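The summary does not explain how an HCO works internally. One textbook way to apply heat conduction globally is to solve the 2-D heat equation in the DCT domain, where each cosine mode decays exponentially with its frequency; the sketch below illustrates that idea only and is not claimed to match RS-vHeat's operator or its $O(N^{1.5})$ formulation.

```python
# Closed-form heat diffusion of a feature map via the 2-D DCT; illustrative.
import numpy as np
from scipy.fft import dctn, idctn

def heat_conduct(feature_map, k=0.5, t=1.0):
    """feature_map: (H, W) channel slice; k: conductivity; t: diffusion time."""
    h, w = feature_map.shape
    u = dctn(feature_map, norm="ortho")
    fy = np.pi * np.arange(h)[:, None] / h
    fx = np.pi * np.arange(w)[None, :] / w
    decay = np.exp(-k * t * (fy**2 + fx**2))   # each cosine mode decays by frequency
    return idctn(u * decay, norm="ortho")      # globally mixed feature map

out = heat_conduct(np.random.rand(64, 64))
```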
arXiv Detail & Related papers (2024-11-27T01:43:38Z) - Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models [26.322856874796702]
Vision transformers (ViTs) struggle to capture fine-grained details from less prominent objects, charts, and embedded text.
We extend recent high-resolution and multi-crop techniques by not only preserving the native resolution, but also zooming in beyond it.
This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs.
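A minimal sketch of the multi-resolution zoom-in idea, assuming one global view plus four quadrant crops taken from a 2x upsampled image; the crop layout and patch size are illustrative assumptions, not Dragonfly's published recipe.

```python
# One global low-res view plus four zoomed quadrant views beyond native resolution.
from PIL import Image

def multi_resolution_crops(image, patch=336, zoom=2):
    crops = [image.resize((patch, patch))]            # global low-res view
    w, h = image.size
    zw, zh = w * zoom // 2, h * zoom // 2
    zoomed = image.resize((w * zoom, h * zoom))       # zoom beyond native resolution
    for x in (0, zw):
        for y in (0, zh):
            crops.append(zoomed.crop((x, y, x + zw, y + zh)).resize((patch, patch)))
    return crops                                      # 1 global + 4 zoomed views
```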
arXiv Detail & Related papers (2024-06-03T04:17:12Z) - VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis [48.06425266787859]
This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis. VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD) and an honest instruction dataset comprising both factual and deceptive questions (HnstD). In our experiments, VHM significantly outperforms various vision language models on common tasks of scene classification, visual question answering, and visual grounding.
arXiv Detail & Related papers (2024-03-29T14:50:43Z) - SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection [79.23689506129733]
We establish a new benchmark dataset and an open-source method for large-scale SAR object detection.
Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets.
To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created.
arXiv Detail & Related papers (2024-03-11T09:20:40Z) - A Novel Multi-scale Attention Feature Extraction Block for Aerial Remote
Sensing Image Classification [9.388978548253755]
We propose a novel plug-and-play multi-scale attention feature extraction block (MSAFEB) based on multi-scale convolution at two levels with skip connection.
The experimental study on two benchmark VHR aerial RS image datasets (AID and NWPU) demonstrates that our proposal achieves stable and consistent performance (minimum standard deviation of $0.002$) and competent overall classification performance (AID: 95.85%; NWPU: 94.09%).
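A generic PyTorch rendering of a multi-scale attention feature extraction block, assuming parallel 1/3/5 convolutions, squeeze-and-excite channel attention, and a residual skip; the paper's exact MSAFEB configuration may differ.

```python
# Illustrative multi-scale attention block; kernel choices are assumed.
import torch
import torch.nn as nn

class MSAFEB(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, k, padding=k // 2) for k in (1, 3, 5))
        self.attn = nn.Sequential(                   # squeeze-excite channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid())
        self.act = nn.ReLU()

    def forward(self, x):
        multi = sum(b(x) for b in self.branches)       # multi-scale fusion
        return self.act(x + multi * self.attn(multi))  # attention + skip connection

y = MSAFEB()(torch.randn(1, 64, 32, 32))
```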
arXiv Detail & Related papers (2023-08-27T11:49:46Z) - Recurrent Multi-scale Transformer for High-Resolution Salient Object
Detection [68.65338791283298]
Salient Object Detection (SOD) aims to identify and segment the most conspicuous objects in an image or video.
Traditional SOD methods are largely limited to low-resolution images, making them ill-suited to the development of High-Resolution SOD.
In this work, we first propose a new HRS10K dataset, which contains 10,500 high-quality annotated images at 2K-8K resolution.
arXiv Detail & Related papers (2023-08-07T17:49:04Z) - Scaling Data Generation in Vision-and-Language Navigation [116.95534559103788]
We propose an effective paradigm for generating large-scale data for learning.
We apply 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesize 4.9 million instruction-trajectory pairs.
Thanks to our large-scale dataset, the performance of an existing agent can be pushed to a new best of 80% single-run success rate on the R2R test split (+11% absolute over the previous SoTA) by simple imitation learning.
arXiv Detail & Related papers (2023-07-28T16:03:28Z) - An Attention-Fused Network for Semantic Segmentation of
Very-High-Resolution Remote Sensing Imagery [26.362854938949923]
We propose a novel convolutional neural network architecture, named the attention-fused network (AFNet).
We achieve state-of-the-art performance with an overall accuracy of 91.7% and a mean F1 score of 90.96% on the ISPRS Vaihingen 2D dataset and the ISPRS Potsdam 2D dataset.
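The blurb does not describe AFNet's fusion mechanism; a common attention-fused pattern is a learned per-pixel gate blending two feature streams, sketched below as a generic illustration rather than the paper's actual design.

```python
# Generic attention-based fusion of two feature streams (e.g., a low-level
# detail branch and a high-level semantic branch); illustrative only.
import torch
import torch.nn as nn

class AttentionFuse(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid())  # per-pixel fusion weights

    def forward(self, low, high):
        w = self.gate(torch.cat([low, high], dim=1))
        return w * low + (1 - w) * high              # attention-weighted blend

fused = AttentionFuse(64)(torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64))
```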
arXiv Detail & Related papers (2021-05-10T06:23:27Z) - Self-Supervised Representation Learning for RGB-D Salient Object
Detection [93.17479956795862]
We use self-supervised representation learning to design two pretext tasks: a cross-modal auto-encoder and depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets for pre-training, which makes the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
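As a hedged sketch of the cross-modal auto-encoder pretext task (RGB in, depth out; the symmetric direction would mirror it), with an assumed toy backbone and L1 reconstruction loss:

```python
# Toy cross-modal auto-encoder pretext: reconstruct depth from RGB, so
# pre-training needs no manual labels, only raw RGB-D pairs. Illustrative.
import torch
import torch.nn as nn

class CrossModalAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1))  # predict depth

    def forward(self, rgb):
        return self.decoder(self.encoder(rgb))

model = CrossModalAE()
loss = nn.functional.l1_loss(model(torch.randn(2, 3, 64, 64)),
                             torch.randn(2, 1, 64, 64))  # raw sensor depth as target
```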
arXiv Detail & Related papers (2021-01-29T09:16:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.