FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing
- URL: http://arxiv.org/abs/2512.24022v1
- Date: Tue, 30 Dec 2025 06:48:07 GMT
- Title: FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing
- Authors: Yunkai Dang, Donghao Wang, Jiacheng Yang, Yifan Jiang, Meiyi Zhu, Yuekun Yang, Cong Wang, Qi Fan, Wenbin Li, Yang Gao
- Abstract summary: MF-RSVLM is a Multi-Feature Fusion Remote Sensing Vision-Language Model. It learns multi-scale visual representations and combines global context with local details. It achieves state-of-the-art or highly competitive performance across remote sensing classification, image captioning, and VQA tasks.
- Score: 21.38912956638889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language models (VLMs) exhibit strong performance across various tasks. However, these VLMs encounter significant challenges when applied to the remote sensing domain due to the inherent differences between remote sensing images and natural images. Existing remote sensing VLMs often fail to extract fine-grained visual features and suffer from visual forgetting during deep language processing. To address this, we introduce MF-RSVLM, a Multi-Feature Fusion Remote Sensing Vision-Language Model that effectively extracts and fuses visual features for RS understanding. MF-RSVLM learns multi-scale visual representations and combines global context with local details, improving the capture of small and complex structures in RS scenes. A recurrent visual feature injection scheme ensures the language model remains grounded in visual evidence and reduces visual forgetting during generation. Extensive experiments on diverse RS benchmarks show that MF-RSVLM achieves state-of-the-art or highly competitive performance across remote sensing classification, image captioning, and VQA tasks. Our code is publicly available at https://github.com/Yunkaidang/RSVLM.
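The abstract names two mechanisms (multi-scale fusion and recurrent visual feature injection) but gives no implementation details; the minimal PyTorch sketch below shows one plausible reading. The module names, dimensions, and the cross-attention-based injection are assumptions, not the authors' published design.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Hypothetical fusion of global-context and local-detail features."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, global_feats, local_feats):
        # global_feats, local_feats: (B, N, D) coarse and fine tokens
        return self.proj(torch.cat([global_feats, local_feats], dim=-1))

class RecurrentVisualInjection(nn.Module):
    """Hypothetical re-injection of visual tokens at every LM layer
    via cross-attention, to counter visual forgetting."""
    def __init__(self, dim=768, n_layers=4, n_heads=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, text_hidden, visual_tokens):
        h = text_hidden
        for attn in self.layers:
            # Text tokens re-attend to the fixed visual tokens each layer.
            injected, _ = attn(h, visual_tokens, visual_tokens)
            h = h + injected  # residual keeps the language stream intact
        return h

fusion = MultiScaleFusion()
inject = RecurrentVisualInjection()
vis = fusion(torch.randn(2, 196, 768), torch.randn(2, 196, 768))
out = inject(torch.randn(2, 32, 768), vis)
print(out.shape)  # torch.Size([2, 32, 768])
```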
Related papers
- Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models [58.91911788912665]
We propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discriminative visual representations. Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information.
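The summary describes the objective only at a high level; one plausible reading is a reconstruction loss asking the LM's hidden states to predict the frozen vision encoder's latents. The sketch below is an assumption, not LaVer's actual loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def latent_reconstruction_loss(lm_hidden, target_latents, head):
    """Hypothetical LaVer-style objective: regress the frozen vision
    encoder's latents from LM hidden states at visual token positions.
    lm_hidden:      (B, N, D_lm)  hidden states over visual token slots
    target_latents: (B, N, D_v)   latents from the frozen vision encoder
    head:           projection from D_lm to D_v
    """
    pred = F.normalize(head(lm_hidden), dim=-1)
    tgt = F.normalize(target_latents, dim=-1)
    # Cosine regression on normalized latents (one common choice).
    return (1 - (pred * tgt).sum(-1)).mean()

head = nn.Linear(4096, 1024)
loss = latent_reconstruction_loss(
    torch.randn(2, 196, 4096), torch.randn(2, 196, 1024), head)
print(loss.item())
```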
arXiv Detail & Related papers (2025-12-06T04:20:13Z)
- Co-Training Vision Language Models for Remote Sensing Multi-task Learning [68.15604397741753]
Vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. We present RSCoVLM, a simple yet flexible VLM baseline for RS multi-task learning (MTL). We propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery.
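The dynamic-resolution strategy is not specified here; a common policy of this kind, sketched below under that assumption, rescales an image so its patch grid fits a token budget and snaps dimensions to patch multiples.

```python
import math

def dynamic_resolution(h, w, patch=14, max_tokens=1024):
    """Hypothetical dynamic-resolution policy: scale an RS image so its
    patch grid fits a token budget, then snap to patch multiples."""
    tokens = math.ceil(h / patch) * math.ceil(w / patch)
    scale = min(1.0, math.sqrt(max_tokens / tokens))
    new_h = max(patch, round(h * scale / patch) * patch)
    new_w = max(patch, round(w * scale / patch) * patch)
    return new_h, new_w

# A 10000x10000 UHR scene is downscaled; a 448x448 chip is untouched.
print(dynamic_resolution(10000, 10000))  # (448, 448): 32x32 = 1024 tokens
print(dynamic_resolution(448, 448))      # (448, 448): already in budget
```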
arXiv Detail & Related papers (2025-11-26T10:55:07Z)
- On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations [53.611451075703314]
Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning. We show how adversarial frequency-domain transformations undermine authenticity/DeepFake detection and automated image captioning tasks.
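The exact perturbations are not described in this summary; a generic attack of this family, sketched below as an assumption, injects noise into high-frequency bands while leaving the low-frequency content humans mostly rely on intact.

```python
import numpy as np

def frequency_perturbation(img, cutoff=0.1, strength=0.2, seed=0):
    """Generic high-frequency perturbation (illustrative, not the
    paper's exact attack). img: float array in [0, 1], HxW or HxWxC."""
    rng = np.random.default_rng(seed)
    f = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    h, w = img.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    mask = (radius > cutoff).astype(float)   # 1 in the high-frequency band
    if img.ndim == 3:
        mask = mask[..., None]               # broadcast over channels
    noise = rng.standard_normal(f.shape) * strength * np.abs(f)
    out = np.fft.ifft2(np.fft.ifftshift(f + noise * mask, axes=(0, 1)),
                       axes=(0, 1)).real
    return np.clip(out, 0, 1)

img = np.random.rand(64, 64, 3)
adv = frequency_perturbation(img)
print(float(np.abs(adv - img).mean()))       # small per-pixel change
```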
arXiv Detail & Related papers (2025-07-30T05:41:29Z)
- A Vision Centric Remote Sensing Benchmark [21.48675282619887]
This study investigates the limitations of CLIP-based MLLMs in remote sensing tasks. We introduce a remote sensing multimodal visual patterns (RSMMVP) benchmark. It is designed to evaluate MLLMs on RS tasks by identifying CLIP-blind pairs. We analyze the performance of state-of-the-art MLLMs, revealing significant limitations in RS-specific representation learning.
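CLIP-blind pairs are typically mined by contrasting CLIP similarity with a vision-only encoder such as DINOv2; the sketch below follows that general recipe, with thresholds chosen purely for illustration.

```python
import numpy as np

def find_clip_blind_pairs(clip_emb, dino_emb, clip_thresh=0.95, dino_thresh=0.6):
    """Find image pairs CLIP sees as near-identical but a vision-only
    encoder distinguishes (thresholds are illustrative assumptions).
    clip_emb, dino_emb: (N, D) L2-normalized embeddings of N images."""
    clip_sim = clip_emb @ clip_emb.T
    dino_sim = dino_emb @ dino_emb.T
    pairs = []
    n = len(clip_emb)
    for i in range(n):
        for j in range(i + 1, n):
            if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh:
                pairs.append((i, j))
    return pairs

rng = np.random.default_rng(0)
e1 = rng.standard_normal((8, 512)); e1 /= np.linalg.norm(e1, axis=1, keepdims=True)
e2 = rng.standard_normal((8, 384)); e2 /= np.linalg.norm(e2, axis=1, keepdims=True)
print(find_clip_blind_pairs(e1, e2))  # likely [] for random embeddings
```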
arXiv Detail & Related papers (2025-03-20T03:03:46Z)
- Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method [10.748210940033484]
Large language models (LLMs) and vision-language models (VLMs) have achieved significant success. Due to the substantial differences between remote sensing images and conventional optical images, these models face challenges in comprehension. This letter explores the application of VLMs for object detection in remote sensing images.
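A common way to cast detection as SFT is to serialize boxes into normalized coordinate strings in the target text; the tag and coordinate format below are assumptions, since the letter's exact format is not given here.

```python
def boxes_to_sft_target(boxes, labels, w, h):
    """Serialize detections into a text target for supervised fine-tuning
    (the format is an assumption; conventions vary across papers).
    boxes: list of (x1, y1, x2, y2) in pixels; labels: class names."""
    parts = []
    for (x1, y1, x2, y2), name in zip(boxes, labels):
        # Normalize to [0, 1000) integers, a common VLM convention.
        coords = [round(1000 * v / s) for v, s in
                  ((x1, w), (y1, h), (x2, w), (y2, h))]
        parts.append(f"<ref>{name}</ref><box>[{coords[0]},{coords[1]},"
                     f"{coords[2]},{coords[3]}]</box>")
    return "".join(parts)

print(boxes_to_sft_target([(120, 40, 360, 200)], ["airplane"], 800, 600))
# <ref>airplane</ref><box>[150,67,450,333]</box>
```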
arXiv Detail & Related papers (2025-03-11T08:02:54Z)
- UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models [23.044366104080822]
We introduce UniRS, the first vision-language model unifying multi-temporal remote sensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks.
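The summary says the model accepts three input types; a trivial normalization step, sketched below purely as a guess at the interface, tags each frame with its temporal mode before encoding.

```python
def unify_temporal_input(images):
    """Hypothetical input normalization for a multi-temporal RS VLM:
    single image, dual-time pair, or video all become a tagged frame list."""
    if len(images) == 1:
        mode = "single"
    elif len(images) == 2:
        mode = "pair"      # e.g. pre/post change-detection imagery
    else:
        mode = "video"
    return [{"frame": img, "t": i, "mode": mode} for i, img in enumerate(images)]

print([f["mode"] for f in unify_temporal_input(["t0.png", "t1.png"])])
# ['pair', 'pair']
```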
arXiv Detail & Related papers (2024-12-30T06:34:18Z)
- RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts [17.76606110070648]
We propose RSUniVLM, a unified, end-to-end RS VLM for comprehensive vision understanding across multiple granularities. RSUniVLM performs effectively in multi-image analysis, for instance change detection and change captioning. We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both the RS and general domains.
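How the granularity-oriented MoE routes tokens is not stated here; the sketch below assumes one expert per task granularity selected by a task id, a deliberate simplification of a learned router.

```python
import torch
import torch.nn as nn

class GranularityMoE(nn.Module):
    """Hypothetical granularity-oriented MoE: one expert per task
    granularity (image / region / pixel), selected by task id rather
    than a learned per-token router (a simplifying assumption)."""
    GRANULARITIES = {"image": 0, "region": 1, "pixel": 2}

    def __init__(self, dim=768, hidden=3072):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                          nn.Linear(hidden, dim))
            for _ in self.GRANULARITIES
        )

    def forward(self, x, granularity):
        return self.experts[self.GRANULARITIES[granularity]](x)

moe = GranularityMoE()
tokens = torch.randn(2, 64, 768)
out = moe(tokens, "region")   # region-level tasks use the region expert
print(out.shape)              # torch.Size([2, 64, 768])
```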
arXiv Detail & Related papers (2024-12-07T15:11:21Z)
- Aquila: A Hierarchically Aligned Visual-Language Model for Enhanced Remote Sensing Image Comprehension [6.29665399879184]
We present Aquila, an advanced visual language foundation model for remote sensing images.
Aquila enables richer visual feature representation and more precise visual-language feature alignment.
We validate the effectiveness of Aquila through extensive quantitative experiments and qualitative analyses.
arXiv Detail & Related papers (2024-11-09T05:31:56Z)
- Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts [65.04791072532106]
We present LoCoVQA, a benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs).
LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts.
This test assesses how well VLMs can ignore irrelevant information when answering queries.
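The generator's augmentation can be pictured as hiding the query-relevant image among distractors while keeping the question answerable; the layout details below are assumptions.

```python
import random

def build_long_context_example(target, distractors, n_distractors=8, seed=0):
    """LoCoVQA-style augmentation (layout details are assumptions):
    place the query-relevant image among visually similar distractors
    and record where it landed, so ground truth is preserved."""
    rng = random.Random(seed)
    chosen = rng.sample(distractors, n_distractors)
    grid = chosen + [target]
    rng.shuffle(grid)
    return {"images": grid, "target_index": grid.index(target)}

ex = build_long_context_example("cat.png", [f"d{i}.png" for i in range(20)])
print(len(ex["images"]), ex["target_index"])  # 9 images, one is the target
```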
arXiv Detail & Related papers (2024-06-24T17:58:03Z)
- RS-Mamba for Large Remote Sensing Image Dense Prediction [58.12667617617306]
We propose the Remote Sensing Mamba (RSM) for dense prediction tasks in large very-high-resolution (VHR) remote sensing images.
RSM is specifically designed to capture the global context of remote sensing images with linear complexity.
Our model achieves better efficiency and accuracy than transformer-based models on large remote sensing images.
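The linear-complexity claim comes from state-space-style recurrent scans; the toy diagonal recurrence below illustrates the O(L) idea over a flattened patch sequence and is not RS-Mamba's actual selective-scan kernel.

```python
import torch

def linear_scan(x, decay, inp):
    """Toy diagonal state-space scan, O(L) in sequence length L
    (a didactic stand-in for Mamba's selective scan, not the real kernel).
    x: (B, L, D) patch features; decay, inp: (D,) per-channel parameters."""
    B, L, D = x.shape
    h = torch.zeros(B, D)
    ys = []
    a = torch.sigmoid(decay)          # keep the recurrence stable in (0, 1)
    for t in range(L):                # one pass: global context, linear cost
        h = a * h + inp * x[:, t]
        ys.append(h)
    return torch.stack(ys, dim=1)     # (B, L, D)

x = torch.randn(2, 4096, 64)          # 64x64 patch grid, flattened
y = linear_scan(x, torch.randn(64), torch.randn(64))
print(y.shape)                        # torch.Size([2, 4096, 64])
```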
arXiv Detail & Related papers (2024-04-03T12:06:01Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
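Chain-of-Spot is a two-pass scheme: the model first names a region of interest, then answers with the zoomed-in crop added alongside the original image. The prompts and box format below are assumptions, and `vlm` stands for any callable image-text model.

```python
from PIL import Image

def chain_of_spot(image, question, vlm):
    """Two-pass interactive reasoning in the spirit of Chain-of-Spot
    (prompts and box format are assumptions, not the paper's exact ones).
    `vlm` is any callable: (images, prompt) -> text."""
    # Pass 1: ask the model where to look.
    roi_text = vlm([image], f"To answer '{question}', which region matters? "
                            "Reply as x1,y1,x2,y2 in [0,1].")
    x1, y1, x2, y2 = (float(v) for v in roi_text.split(","))
    w, h = image.size
    crop = image.crop((x1 * w, y1 * h, x2 * w, y2 * h))
    # Pass 2: answer with both the full view and the zoomed-in crop,
    # gaining detail without changing the original image resolution.
    return vlm([image, crop], question)

# Stub model for demonstration only.
fake_vlm = lambda imgs, p: "0.25,0.25,0.75,0.75" if "region" in p else "a red car"
img = Image.new("RGB", (640, 480))
print(chain_of_spot(img, "What is in the parking lot?", fake_vlm))  # a red car
```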
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment [61.769441954135246]
We introduce a method to train vision-language models for remote-sensing images without using any textual annotations.
Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language.
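The method can be read as CLIP-style contrastive alignment between remote images and co-located ground photos, inheriting CLIP's ground-image/text alignment without any captions; the sketch below assumes an InfoNCE objective.

```python
import torch
import torch.nn.functional as F

def ground_remote_contrastive_loss(remote_emb, ground_emb, tau=0.07):
    """CLIP-style InfoNCE aligning remote-sensing embeddings with CLIP
    embeddings of co-located ground photos (a sketch of the paper's idea:
    no text appears in training; CLIP's ground-image/text alignment is
    inherited). Row i of each tensor corresponds to the same location."""
    r = F.normalize(remote_emb, dim=-1)
    g = F.normalize(ground_emb, dim=-1)
    logits = r @ g.T / tau
    targets = torch.arange(len(r))
    # Symmetric loss: match remote->ground and ground->remote.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = ground_remote_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```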
arXiv Detail & Related papers (2023-12-12T03:39:07Z)