Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method
- URL: http://arxiv.org/abs/2503.08144v2
- Date: Thu, 20 Mar 2025 13:21:00 GMT
- Title: Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method
- Authors: Fei Wang, Chengcheng Chen, Hongyu Chen, Yugang Chang, Weiming Zeng
- Abstract summary: Large language models (LLMs) and vision-language models (VLMs) have achieved significant success. Due to the substantial differences between remote sensing images and conventional optical images, these models face challenges in comprehension. This letter explores the application of VLMs for object detection in remote sensing images.
- Score: 10.748210940033484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large language models (LLMs) and vision-language models (VLMs) have achieved significant success, demonstrating remarkable capabilities in understanding various images and videos, particularly in classification and detection tasks. However, due to the substantial differences between remote sensing images and conventional optical images, these models face considerable challenges in comprehension, especially in detection tasks. Directly prompting VLMs with detection instructions often leads to unsatisfactory results. To address this issue, this letter explores the application of VLMs for object detection in remote sensing images. Specifically, we constructed supervised fine-tuning (SFT) datasets using publicly available remote sensing object detection datasets, including SSDD, HRSID, and NWPU-VHR-10. In these new datasets, we converted annotation information into JSON-compliant natural language descriptions, facilitating more effective understanding and training for the VLM. We then evaluate the detection performance of various fine-tuning strategies for VLMs and derive optimized model weights for object detection in remote sensing images. Finally, we evaluate the model's prior knowledge capabilities using natural language queries. Experimental results demonstrate that, without modifying the model architecture, remote sensing object detection can be effectively achieved using natural language alone. Additionally, the model exhibits the ability to perform certain vision question answering (VQA) tasks. Our datasets and related code will be released soon.
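The abstract describes converting detection annotations into JSON-compliant natural-language descriptions to build the SFT datasets. Below is a minimal sketch of what such a conversion might look like; the field names ("image", "question", "answer"), prompt wording, and coordinate format are illustrative assumptions and do not reflect the authors' released schema.

```python
# Hypothetical sketch: turn one detection annotation into a JSON-compliant
# natural-language SFT sample. Field names and prompt text are assumptions,
# not the paper's actual schema.
import json

def annotation_to_sft_sample(image_path, objects):
    """objects: list of dicts like {"class": "ship", "bbox": [x1, y1, x2, y2]}."""
    # The model is asked to answer in JSON, so the ground-truth response is a
    # JSON string listing each object's label and bounding box.
    answer = json.dumps(
        [{"label": obj["class"], "bbox": obj["bbox"]} for obj in objects],
        ensure_ascii=False,
    )
    return {
        "image": image_path,
        "question": "Detect all objects in this remote sensing image and "
                    "report each one as JSON with its label and bounding box.",
        "answer": answer,
    }

# Example: an SSDD-style annotation with a single ship (made-up coordinates).
sample = annotation_to_sft_sample(
    "ssdd/000123.jpg",
    [{"class": "ship", "bbox": [34, 58, 112, 140]}],
)
print(json.dumps(sample, indent=2))
```

Pairs of this form (image path, natural-language instruction, JSON-formatted answer) could then be fed to a standard VLM supervised fine-tuning pipeline without any architectural change, which is the setting the abstract describes.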
Related papers
- Re-Aligning Language to Visual Objects with an Agentic Workflow [73.73778652260911]
Language-based object detection aims to align visual objects with language expressions.
Recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects.
We propose an agentic workflow controlled by an LLM to re-align language to visual objects via adaptively adjusting image and text prompts.
arXiv Detail & Related papers (2025-03-30T16:41:12Z) - EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing [3.3072144045024396]
EagleVision is an MLLM tailored for remote sensing that excels in object detection and attribute comprehension.
We construct EVAttrs-95K, the first large-scale object attribute understanding dataset in RS for instruction tuning.
EagleVision achieves state-of-the-art performance on both fine-grained object detection and object attribute understanding tasks.
arXiv Detail & Related papers (2025-03-30T06:13:13Z) - Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models [0.6715525121432597]
This research presents a novel vision language model (VLM) framework to enhance feature extraction, scalability, and efficiency.
We evaluate the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise.
Our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
arXiv Detail & Related papers (2025-03-08T01:22:10Z) - Generalization-Enhanced Few-Shot Object Detection in Remote Sensing [22.411751110592842]
Few-shot object detection (FSOD) targets object detection challenges in data-limited conditions.
We propose the Generalization-Enhanced Few-Shot Object Detection (GE-FSOD) model to improve the generalization capability in remote sensing tasks.
Our model introduces three key innovations: the Cross-Level Fusion Pyramid Attention Network (CFPAN), the Multi-Stage Refinement Region Proposal Network (MRRPN), and the Generalized Classification Loss (GCL).
arXiv Detail & Related papers (2025-01-05T08:12:25Z) - Oriented Tiny Object Detection: A Dataset, Benchmark, and Dynamic Unbiased Learning [51.170479006249195]
We introduce a new dataset, benchmark, and a dynamic coarse-to-fine learning scheme in this study.
Our proposed dataset, AI-TOD-R, features the smallest object sizes among all oriented object detection datasets.
We present a benchmark spanning a broad range of detection paradigms, including both fully-supervised and label-efficient approaches.
arXiv Detail & Related papers (2024-12-16T09:14:32Z) - RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts [17.76606110070648]
We propose RSUniVLM, a unified, end-to-end RS VLM for comprehensive vision understanding across multiple granularities. RSUniVLM performs effectively in multi-image analysis, with instances of change detection and change captioning. We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both the RS and general domains.
arXiv Detail & Related papers (2024-12-07T15:11:21Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - Aquila: A Hierarchically Aligned Visual-Language Model for Enhanced Remote Sensing Image Comprehension [6.29665399879184]
We present Aquila, an advanced visual language foundation model for remote sensing images.
Aquila enables richer visual feature representation and more precise visual-language feature alignment.
We validate the effectiveness of Aquila through extensive quantitative experiments and qualitative analyses.
arXiv Detail & Related papers (2024-11-09T05:31:56Z) - From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing [16.755590790629153]
This review examines the development and application of multi-modal language models (MLLMs) in remote sensing.
We focus on their ability to interpret and describe satellite imagery using natural language.
Key applications such as scene description, object detection, change detection, text-to-image retrieval, image-to-text generation, and visual question answering are discussed.
arXiv Detail & Related papers (2024-11-05T12:14:22Z) - Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - Interactive Masked Image Modeling for Multimodal Object Detection in Remote Sensing [2.0528748158119434]
Multimodal learning can be used to integrate features from different data modalities, thereby improving detection accuracy.
In this paper, we propose to use Masked Image Modeling (MIM) as a pre-training technique, leveraging self-supervised learning on unlabeled data.
We further propose a new interactive MIM method that can establish interactions between different tokens, which is particularly beneficial for object detection in remote sensing.
arXiv Detail & Related papers (2024-09-13T14:50:50Z) - RS-Mamba for Large Remote Sensing Image Dense Prediction [58.12667617617306]
We propose the Remote Sensing Mamba (RSM) for dense prediction tasks in large VHR remote sensing images.
RSM is specifically designed to capture the global context of remote sensing images with linear complexity.
Our model achieves better efficiency and accuracy than transformer-based models on large remote sensing images.
arXiv Detail & Related papers (2024-04-03T12:06:01Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z) - Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations [103.00383924074585]
Visual relationship detection aims to reason over relationships among salient objects in images.
We propose a novel approach named Visual-Linguistic Representations from Transformers (RVL-BERT).
RVL-BERT performs spatial reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training.
arXiv Detail & Related papers (2020-09-10T16:15:09Z)