SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction
Tuning with Large Language Model
- URL: http://arxiv.org/abs/2401.09712v1
- Date: Thu, 18 Jan 2024 04:10:20 GMT
- Title: SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction
Tuning with Large Language Model
- Authors: Yang Zhan, Zhitong Xiong, Yuan Yuan
- Abstract summary: We introduce SkyEyeGPT, a unified multi-modal large language model specifically designed for RS vision-language understanding.
With a simple yet effective design, SkyEyeGPT works surprisingly well on considerably different tasks without the need for extra encoding modules.
Experiments on 8 datasets for RS vision-language tasks demonstrate SkyEyeGPT's superiority in image-level and region-level tasks.
- Score: 12.19132018279148
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have recently been extended to the
vision-language realm, obtaining impressive general multi-modal capabilities.
However, the exploration of multi-modal large language models (MLLMs) for
remote sensing (RS) data is still in its infancy, and the performance is not
satisfactory. In this work, we introduce SkyEyeGPT, a unified multi-modal large
language model specifically designed for RS vision-language understanding. To
this end, we meticulously curate an RS multi-modal instruction tuning dataset,
including single-task and multi-task conversation instructions. After manual
verification, we obtain a high-quality RS instruction-following dataset with
968k samples. Our research demonstrates that with a simple yet effective
design, SkyEyeGPT works surprisingly well on considerably different tasks
without the need for extra encoding modules. Specifically, RS visual features
are projected into the language domain via an alignment layer and then fed,
jointly with task-specific instructions, into an LLM-based RS decoder that
predicts answers for open-ended RS tasks. In addition, we design a two-stage tuning
method to enhance instruction-following and multi-turn dialogue ability at
different granularities. Experiments on 8 datasets for RS vision-language tasks
demonstrate SkyEyeGPT's superiority in image-level and region-level tasks, such
as captioning and visual grounding. In particular, SkyEyeGPT exhibits
encouraging results compared to GPT-4V in some qualitative tests. The online
demo, code, and dataset will be released at
https://github.com/ZhanYang-nwpu/SkyEyeGPT.
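A minimal sketch of the pipeline described above (visual features projected into the language space by an alignment layer, then concatenated with instruction tokens and decoded by an LLM) may help fix the idea. All module names, dimensions, and the choice of a single linear projection below are illustrative assumptions, not SkyEyeGPT's released implementation.

    import torch
    import torch.nn as nn

    class AlignmentLayer(nn.Module):
        """Projects remote-sensing visual features into the LLM embedding space."""
        def __init__(self, vis_dim: int, llm_dim: int):
            super().__init__()
            # Hypothetical single linear projection; the paper only states that an
            # alignment layer maps visual features to the language domain.
            self.proj = nn.Linear(vis_dim, llm_dim)

        def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
            # (batch, num_patches, vis_dim) -> (batch, num_patches, llm_dim)
            return self.proj(vis_feats)

    def build_llm_inputs(vis_feats, instruction_ids, align, embed):
        """Concatenate projected visual tokens with embedded instruction tokens."""
        vis_tokens = align(vis_feats)         # visual tokens now live in the language space
        text_tokens = embed(instruction_ids)  # task-specific instruction embeddings
        return torch.cat([vis_tokens, text_tokens], dim=1)

    # Toy usage with random tensors standing in for a frozen RS vision encoder
    # and an LLM's embedding table (all dimensions are assumptions).
    vis_dim, llm_dim, vocab = 1024, 4096, 32000
    align = AlignmentLayer(vis_dim, llm_dim)
    embed = nn.Embedding(vocab, llm_dim)
    vis_feats = torch.randn(1, 256, vis_dim)            # e.g., 256 visual patches
    instruction_ids = torch.randint(0, vocab, (1, 32))  # tokenized task instruction
    inputs_embeds = build_llm_inputs(vis_feats, instruction_ids, align, embed)
    print(inputs_embeds.shape)  # torch.Size([1, 288, 4096])

In the full model, this concatenated sequence would be consumed by the LLM-based decoder, which generates the answer autoregressively for the given open-ended RS task.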
Related papers
- Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models [127.38740043393527]
We propose ViFT, a visual instruction-free fine-tuning framework for LVLMs.
We require only text-only instructions and image caption data during training to separately learn task-solving and visual perception abilities.
Experimental results demonstrate that ViFT can achieve state-of-the-art performance on several visual reasoning and visual instruction following benchmarks.
arXiv Detail & Related papers (2025-02-17T04:38:12Z) - RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts [17.76606110070648]
We propose RSUniVLM, a unified, end-to-end RS VLM for comprehensive vision understanding across multiple granularities.
RSUniVLM performs effectively in multi-image analysis, including change detection and change captioning.
We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both RS and general domain.
arXiv Detail & Related papers (2024-12-07T15:11:21Z) - GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding [31.01378033872341]
GeoGround is a novel framework that unifies support for horizontal bounding box (HBB), oriented bounding box (OBB), and mask RS visual grounding tasks.
To support model training, we present refGeo, a large-scale RS visual instruction-following dataset containing 161k image-text pairs.
arXiv Detail & Related papers (2024-11-16T05:12:11Z) - LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation [21.91073335335992]
We introduce LHRS-Bot-Nova, an MLLM specialized in understanding remote sensing (RS) images.
LHRS-Bot-Nova features an enhanced vision encoder and a novel bridge layer, enabling efficient visual compression and better language-vision alignment.
Extensive experiments demonstrate superior performance of LHRS-Bot-Nova across various RS image understanding tasks.
arXiv Detail & Related papers (2024-11-14T09:23:40Z) - Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models [3.178739428363249]
We propose a workflow to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform.
Within this framework, we present RSTeller, a multimodal dataset comprising over 1 million RS images, each accompanied by multiple descriptive captions.
arXiv Detail & Related papers (2024-08-27T02:45:26Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - GeoChat: Grounded Large Vision-Language Model for Remote Sensing [65.78360056991247]
We propose GeoChat - the first versatile remote sensing Large Vision-Language Model (VLM) that offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat can not only answer image-level queries but also accept region inputs to hold region-specific dialogue.
GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection.
arXiv Detail & Related papers (2023-11-24T18:59:10Z) - Position-Enhanced Visual Instruction Tuning for Multimodal Large
Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z) - BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs [101.50522135049198]
BuboGPT is a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language.
Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities from a sentence and finds corresponding masks in the image.
Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during interaction with humans.
arXiv Detail & Related papers (2023-07-17T15:51:47Z) - VisionLLM: Large Language Model is also an Open-Ended Decoder for
Vision-Centric Tasks [81.32968995346775]
VisionLLM is a framework for vision-centric tasks that can be flexibly defined and managed using language instructions.
Our model can achieve over 60% mAP on COCO, on par with detection-specific models.
arXiv Detail & Related papers (2023-05-18T17:59:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.