GeoChat: Grounded Large Vision-Language Model for Remote Sensing
- URL: http://arxiv.org/abs/2311.15826v1
- Date: Fri, 24 Nov 2023 18:59:10 GMT
- Title: GeoChat: Grounded Large Vision-Language Model for Remote Sensing
- Authors: Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das,
Salman Khan, Fahad Shahbaz Khan
- Abstract summary: We propose GeoChat - the first versatile remote sensing Large Vision-Language Models (VLMs) that offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat can not only answer image-level queries but also accepts region inputs to hold region-specific dialogue.
GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection.
- Score: 65.78360056991247
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Large Vision-Language Models (VLMs) have shown great
promise in natural image domains, allowing users to hold a dialogue about given
visual content. However, such general-domain VLMs perform poorly for Remote
Sensing (RS) scenarios, leading to inaccurate or fabricated information when
presented with RS domain-specific queries. Such a behavior emerges due to the
unique challenges introduced by RS imagery. For example, to handle
high-resolution RS imagery with diverse scale changes across categories and
many small objects, region-level reasoning is necessary alongside holistic
scene interpretation. Furthermore, the lack of domain-specific multimodal
instruction following data as well as strong backbone models for RS make it
hard for the models to align their behavior with user queries. To address these
limitations, we propose GeoChat - the first versatile remote sensing VLM that
offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat can not only answer image-level queries but also accepts
region inputs to hold region-specific dialogue. Furthermore, it can visually
ground objects in its responses by referring to their spatial coordinates. To
address the lack of domain-specific datasets, we generate a novel RS multimodal
instruction-following dataset by extending image-text pairs from existing
diverse RS datasets. We establish a comprehensive benchmark for RS multitask
conversations and compare with a number of baseline methods. GeoChat
demonstrates robust zero-shot performance on various RS tasks, e.g., image and
region captioning, visual question answering, scene classification, visually
grounded conversations and referring detection. Our code is available at
https://github.com/mbzuai-oryx/geochat.
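The abstract above describes two capabilities that go beyond image-level chat: region-specific dialogue, where the user passes a spatial region along with the question, and visually grounded answers, where the model refers to objects by their spatial coordinates. The sketch below shows, in plain Python, one way such an interface could be wrapped around a chat model. The prompt template, the normalized [x1, y1, x2, y2] box convention, and the `chat` callable are hypothetical placeholders for illustration only; the actual prompt and coordinate format used by GeoChat is defined in the linked repository.

```python
import re
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]


def region_prompt(question: str, region: Box) -> str:
    """Embed a user-specified region into the text prompt.

    Hypothetical template: the region is serialized as normalized corner
    coordinates so the language model can condition its answer on it.
    """
    x1, y1, x2, y2 = region
    return f"<region>[{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]</region> {question}"


def parse_grounded_boxes(answer: str) -> List[Box]:
    """Extract bounding boxes from a grounded answer.

    Assumes the model emits boxes as bracketed 4-tuples, e.g.
    "two planes [0.12, 0.30, 0.25, 0.41] and [0.55, 0.10, 0.70, 0.22]".
    """
    pattern = r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]"
    return [tuple(float(v) for v in match) for match in re.findall(pattern, answer)]


def ask_about_region(chat: Callable[[str, str], str],
                     image_path: str,
                     question: str,
                     region: Box) -> Tuple[str, List[Box]]:
    """Run one region-specific turn and return the answer plus any grounded boxes."""
    answer = chat(image_path, region_prompt(question, region))
    return answer, parse_grounded_boxes(answer)
```

Normalized coordinates keep region and box descriptions independent of the raw image resolution, which matters for high-resolution RS imagery; a real integration would adopt whatever serialization the released GeoChat checkpoints were trained with.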
Related papers
- GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding [31.01378033872341]
GeoGround is a novel framework that unifies support for horizontal bounding box (HBB), oriented bounding box (OBB), and mask RS visual grounding tasks.
To support model training, we present refGeo, a large-scale RS visual instruction-following dataset containing 161k image-text pairs.
arXiv Detail & Related papers (2024-11-16T05:12:11Z)
- CDChat: A Large Multimodal Model for Remote Sensing Change Description [82.51779045271437]
We introduce a change description instruction dataset that can be utilized to finetune an LMM and provide better change descriptions for RS images.
We show that the LLaVA-1.5 model, with slight modifications, can be finetuned on the change description instruction dataset and achieve favorable performance.
arXiv Detail & Related papers (2024-09-24T17:31:02Z)
- RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models [3.178739428363249]
We propose a workflow to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform.
Within this framework, we present RSTeller, a multimodal dataset comprising over 1 million RS images, each accompanied by multiple descriptive captions.
arXiv Detail & Related papers (2024-08-27T02:45:26Z)
- EarthMarker: Visual Prompt Learning for Region-level and Point-level Remote Sensing Imagery Comprehension [12.9701635989222]
EarthMarker, the first visual prompting model for RS imagery, is proposed; it excels in image-level, region-level, and point-level interpretation.
To endow EarthMarker with versatile multi-granularity visual perception abilities, a cross-domain phased learning strategy is developed.
To tackle the lack of RS visual prompting data, a dataset named RSVP, featuring multi-modal fine-grained visual prompting instructions, is constructed.
arXiv Detail & Related papers (2024-07-18T15:35:00Z)
- Evaluating Tool-Augmented Agents in Remote Sensing Platforms [1.8434042562191815]
Existing benchmarks assume question-answering input templates over predefined image-text data pairs.
We present GeoLLM-QA, a benchmark designed to capture long sequences of verbal, visual, and click-based actions on a real UI platform.
arXiv Detail & Related papers (2024-04-23T20:37:24Z)
- RS-Mamba for Large Remote Sensing Image Dense Prediction [58.12667617617306]
We propose the Remote Sensing Mamba (RSM) for dense prediction tasks in large VHR remote sensing images.
RSM is specifically designed to capture the global context of remote sensing images with linear complexity.
Our model achieves better efficiency and accuracy than transformer-based models on large remote sensing images.
arXiv Detail & Related papers (2024-04-03T12:06:01Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis [48.06425266787859]
This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis.
VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD) and an honest instruction dataset comprising both factual and deceptive questions (HnstD).
In our experiments, VHM significantly outperforms various vision language models on common tasks of scene classification, visual question answering, and visual grounding.
arXiv Detail & Related papers (2024-03-29T14:50:43Z)
- Large Language Models for Captioning and Retrieving Remote Sensing Images [4.499596985198142]
RS-CapRet is a Vision and Language method for remote sensing tasks.
It can generate descriptions for remote sensing images and retrieve images from textual descriptions.
arXiv Detail & Related papers (2024-02-09T15:31:01Z)
- GLaMM: Pixel Grounding Large Multimodal Model [57.91763410032292]
We present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.
GLaMM is flexible enough to accept both textual and optional visual prompts (region of interest) as input.
Our proposed GCG task requires densely grounded concepts in natural scenes at a large scale.
arXiv Detail & Related papers (2023-11-06T18:59:57Z)