GeoChat: Grounded Large Vision-Language Model for Remote Sensing
- URL: http://arxiv.org/abs/2311.15826v1
- Date: Fri, 24 Nov 2023 18:59:10 GMT
- Title: GeoChat: Grounded Large Vision-Language Model for Remote Sensing
- Authors: Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das,
Salman Khan, Fahad Shahbaz Khan
- Abstract summary: We propose GeoChat - the first versatile remote sensing Large Vision-Language Model (VLM) that offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat can not only answer image-level queries but also accept region inputs to hold region-specific dialogue.
GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection.
- Score: 65.78360056991247
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Large Vision-Language Models (VLMs) have shown great
promise in natural image domains, allowing users to hold a dialogue about given
visual content. However, such general-domain VLMs perform poorly for Remote
Sensing (RS) scenarios, leading to inaccurate or fabricated information when
presented with RS domain-specific queries. Such a behavior emerges due to the
unique challenges introduced by RS imagery. For example, to handle
high-resolution RS imagery with diverse scale changes across categories and
many small objects, region-level reasoning is necessary alongside holistic
scene interpretation. Furthermore, the lack of domain-specific multimodal
instruction following data as well as strong backbone models for RS make it
hard for the models to align their behavior with user queries. To address these
limitations, we propose GeoChat - the first versatile remote sensing VLM that
offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat can not only answer image-level queries but also accept
region inputs to hold region-specific dialogue. Furthermore, it can visually
ground objects in its responses by referring to their spatial coordinates. To
address the lack of domain-specific datasets, we generate a novel RS multimodal
instruction-following dataset by extending image-text pairs from existing
diverse RS datasets. We establish a comprehensive benchmark for RS multitask
conversations and compare with a number of baseline methods. GeoChat
demonstrates robust zero-shot performance on various RS tasks, e.g., image and
region captioning, visual question answering, scene classification, visually
grounded conversations and referring detection. Our code is available at
https://github.com/mbzuai-oryx/geochat.
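As a rough illustration of the grounded responses described above, the sketch below extracts object mentions paired with bounding boxes from a model's text reply and scales them to pixel coordinates. The box serialization, the normalized-coordinate assumption, and the helper name `parse_grounded_response` are assumptions made for illustration only; the actual prompt and output templates used by GeoChat may differ (see the repository).

```python
import re
from typing import List, Tuple

# Assumed (hypothetical) serialization: the reply mentions an object and follows it
# with a normalized box, e.g. "a storage tank [0.10, 0.20, 0.25, 0.40]".
# GeoChat's real output format may differ; this is only a parsing sketch.
_BOX_PATTERN = re.compile(
    r"(?P<phrase>[\w\s\-]+?)\s*\[\s*"
    r"(?P<x1>\d*\.?\d+),\s*(?P<y1>\d*\.?\d+),\s*"
    r"(?P<x2>\d*\.?\d+),\s*(?P<y2>\d*\.?\d+)\s*\]"
)

def parse_grounded_response(
    response: str, image_width: int, image_height: int
) -> List[Tuple[str, Tuple[float, float, float, float]]]:
    """Return (phrase, pixel-space box) pairs found in a grounded text reply.

    The phrase heuristic is naive: it keeps whatever text precedes each box.
    """
    results = []
    for m in _BOX_PATTERN.finditer(response):
        x1, y1, x2, y2 = (float(m.group(k)) for k in ("x1", "y1", "x2", "y2"))
        # Scale normalized [0, 1] coordinates to pixel coordinates.
        box = (x1 * image_width, y1 * image_height,
               x2 * image_width, y2 * image_height)
        results.append((m.group("phrase").strip(), box))
    return results

if __name__ == "__main__":
    reply = ("There is a storage tank [0.10, 0.20, 0.25, 0.40] "
             "next to a road [0.00, 0.45, 1.00, 0.55].")
    for phrase, box in parse_grounded_response(reply, 1024, 1024):
        print(f"{phrase}: {box}")
```

A downstream application could feed the recovered boxes to a plotting or GIS tool to visualize the model's referring-detection output on the original high-resolution RS image.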
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts [17.76606110070648]
We propose RSUniVLM, a unified, end-to-end RS VLM for comprehensive vision understanding across multiple granularities.
RSUniVLM performs effectively in multi-image analysis, including change detection and change captioning.
We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both the RS and general domains.
arXiv Detail & Related papers (2024-12-07T15:11:21Z) - GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding [31.01378033872341]
GeoGround is a novel framework that unifies support for horizontal bounding box (HBB), oriented bounding box (OBB), and mask RS visual grounding tasks.
To support model training, we present refGeo, a large-scale RS visual instruction-following dataset containing 161k image-text pairs.
arXiv Detail & Related papers (2024-11-16T05:12:11Z) - CDChat: A Large Multimodal Model for Remote Sensing Change Description [82.51779045271437]
We introduce a change description instruction dataset that can be utilized to finetune an LMM and provide better change descriptions for RS images.
We show that the LLaVA-1.5 model, with slight modifications, can be finetuned on the change description instruction dataset and achieve favorable performance.
arXiv Detail & Related papers (2024-09-24T17:31:02Z) - RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models [5.981597754991408]
We propose a workflow to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform.
Within this framework, we present RSTeller, a multimodal dataset comprising over 1.3 million RS images, each accompanied by two descriptive captions.
arXiv Detail & Related papers (2024-08-27T02:45:26Z) - EarthMarker: A Visual Prompting Multi-modal Large Language Model for Remote Sensing [12.9701635989222]
It is difficult to deliver information in complicated remote sensing (RS) scenarios using plain language instructions alone.
EarthMarker is capable of interpreting RS imagery at the image, region, and point levels by leveraging visual prompts.
arXiv Detail & Related papers (2024-07-18T15:35:00Z) - Evaluating Tool-Augmented Agents in Remote Sensing Platforms [1.8434042562191815]
Existing benchmarks assume question-answering input templates over predefined image-text data pairs.
We present GeoLLM-QA, a benchmark designed to capture long sequences of verbal, visual, and click-based actions on a real UI platform.
arXiv Detail & Related papers (2024-04-23T20:37:24Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis [48.06425266787859]
This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis.
VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD) and an honest instruction dataset comprising both factual and deceptive questions (HnstD).
In our experiments, VHM significantly outperforms various vision language models on common tasks of scene classification, visual question answering, and visual grounding.
arXiv Detail & Related papers (2024-03-29T14:50:43Z) - GLaMM: Pixel Grounding Large Multimodal Model [57.91763410032292]
We present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.
GLaMM is flexible enough to accept both textual and optional visual prompts (region of interest) as input.
Our proposed GCG task requires densely grounded concepts in natural scenes at a large scale.
arXiv Detail & Related papers (2023-11-06T18:59:57Z)