LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model
- URL: http://arxiv.org/abs/2402.02544v4
- Date: Tue, 16 Jul 2024 01:40:34 GMT
- Title: LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model
- Authors: Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, Pengfeng Xiao,
- Abstract summary: We introduce LHRS-Bot, an MLLM tailored for RS image understanding through a novel vision-language alignment strategy and a curriculum learning method.
Comprehensive experiments demonstrate that LHRS-Bot exhibits a profound understanding of RS images and the ability to perform nuanced reasoning within the RS domain.
- Score: 10.280417075859141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The revolutionary capabilities of large language models (LLMs) have paved the way for multimodal large language models (MLLMs) and fostered diverse applications across various specialized domains. In the remote sensing (RS) field, however, the diverse geographical landscapes and varied objects in RS imagery are not adequately considered in recent MLLM endeavors. To bridge this gap, we construct a large-scale RS image-text dataset, LHRS-Align, and an informative RS-specific instruction dataset, LHRS-Instruct, leveraging the extensive volunteered geographic information (VGI) and globally available RS images. Building on this foundation, we introduce LHRS-Bot, an MLLM tailored for RS image understanding through a novel multi-level vision-language alignment strategy and a curriculum learning method. Additionally, we introduce LHRS-Bench, a benchmark for thoroughly evaluating MLLMs' abilities in RS image understanding. Comprehensive experiments demonstrate that LHRS-Bot exhibits a profound understanding of RS images and the ability to perform nuanced reasoning within the RS domain.
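The abstract mentions a multi-level vision-language alignment strategy and curriculum learning but gives no implementation detail. As a loose illustration of what such a staged schedule could look like, the sketch below steps from caption alignment to instruction tuning; the stage names, dataset mixes, unfrozen modules, and epoch counts are assumptions, not the authors' released configuration.

```python
# A minimal sketch of a curriculum-style training schedule for an RS MLLM.
# Stage names, dataset mixes, unfrozen modules, and epoch counts are assumptions
# inferred from the abstract (LHRS-Align for alignment, LHRS-Instruct for
# instruction tuning), not the authors' released configuration.

from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    dataset: str     # corpus that drives this stage
    trainable: list  # modules left unfrozen in this stage
    epochs: int

# Curriculum: start with simple image-caption alignment, then move to
# increasingly complex, instruction-style supervision.
CURRICULUM = [
    Stage("vision-language alignment", "LHRS-Align", ["vision_projector"], 1),
    Stage("multi-task pretraining", "LHRS-Align + public RS captions",
          ["vision_projector", "llm_lora"], 1),
    Stage("instruction tuning", "LHRS-Instruct",
          ["vision_projector", "llm_lora"], 2),
]

def run_curriculum(train_one_epoch):
    """Run each stage in order; the caller supplies train_one_epoch(stage)."""
    for stage in CURRICULUM:
        print(f"[stage] {stage.name}: dataset={stage.dataset}, trainable={stage.trainable}")
        for _ in range(stage.epochs):
            train_one_epoch(stage)

if __name__ == "__main__":
    run_curriculum(lambda stage: None)  # placeholder trainer
```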
Related papers
- CDChat: A Large Multimodal Model for Remote Sensing Change Description [82.51779045271437]
We introduce a change description instruction dataset that can be utilized to finetune an LMM and provide better change descriptions for RS images.
We show that the LLaVA-1.5 model, with slight modifications, can be finetuned on the change description instruction dataset and achieves favorable performance.
arXiv Detail & Related papers (2024-09-24T17:31:02Z)
- RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models [3.178739428363249]
We propose a workflow to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform.
Within this framework, we present RSTeller, a multimodal dataset comprising over 1 million RS images, each accompanied by multiple descriptive captions.
arXiv Detail & Related papers (2024-08-27T02:45:26Z)
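As a rough illustration of an OSM-to-caption workflow in the spirit of RSTeller (not its actual pipeline), the sketch below turns OSM tags that intersect an image footprint into an LLM captioning prompt; the tag selection, prompt wording, and the generate_caption stub are all assumptions.

```python
# Hypothetical sketch of an OSM-tags -> caption workflow. The tag set, prompt
# wording, and the generate_caption stub are illustrative assumptions; they are
# not RSTeller's actual pipeline or prompts.

def osm_tags_to_prompt(image_id: str, features: list) -> str:
    """Turn OSM features that fall inside an image footprint into an LLM prompt."""
    lines = []
    for feat in features:
        tags = feat.get("tags", {})
        name = tags.get("name", "unnamed")
        kind = tags.get("landuse") or tags.get("natural") or tags.get("highway") or "feature"
        lines.append(f"- {kind}: {name}")
    feature_block = "\n".join(lines) if lines else "- no mapped features"
    return (
        f"You are describing remote sensing image {image_id}.\n"
        f"The following OpenStreetMap features intersect its footprint:\n"
        f"{feature_block}\n"
        f"Write one factual, fluent caption describing the scene."
    )

def generate_caption(prompt: str) -> str:
    # Placeholder for a call to any instruction-tuned LLM client.
    raise NotImplementedError("plug in your LLM client here")

if __name__ == "__main__":
    demo_features = [
        {"tags": {"landuse": "farmland"}},
        {"tags": {"highway": "motorway", "name": "A1"}},
    ]
    print(osm_tags_to_prompt("tile_000123", demo_features))
```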
- Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework, LLM-Morph, for aligning the deep features from different modal medical images.
Specifically, we first use a CNN encoder to extract deep visual features from cross-modal image pairs, then use a first adapter to adjust these tokens and apply LoRA to fine-tune the weights of the pre-trained LLM.
Finally, for the alignment of tokens, four further adapters transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task.
arXiv Detail & Related papers (2024-08-20T09:58:30Z)
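The LLM-Morph summary above describes a fairly concrete coarse-to-fine flow. The toy PyTorch module below illustrates that flow only: a two-layer transformer stands in for a LoRA-tuned LLM, and every layer size is an assumption rather than the authors' architecture.

```python
# Toy sketch of the described flow: CNN features -> adapter -> (stand-in) LLM
# -> per-scale adapters -> multi-scale deformation fields. Not LLM-Morph itself.

import torch
import torch.nn as nn

class TinyMDIRSketch(nn.Module):
    def __init__(self, dim=256, scales=(8, 16, 32)):
        super().__init__()
        # CNN encoder over the concatenated moving/fixed image pair (2 channels).
        self.encoder = nn.Sequential(
            nn.Conv2d(2, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.GELU(),
        )
        self.in_adapter = nn.Linear(dim, dim)  # "first adapter" on visual tokens
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.llm_stub = nn.TransformerEncoder(enc_layer, num_layers=2)
        # One adapter per scale, each predicting a 2-channel (dx, dy) field.
        self.scale_adapters = nn.ModuleList(nn.Linear(dim, 2) for _ in scales)
        self.scales = scales

    def forward(self, moving, fixed):
        feat = self.encoder(torch.cat([moving, fixed], dim=1))     # B, C, H/4, W/4
        b, c, h, w = feat.shape
        tokens = self.in_adapter(feat.flatten(2).transpose(1, 2))  # B, HW, C
        tokens = self.llm_stub(tokens)
        fields = []
        for scale, adapter in zip(self.scales, self.scale_adapters):
            field = adapter(tokens).transpose(1, 2).reshape(b, 2, h, w)
            fields.append(nn.functional.interpolate(
                field, size=(scale, scale), mode="bilinear", align_corners=False))
        return fields  # coarse-to-fine deformation fields (smallest grid first)

if __name__ == "__main__":
    model = TinyMDIRSketch()
    flows = model(torch.randn(1, 1, 128, 128), torch.randn(1, 1, 128, 128))
    print([f.shape for f in flows])
```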
- EarthMarker: Visual Prompt Learning for Region-level and Point-level Remote Sensing Imagery Comprehension [12.9701635989222]
The first visual prompting model, named EarthMarker, is proposed; it excels in image-level, region-level, and point-level RS imagery interpretation.
To endow EarthMarker with versatile multi-granularity visual perception abilities, a cross-domain phased learning strategy is developed.
To tackle the lack of RS visual prompting data, a dataset named RSVP featuring multi-modal, fine-grained visual prompting instructions is constructed.
arXiv Detail & Related papers (2024-07-18T15:35:00Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- OpticalRS-4M: Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset [66.15872913664407]
We present a new pre-training pipeline for RS models, featuring the creation of a large-scale RS dataset and an efficient MIM approach.
We curated a high-quality dataset named OpticalRS-4M by collecting publicly available RS datasets and processing them through exclusion, slicing, and deduplication.
Experiments demonstrate that OpticalRS-4M significantly improves classification, detection, and segmentation performance, while the proposed MIM method, SelectiveMAE, more than doubles training efficiency.
arXiv Detail & Related papers (2024-06-17T15:41:57Z)
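The OpticalRS-4M summary above names slicing and deduplication as curation steps. The sketch below shows one minimal way such steps could be implemented; the tile size and the coarse hash used to collapse near-duplicates are illustrative choices, not the paper's recipe.

```python
# Hypothetical sketch of tile slicing + hash-based deduplication. The tile size
# and the coarse "perceptual hash" are illustrative, not the OpticalRS-4M recipe.

import hashlib
import numpy as np

def slice_tiles(image: np.ndarray, tile: int = 256):
    """Cut an H x W x C image into non-overlapping tile x tile patches."""
    h, w = image.shape[:2]
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            yield image[y:y + tile, x:x + tile]

def coarse_hash(patch: np.ndarray, bins: int = 8) -> str:
    """Hash a heavily downsampled, quantized patch so near-duplicates collide."""
    small = patch[::patch.shape[0] // bins, ::patch.shape[1] // bins]
    quantized = (small.astype(np.float32) / 32).astype(np.uint8)
    return hashlib.sha1(quantized.tobytes()).hexdigest()

def deduplicate(images):
    """Yield only tiles whose coarse hash has not been seen before."""
    seen = set()
    for image in images:
        for patch in slice_tiles(image):
            key = coarse_hash(patch)
            if key not in seen:
                seen.add(key)
                yield patch

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scene = rng.integers(0, 255, size=(512, 512, 3), dtype=np.uint8)
    kept = list(deduplicate([scene, scene]))  # second copy is fully removed
    print(len(kept))  # 4 tiles from the first scene only
```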
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
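The Draw-and-Understand summary above describes connecting a vision encoder, a visual prompt encoder, and an LLM. The toy module below illustrates that wiring only: visual, prompt (box), and text tokens are projected into a shared space and fed to a small transformer that stands in for the LLM; all dimensions are assumptions.

```python
# Toy sketch of wiring a vision encoder, a visual prompt encoder, and an LLM.
# Module sizes are assumptions; this is not the Draw-and-Understand model.

import torch
import torch.nn as nn

class PromptableMLLMSketch(nn.Module):
    def __init__(self, vis_dim=512, prompt_dim=4, llm_dim=768, vocab=32000):
        super().__init__()
        self.vision_proj = nn.Linear(vis_dim, llm_dim)  # vision features -> LLM space
        self.prompt_encoder = nn.Sequential(            # encodes (x1, y1, x2, y2) boxes
            nn.Linear(prompt_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim),
        )
        self.text_embed = nn.Embedding(vocab, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=12, batch_first=True)
        self.llm_stub = nn.TransformerEncoder(layer, num_layers=2)  # stands in for the LLM

    def forward(self, image_feats, boxes, text_ids):
        v = self.vision_proj(image_feats)  # B, Nv, llm_dim
        p = self.prompt_encoder(boxes)     # B, Np, llm_dim
        t = self.text_embed(text_ids)      # B, Nt, llm_dim
        return self.llm_stub(torch.cat([v, p, t], dim=1))

if __name__ == "__main__":
    m = PromptableMLLMSketch()
    out = m(torch.randn(1, 196, 512), torch.rand(1, 2, 4), torch.randint(0, 32000, (1, 8)))
    print(out.shape)  # (1, 196 + 2 + 8, 768)
```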
- EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain [11.902077343294707]
Multi-modal large language models (MLLMs) have demonstrated remarkable success in vision and visual-language tasks within the natural image domain.
To fill this gap, a pioneering MLLM named EarthGPT, which uniformly integrates various multi-sensor RS interpretation tasks, is proposed in this paper.
arXiv Detail & Related papers (2024-01-30T08:57:48Z)
- GeoChat: Grounded Large Vision-Language Model for Remote Sensing [65.78360056991247]
We propose GeoChat, the first versatile remote sensing Large Vision-Language Model (VLM) that offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat can not only answer image-level queries but also accept region inputs to hold region-specific dialogue.
GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection.
arXiv Detail & Related papers (2023-11-24T18:59:10Z)
- CMID: A Unified Self-Supervised Learning Framework for Remote Sensing Image Understanding [20.2438336674081]
Contrastive Mask Image Distillation (CMID) is capable of learning representations with both global semantic separability and local spatial perceptibility.
CMID is compatible with both convolutional neural networks (CNNs) and vision transformers (ViTs).
Models pre-trained using CMID achieve better performance than other state-of-the-art SSL methods on multiple downstream tasks.
arXiv Detail & Related papers (2023-04-19T13:58:31Z)
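The CMID summary above pairs global semantic separability with local spatial perceptibility. One plausible reading is a global contrastive loss combined with masked patch-level distillation; the sketch below illustrates that combination with an assumed loss weight, temperature, and MSE target, and is not CMID's exact objective.

```python
# Sketch of a global contrastive loss plus masked patch-level distillation,
# in the spirit of CMID. Loss weight, temperature, and the MSE target are assumptions.

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.2):
    """Global contrastive loss between two augmented views (B, D each)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature  # B x B similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def masked_distillation(student_patches, teacher_patches, mask):
    """Regress teacher patch features at masked positions only (B, N, D; mask: B, N)."""
    diff = (student_patches - teacher_patches) ** 2
    return (diff.mean(-1) * mask).sum() / mask.sum().clamp(min=1)

def cmid_style_loss(g1, g2, student_patches, teacher_patches, mask, w=1.0):
    return info_nce(g1, g2) + w * masked_distillation(student_patches, teacher_patches, mask)

if __name__ == "__main__":
    B, N, D = 4, 196, 256
    loss = cmid_style_loss(
        torch.randn(B, D), torch.randn(B, D),
        torch.randn(B, N, D), torch.randn(B, N, D),
        (torch.rand(B, N) > 0.4).float(),
    )
    print(loss.item())
```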
- RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data [14.742224345061487]
We introduce the task of visual grounding for remote sensing data (RSVG).
RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language.
In this work, we construct a large-scale benchmark dataset of RSVG and explore deep learning models for the RSVG task.
arXiv Detail & Related papers (2022-10-23T07:08:22Z)
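The RSVG summary above defines the task interface: an RS image plus a natural-language referring expression in, a localized box out. The toy model below illustrates that interface with naive additive fusion; it is a placeholder baseline for exposition, not a model from the RSVG paper.

```python
# Toy grounder showing the RSVG task interface (image + expression -> box).
# The additive-fusion architecture is an illustrative assumption.

import torch
import torch.nn as nn

class ToyGrounder(nn.Module):
    def __init__(self, vocab=10000, dim=256):
        super().__init__()
        self.visual = nn.Sequential(
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.text = nn.EmbeddingBag(vocab, dim)  # mean-pooled expression embedding
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 4))

    def forward(self, image, token_ids):
        fused = self.visual(image) + self.text(token_ids)  # simple additive fusion
        return self.head(fused).sigmoid()                  # normalized (cx, cy, w, h)

if __name__ == "__main__":
    model = ToyGrounder()
    box = model(torch.randn(1, 3, 256, 256), torch.randint(0, 10000, (1, 6)))
    print(box)  # predicted box for the tokenized referring expression
```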