EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor
Image Comprehension in Remote Sensing Domain
- URL: http://arxiv.org/abs/2401.16822v3
- Date: Fri, 8 Mar 2024 15:36:11 GMT
- Title: EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor
Image Comprehension in Remote Sensing Domain
- Authors: Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, Xuerui Mao
- Abstract summary: Multi-modal large language models (MLLMs) have demonstrated remarkable success in vision and visual-language tasks within the natural image domain.
To fill this gap, this paper proposes EarthGPT, a pioneering MLLM that uniformly integrates various multi-sensor RS interpretation tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal large language models (MLLMs) have demonstrated remarkable
success in vision and visual-language tasks within the natural image domain.
Owing to the significant differences between natural and remote sensing (RS)
images, the development of MLLMs in the RS domain is still in its infancy. To
fill this gap, this paper proposes EarthGPT, a pioneering MLLM that uniformly
integrates various multi-sensor RS interpretation tasks for universal RS image
comprehension. EarthGPT develops three key techniques: a visual-enhanced
perception mechanism, a cross-modal mutual comprehension approach, and a
unified instruction-tuning method for multi-sensor, multi-task learning in the
RS domain. More importantly, a large-scale multi-sensor multi-modal RS
instruction-following dataset named MMRS-1M is constructed, comprising over 1M
image-text pairs drawn from 34 diverse existing RS datasets and covering
sensors such as optical, synthetic aperture radar (SAR), and infrared. MMRS-1M
addresses MLLMs' lack of RS expert knowledge and stimulates the development of
MLLMs in the RS domain. Extensive experiments demonstrate EarthGPT's superior
performance on various RS visual interpretation tasks compared with other
specialist models and MLLMs, proving its effectiveness and offering a
versatile paradigm for open-set reasoning tasks.
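As an illustration of the unified instruction-following format such a dataset implies, here is a minimal Python sketch; the field names, tags, and example values are our assumptions, not taken from MMRS-1M itself.

```python
# Hypothetical MMRS-1M-style instruction records (schema assumed, not the
# paper's): one unified format covers multiple tasks and sensor types.
records = [
    {
        "image": "sar/ship_0001.png",         # SAR sensor
        "task": "detection",
        "instruction": "Detect all ships and give their bounding boxes.",
        "response": "ship <box>[120, 44, 188, 97]</box>",
    },
    {
        "image": "optical/airport_0042.jpg",  # optical sensor
        "task": "captioning",
        "instruction": "Describe the scene in one sentence.",
        "response": "An airport with two runways and parked aircraft.",
    },
]

def to_prompt(rec):
    """Render a record as a single instruction-tuning training string."""
    return (f"<image>{rec['image']}</image>\n"
            f"User: {rec['instruction']}\nAssistant: {rec['response']}")

for rec in records:
    print(to_prompt(rec))
```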
Related papers
- EarthMarker: Visual Prompt Learning for Region-level and Point-level Remote Sensing Imagery Comprehension
The first visual prompting model named EarthMarker is proposed, which excels in image-level, region-level, and point-level RS imagery interpretation.
To endow EarthMarker with versatile multi-granularity visual perception abilities, a cross-domain phased learning strategy is developed.
To tackle the lack of RS visual prompting data, a dataset named RSVP featuring multi-modal fine-grained visual prompting instructions is constructed.
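A minimal sketch of what region-level and point-level visual prompting can look like at the pixel level, assuming prompts are drawn directly onto the image; this is our illustration, not EarthMarker's implementation.

```python
import numpy as np

def draw_box(img, x0, y0, x1, y1, value=255):
    """Overlay a rectangle outline as a region-level visual prompt."""
    img[y0, x0:x1 + 1] = value
    img[y1, x0:x1 + 1] = value
    img[y0:y1 + 1, x0] = value
    img[y0:y1 + 1, x1] = value
    return img

def draw_point(img, x, y, radius=2, value=255):
    """Overlay a small square marker as a point-level visual prompt."""
    img[max(y - radius, 0):y + radius + 1,
        max(x - radius, 0):x + radius + 1] = value
    return img

img = np.zeros((64, 64), dtype=np.uint8)  # toy single-channel image
img = draw_box(img, 8, 8, 40, 32)
img = draw_point(img, 50, 50)
```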
arXiv Detail & Related papers (2024-07-18T15:35:00Z)
- Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset
This study introduces RS-4M, a large-scale dataset designed to enable highly efficient MIM training on RS images.
We propose an efficient MIM method, termed SelectiveMAE, which dynamically encodes and reconstructs a subset of patch tokens selected based on their semantic richness.
Experiments show that SelectiveMAE significantly boosts training efficiency by 2.2-2.7 times and enhances the classification, detection, and segmentation performance of the baseline MIM model.
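A toy sketch of richness-based patch selection; the variance score below is a stand-in, since the paper's actual selection criterion may differ.

```python
import numpy as np

def select_patches(image, patch=16, keep_ratio=0.25):
    """Keep only the most 'semantically rich' patch tokens of an image."""
    h, w = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch)
    patches = patches.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    richness = patches.var(axis=1)            # proxy for semantic richness
    k = max(1, int(keep_ratio * len(patches)))
    keep_idx = np.argsort(richness)[-k:]      # indices of the top-k patches
    return patches[keep_idx], keep_idx

img = np.random.rand(64, 64)
tokens, idx = select_patches(img)
print(tokens.shape, idx)                      # (4, 256) and the kept indices
```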
arXiv Detail & Related papers (2024-06-17T15:41:57Z)
- RS-Mamba for Large Remote Sensing Image Dense Prediction
We propose the Remote Sensing Mamba (RSM) for dense prediction tasks in large VHR remote sensing images.
RSM is specifically designed to capture the global context of remote sensing images with linear complexity.
Our model achieves better efficiency and accuracy than transformer-based models on large remote sensing images.
arXiv Detail & Related papers (2024-04-03T12:06:01Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
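A conceptual sketch of wiring the three named components together; the dimensions and module choices are our assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PromptableMLLM(nn.Module):
    """Projects image tokens and visual-prompt coordinates into the LLM's
    embedding space, then concatenates them with the text embeddings."""
    def __init__(self, vis_dim=768, prompt_dim=4, llm_dim=4096):
        super().__init__()
        self.vision_proj = nn.Linear(vis_dim, llm_dim)    # vision encoder -> LLM space
        self.prompt_enc = nn.Linear(prompt_dim, llm_dim)  # (x0, y0, x1, y1) boxes/points

    def forward(self, vis_tokens, prompt_coords, text_embeds):
        v = self.vision_proj(vis_tokens)                  # (B, Nv, llm_dim)
        p = self.prompt_enc(prompt_coords)                # (B, Np, llm_dim)
        return torch.cat([v, p, text_embeds], dim=1)      # sequence fed to the LLM

model = PromptableMLLM()
seq = model(torch.randn(1, 196, 768), torch.randn(1, 2, 4),
            torch.randn(1, 16, 4096))
print(seq.shape)  # torch.Size([1, 214, 4096])
```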
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short in comprehending context involving multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
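A rough sketch of the two-phase flow; `mllm` is a placeholder callable we define for demonstration, not an API from the paper.

```python
def mllm(images, prompt):
    """Stub standing in for any multimodal LLM call."""
    return f"[answer for: {prompt[:40]}...]"

def browse(images, question):
    """Phase 1: skim each image individually for question-relevant notes."""
    return [mllm([img], f"Note what is relevant to: {question}")
            for img in images]

def concentrate(images, notes, question):
    """Phase 2: answer with all images plus the browsing notes as context."""
    context = "\n".join(notes)
    return mllm(images, f"Browsing notes:\n{context}\nQuestion: {question}")

images = ["img_a", "img_b"]
question = "Which airport has more aircraft?"
print(concentrate(images, browse(images, question), question))
```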
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
- LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model
We introduce LHRS-Bot, an MLLM tailored for RS image understanding through a novel vision-language alignment strategy and a curriculum learning method.
Comprehensive experiments demonstrate that LHRS-Bot exhibits a profound understanding of RS images and the ability to perform nuanced reasoning within the RS domain.
arXiv Detail & Related papers (2024-02-04T15:46:43Z)
- SpectralGPT: Spectral Remote Sensing Foundation Model
A universal RS foundation model, named SpectralGPT, is purpose-built to handle spectral RS images using a novel 3D generative pretrained transformer (GPT).
Compared to existing foundation models, SpectralGPT accommodates input images with varying sizes, resolutions, time series, and regions in a progressive training fashion, enabling full utilization of extensive RS big data.
Our evaluation highlights significant performance improvements with pretrained SpectralGPT models, signifying substantial potential in advancing spectral RS big data applications within the field of geoscience.
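A minimal sketch of 3D spatial-spectral tokenization in this spirit; the patch sizes are illustrative, not SpectralGPT's.

```python
import numpy as np

def tokenize_cube(cube, sp=4, ph=8, pw=8):
    """Cut a (bands, H, W) spectral cube into 3D blocks, one token each."""
    b, h, w = cube.shape
    blocks = cube.reshape(b // sp, sp, h // ph, ph, w // pw, pw)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)  # group block indices first
    return blocks.reshape(-1, sp * ph * pw)      # (num_tokens, token_dim)

cube = np.random.rand(12, 64, 64)                # toy 12-band image
tokens = tokenize_cube(cube)
print(tokens.shape)                              # (192, 256)
```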
arXiv Detail & Related papers (2023-11-13T07:09:30Z)
- Feature Guided Masked Autoencoder for Self-supervised Learning in Remote Sensing
Masked AutoEncoder (MAE) has attracted wide attention for pretraining vision transformers in remote sensing.
We propose Feature Guided Masked Autoencoder (FG-MAE): reconstructing a combination of Histograms of Oriented Gradients (HOG) and Normalized Difference Indices (NDI) for multispectral images, and reconstructing HOG for SAR images.
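A minimal sketch of one NDI reconstruction target; the band indices below are illustrative assumptions, not the paper's.

```python
import numpy as np

def normalized_difference(band_a, band_b, eps=1e-6):
    """NDI = (a - b) / (a + b), e.g., an NDVI-like index from NIR and red."""
    return (band_a - band_b) / (band_a + band_b + eps)

ms_patch = np.random.rand(4, 32, 32)     # toy 4-band multispectral patch
ndvi_like = normalized_difference(ms_patch[3], ms_patch[2])  # assume NIR=3, red=2
print(ndvi_like.shape)                   # (32, 32) per-pixel target map
```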
arXiv Detail & Related papers (2023-10-28T09:43:13Z)
- Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval
Cross-modal text-image retrieval has attracted extensive attention for its advantages of flexible input and efficient query.
To cope with the problem of multi-scale scarcity and target redundancy in the RS multimodal retrieval task, we propose a novel asymmetric multimodal feature matching network (AMFMN).
Our model adapts to multi-scale feature inputs, favors multi-source retrieval methods, and can dynamically filter redundant features.
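A toy sketch of dynamic feature gating as one way to "filter redundant features"; AMFMN's actual mechanism is more involved than this.

```python
import torch
import torch.nn as nn

class DynamicGate(nn.Module):
    """Learned gate that down-weights feature channels conditioned on the
    input itself, pushing redundant channels toward zero."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, feats):           # feats: (B, dim)
        return feats * self.gate(feats)

gate = DynamicGate()
out = gate(torch.randn(2, 256))
print(out.shape)                        # torch.Size([2, 256])
```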
arXiv Detail & Related papers (2022-04-21T03:53:19Z)
- Multi-Content Complementation Network for Salient Object Detection in Optical Remote Sensing Images
Salient object detection in optical remote sensing images (RSI-SOD) remains a challenging emerging topic.
We propose a novel Multi-Content Complementation Network (MCCNet) to explore the complementarity of multiple content for RSI-SOD.
In the multi-content complementation module (MCCM), we consider multiple types of features that are critical to RSI-SOD, including foreground features, edge features, background features, and global image-level features.
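A minimal sketch of fusing the four feature types the summary lists; the real module is more elaborate, so this is only our illustration.

```python
import torch
import torch.nn as nn

class ContentFusion(nn.Module):
    """Concatenate foreground, edge, background, and global feature maps,
    then mix them with a 1x1 convolution."""
    def __init__(self, c=64):
        super().__init__()
        self.fuse = nn.Conv2d(4 * c, c, kernel_size=1)

    def forward(self, fg, edge, bg, glb):
        return self.fuse(torch.cat([fg, edge, bg, glb], dim=1))

fusion = ContentFusion()
maps = [torch.randn(1, 64, 32, 32) for _ in range(4)]
print(fusion(*maps).shape)  # torch.Size([1, 64, 32, 32])
```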
arXiv Detail & Related papers (2021-12-02T04:46:40Z)