RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large
Vision-Language Model for Remote Sensing
- URL: http://arxiv.org/abs/2306.11300v5
- Date: Tue, 2 Jan 2024 14:18:02 GMT
- Title: RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large
Vision-Language Model for Remote Sensing
- Authors: Zilun Zhang, Tiancheng Zhao, Yulong Guo, Jianwei Yin
- Abstract summary: We propose a new framework that includes the Domain pre-trained Vision-Language Model (DVLM)
We present an image-text paired dataset in the field of remote sensing (RS), RS5M, which has 5 million RS images with English descriptions.
- Score: 26.71560933421903
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Pre-trained Vision-Language Models (VLMs) utilizing extensive image-text
paired data have demonstrated unprecedented image-text association
capabilities, achieving remarkable results across various downstream tasks. A
critical challenge is how to make use of existing large-scale pre-trained VLMs,
which are trained on common objects, to perform the domain-specific transfer
for accomplishing domain-related downstream tasks. A critical challenge is how
to make use of existing large-scale pre-trained VLMs, which are trained on
common objects, to perform the domain-specific transfer for accomplishing
domain-related downstream tasks. In this paper, we propose a new framework that
includes the Domain pre-trained Vision-Language Model (DVLM), bridging the gap
between the General Vision-Language Model (GVLM) and domain-specific downstream
tasks. Moreover, we present an image-text paired dataset in the field of remote
sensing (RS), RS5M, which has 5 million RS images with English descriptions.
The dataset is obtained from filtering publicly available image-text paired
datasets and captioning label-only RS datasets with pre-trained VLM. These
constitute the first large-scale RS image-text paired dataset. Additionally, we
fine-tuned the CLIP model and tried several Parameter-Efficient Fine-Tuning
methods on RS5M to implement the DVLM. Experimental results show that our
proposed dataset is highly effective for various tasks, and our model GeoRSCLIP
improves upon the baseline or previous state-of-the-art model by $3\%\sim20\%$
in Zero-shot Classification (ZSC), $3\%\sim6\%$ in Remote Sensing Cross-Modal
Text-Image Retrieval (RSCTIR) and $4\%\sim5\%$ in Semantic Localization (SeLo)
tasks. Dataset and models have been released in:
\url{https://github.com/om-ai-lab/RS5M}.
Related papers
- Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset [66.15872913664407]
This study introduces textbfRS-4M, a large-scale dataset designed to enable highly efficient MIM training on RS images.
We propose an efficient MIM method, termed textbfSelectiveMAE, which dynamically encodes and reconstructs a subset of patch tokens selected based on their semantic richness.
Experiments show that SelectiveMAE significantly boosts training efficiency by 2.2-2.7 times and enhances the classification, detection, and segmentation performance of the baseline MIM model.
arXiv Detail & Related papers (2024-06-17T15:41:57Z) - VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization [115.64739269488965]
VimTS enhances the generalization ability of the model by achieving better synergy among different tasks.
We propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm.
For video-level cross-domain adaption, our method even surpasses the previous end-to-end video spotting method in ICDAR2015 video and DSText v2.
arXiv Detail & Related papers (2024-04-30T15:49:03Z) - Bridge the Modality and Capacity Gaps in Vision-Language Model Selection [60.049430086731846]
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names.
A promising zero-shot image classification strategy is selecting the most appropriate Pre-Trained VLM from the VLM Zoo.
We analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection.
We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps.
arXiv Detail & Related papers (2024-03-20T17:54:58Z) - SkyScript: A Large and Semantically Diverse Vision-Language Dataset for
Remote Sensing [14.79627534702196]
We construct a vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags.
With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification.
It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval.
arXiv Detail & Related papers (2023-12-20T09:19:48Z) - MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z) - GeoChat: Grounded Large Vision-Language Model for Remote Sensing [65.78360056991247]
We propose GeoChat - the first versatile remote sensing Large Vision-Language Models (VLMs) that offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat can answer image-level queries but also accepts region inputs to hold region-specific dialogue.
GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection.
arXiv Detail & Related papers (2023-11-24T18:59:10Z) - RSGPT: A Remote Sensing Vision Language Model and Benchmark [7.279747655485913]
We build a high-quality Remote Sensing Image Captioning dataset (RSICap)
This dataset comprises 2,585 human-annotated captions with rich and high-quality information.
We also provide a benchmark evaluation dataset called RSIEval.
arXiv Detail & Related papers (2023-07-28T02:23:35Z) - RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing
Data [14.742224345061487]
We introduce the task of visual grounding for remote sensing data (RSVG)
RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language.
In this work, we construct a large-scale benchmark dataset of RSVG and explore deep learning models for the RSVG task.
arXiv Detail & Related papers (2022-10-23T07:08:22Z) - WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models [2.603259641572195]
We introduce a large-scale multi-modal corpora named WuDaoMM, totally containing more than 650M image-text pairs.
About 600 million pairs of data are collected from multiple webpages in which image and caption present weak correlation.
We also release a base version of WuDaoMM with 5 million strong-correlated image-text pairs, which is sufficient to support the common cross-modal model pre-training.
arXiv Detail & Related papers (2022-03-22T06:12:20Z) - Self-Supervised Representation Learning for RGB-D Salient Object
Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few and un RGB-D datasets to perform pre-training, which make the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.