SkyScript: A Large and Semantically Diverse Vision-Language Dataset for
Remote Sensing
- URL: http://arxiv.org/abs/2312.12856v1
- Date: Wed, 20 Dec 2023 09:19:48 GMT
- Title: SkyScript: A Large and Semantically Diverse Vision-Language Dataset for
Remote Sensing
- Authors: Zhecheng Wang, Rajanie Prabha, Tianyuan Huang, Jiajun Wu, Ram
Rajagopal
- Abstract summary: We construct a vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags.
With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification.
It also demonstrates zero-shot transfer ability for fine-grained object attribute classification and cross-modal retrieval.
- Score: 14.79627534702196
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Remote sensing imagery, despite its broad applications in helping achieve
Sustainable Development Goals and tackle climate change, has not yet benefited
from the recent advancements of versatile, task-agnostic vision language models
(VLMs). A key reason is that the large-scale, semantically diverse image-text
dataset required for developing VLMs is still absent for remote sensing images.
Unlike natural images, remote sensing images and their associated text
descriptions cannot be efficiently collected from the public Internet at scale.
In this work, we bridge this gap by using geo-coordinates to automatically
connect open, unlabeled remote sensing images with rich semantics covered in
OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language
dataset for remote sensing images, comprising 2.6 million image-text pairs
covering 29K distinct semantic tags. With continual pre-training on this
dataset, we obtain a VLM that surpasses baseline models with a 6.2% average
accuracy gain in zero-shot scene classification across seven benchmark
datasets. It also demonstrates zero-shot transfer ability for fine-grained
object attribute classification and cross-modal retrieval. We hope
this dataset can support the advancement of VLMs for various multi-modal tasks
in remote sensing, such as open-vocabulary classification, retrieval,
captioning, and text-to-image synthesis.
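As an illustration of the construction idea described in the abstract, below is a minimal sketch (not the authors' released pipeline) of how geo-coordinates can link an unlabeled image tile to OpenStreetMap tags and how those tags can be rendered into a caption. The tile record, the matched tags, and the caption template are illustrative assumptions.

```python
# Minimal sketch, not the authors' released pipeline: link an unlabeled remote
# sensing tile to OpenStreetMap (OSM) semantics via geo-coordinates and render
# the matched tags as a caption. The tile record, the matched tags, and the
# caption template are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Tile:
    image_path: str
    lon: float  # tile-center longitude
    lat: float  # tile-center latitude


def caption_from_osm_tags(tags: dict) -> str:
    """Render a dict of OSM key=value tags as a short plain-language caption."""
    phrases = [f"{k.replace('_', ' ')}: {v.replace('_', ' ')}" for k, v in tags.items()]
    return "a satellite image of " + ", ".join(phrases)


# Hypothetical OSM features already matched to the tile by a spatial join of
# the tile footprint against an OSM extract (the coordinate-based linking step).
tile = Tile(image_path="tiles/000123.png", lon=-122.14, lat=37.42)
matched_tags = {"leisure": "golf_course", "surface": "grass"}

pair = {"image": tile.image_path, "text": caption_from_osm_tags(matched_tags)}
print(pair)
# {'image': 'tiles/000123.png', 'text': 'a satellite image of leisure: golf course, surface: grass'}
```

Repeating this kind of coordinate-based pairing over many tiles and tags is, at a high level, how an image-text corpus of the sort described in the abstract can be assembled.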
Related papers
- Multilingual Vision-Language Pre-training for the Remote Sensing Domain [4.118895088882213]
Methods based on Contrastive Language-Image Pre-training (CLIP) are now widely used to support vision-and-language tasks involving remote sensing data.
This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model.
Our resulting model, which we named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks.
arXiv Detail & Related papers (2024-10-30T18:13:11Z)
- RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models [3.178739428363249]
We propose a workflow to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform.
Within this framework, we present RSTeller, a multimodal dataset comprising over 1 million RS images, each accompanied by multiple descriptive captions.
arXiv Detail & Related papers (2024-08-27T02:45:26Z)
- VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis [48.06425266787859]
This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis.
VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD) and an honest instruction dataset comprising both factual and deceptive questions (HnstD).
In our experiments, VHM significantly outperforms various vision language models on common tasks of scene classification, visual question answering, and visual grounding.
arXiv Detail & Related papers (2024-03-29T14:50:43Z)
- Bridge the Modality and Capability Gaps in Vision-Language Model Selection [62.26769826687365]
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names.
To better reuse the VLM resource, a promising strategy is selecting appropriate Pre-Trained VLMs from the VLM Zoo.
We analyze two inherent challenges in assessing a VLM's ability in this Language-Only VLM Selection setting.
We propose VLM Selection With gAp Bridging to mitigate the negative impact of these two gaps.
arXiv Detail & Related papers (2024-03-20T17:54:58Z)
- Large Language Models for Captioning and Retrieving Remote Sensing Images [4.499596985198142]
RS-CapRet is a vision-and-language method for remote sensing tasks.
It can generate descriptions for remote sensing images and retrieve images from textual descriptions.
arXiv Detail & Related papers (2024-02-09T15:31:01Z)
- MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images [7.0622873873577054]
We propose a novel metadata-collaborative segmentation network (MetaSegNet) for semantic segmentation of remote sensing images.
Unlike the common model structure that only uses unimodal visual data, we extract key characteristics from freely available remote sensing image metadata.
We construct an image encoder, a text encoder, and a cross-modal attention fusion subnetwork to extract image and text features.
arXiv Detail & Related papers (2023-12-20T03:16:34Z)
- Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment [61.769441954135246]
We introduce a method to train vision-language models for remote-sensing images without using any textual annotations.
Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language.
arXiv Detail & Related papers (2023-12-12T03:39:07Z)
- MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z)
- Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! [103.09776737512077]
SelTDA (Self-Taught Data Augmentation) is a strategy for fine-tuning large vision language models on small-scale VQA datasets.
It generates question-answer pseudolabels directly conditioned on an image, allowing us to pseudolabel unlabeled images.
We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions.
arXiv Detail & Related papers (2023-06-06T18:00:47Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
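The FIBER entry above describes pushing multimodal fusion into the uni-modal backbones rather than stacking dedicated fusion layers on top. Below is a minimal sketch of that idea (not the FIBER implementation; module layout, dimensions, and the toy inputs are assumptions): a transformer block with an optional cross-attention step over the other modality's tokens.

```python
# Minimal sketch, not the FIBER implementation: a transformer block with an
# optional cross-attention step so image tokens can attend to text tokens
# (or vice versa) inside the backbone. Dimensions and layout are assumptions.
from typing import Optional

import torch
import torch.nn as nn


class BlockWithCrossAttention(nn.Module):
    """Self-attention block augmented with in-backbone cross-modal attention."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, other: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Uni-modal self-attention over this modality's tokens.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Fusion step inside the backbone: attend to the other modality's tokens.
        if other is not None:
            h = self.norm2(x)
            x = x + self.cross_attn(h, other, other, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))


# Toy usage: 196 image patch tokens attend to 32 text tokens within one block.
image_tokens = torch.randn(2, 196, 256)
text_tokens = torch.randn(2, 32, 256)
print(BlockWithCrossAttention()(image_tokens, other=text_tokens).shape)  # torch.Size([2, 196, 256])
```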
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.