SkyScript: A Large and Semantically Diverse Vision-Language Dataset for
Remote Sensing
- URL: http://arxiv.org/abs/2312.12856v1
- Date: Wed, 20 Dec 2023 09:19:48 GMT
- Authors: Zhecheng Wang, Rajanie Prabha, Tianyuan Huang, Jiajun Wu, Ram
Rajagopal
- Abstract summary: We construct a vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags.
With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification.
It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval.
- Score: 14.79627534702196
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Remote sensing imagery, despite its broad applications in helping achieve
Sustainable Development Goals and tackle climate change, has not yet benefited
from the recent advancements of versatile, task-agnostic vision language models
(VLMs). A key reason is that the large-scale, semantically diverse image-text
dataset required for developing VLMs is still absent for remote sensing images.
Unlike natural images, remote sensing images and their associated text
descriptions cannot be efficiently collected from the public Internet at scale.
In this work, we bridge this gap by using geo-coordinates to automatically
connect open, unlabeled remote sensing images with rich semantics covered in
OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language
dataset for remote sensing images, comprising 2.6 million image-text pairs
covering 29K distinct semantic tags. With continual pre-training on this
dataset, we obtain a VLM that surpasses baseline models with a 6.2% average
accuracy gain in zero-shot scene classification across seven benchmark
datasets. It also demonstrates the ability of zero-shot transfer for
fine-grained object attribute classification and cross-modal retrieval. We hope
this dataset can support the advancement of VLMs for various multi-modal tasks
in remote sensing, such as open-vocabulary classification, retrieval,
captioning, and text-to-image synthesis.
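To illustrate the geo-matching idea described above, the following sketch pairs an image's center coordinate with nearby OpenStreetMap tags pulled from the public Overpass API and joins them into a caption. This is a minimal sketch under assumed conventions (the 100 m search radius and the tag-to-caption heuristic are illustrative), not SkyScript's actual construction pipeline.

```python
# Minimal sketch: turn nearby OpenStreetMap tags into an image caption.
# The radius and the tag-to-caption heuristic are illustrative assumptions.
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

def osm_tags_near(lat: float, lon: float, radius_m: int = 100) -> list[dict]:
    """Return the tag dicts of OSM nodes/ways within radius_m of (lat, lon)."""
    query = f"""
    [out:json][timeout:25];
    (
      node(around:{radius_m},{lat},{lon});
      way(around:{radius_m},{lat},{lon});
    );
    out tags;
    """
    resp = requests.post(OVERPASS_URL, data={"data": query}, timeout=30)
    resp.raise_for_status()
    return [el["tags"] for el in resp.json()["elements"] if "tags" in el]

def tags_to_caption(tags_list: list[dict]) -> str:
    """Naively join distinct key/value pairs into a caption string."""
    phrases = sorted({f"{k.replace('_', ' ')}: {v}"
                      for tags in tags_list for k, v in tags.items()})
    return "a satellite image of " + ", ".join(phrases)

# Example: build a caption for an image centered on a given coordinate.
caption = tags_to_caption(osm_tags_near(37.4275, -122.1697))
print(caption)
```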
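The zero-shot scene classification reported above follows the standard CLIP recipe: embed one prompt per class name, embed the image, and pick the class with the highest cosine similarity. Below is a minimal sketch using the open_clip library with a generic pretrained checkpoint; the class list, prompt template, and checkpoint name are assumptions, not the paper's exact setup.

```python
# CLIP-style zero-shot scene classification over remote sensing class names.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

classes = ["airport", "forest", "harbor", "residential area", "farmland"]
prompts = tokenizer([f"a satellite image of a {c}" for c in classes])

image = preprocess(Image.open("scene.png")).unsqueeze(0)  # placeholder path

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(prompts)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    # Cosine similarities between the image and each class prompt.
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(classes[probs.argmax().item()])
```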
Related papers
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we craft a new VEGA dataset tailored to the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- Large Language Models for Captioning and Retrieving Remote Sensing Images [4.499596985198142]
RS-CapRet is a vision-and-language method for remote sensing tasks.
It can generate descriptions for remote sensing images and retrieve images from textual descriptions.
arXiv Detail & Related papers (2024-02-09T15:31:01Z)
- MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images [7.163236160505616]
We propose a novel metadata-collaborative multimodal segmentation network (MetaSegNet) for semantic segmentation of remote sensing images.
Unlike the common model structure that only uses unimodal visual data, we extract key characteristics from freely available remote sensing image metadata.
We construct an image encoder, a text encoder, and a cross-modal attention fusion subnetwork to extract image and text features and apply image-text interaction (a toy fusion module of this kind is sketched after this list).
arXiv Detail & Related papers (2023-12-20T03:16:34Z)
- Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment [61.769441954135246]
We introduce a method to train vision-language models for remote-sensing images without using any textual annotations.
Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language.
arXiv Detail & Related papers (2023-12-12T03:39:07Z)
- MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z)
- Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification.
An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
arXiv Detail & Related papers (2023-10-12T17:59:30Z)
- Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! [103.09776737512077]
SelTDA (Self-Taught Data Augmentation) is a strategy for finetuning large vision language models on small-scale VQA datasets.
It generates question-answer pseudolabels directly conditioned on an image, allowing us to pseudolabel unlabeled images.
We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions.
arXiv Detail & Related papers (2023-06-06T18:00:47Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme (a toy version of this contrastive objective is sketched after this list).
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- MLRSNet: A Multi-label High Spatial Resolution Remote Sensing Dataset for Semantic Scene Understanding [6.880271407391406]
We construct a multi-label high spatial resolution remote sensing dataset named MLRSNet for semantic scene understanding with deep learning.
MLRSNet contains 109,161 samples within 46 scene categories, and each image has at least one of 60 predefined labels.
The experimental results demonstrate that MLRSNet is a significant benchmark for future research.
arXiv Detail & Related papers (2020-10-01T08:03:47Z)
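The cross-modal attention fusion referenced in the MetaSegNet entry above can be sketched as image patch features attending to text token features. A minimal PyTorch version follows; the dimensions and the residual fusion are illustrative assumptions, not the paper's actual architecture.

```python
# Toy cross-modal fusion: image patches (queries) attend to text tokens.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor):
        # img_feats: (B, N_patches, dim); txt_feats: (B, N_tokens, dim)
        fused, _ = self.attn(query=img_feats, key=txt_feats, value=txt_feats)
        return self.norm(img_feats + fused)  # residual fusion

# Example: fuse 196 patch features with 32 metadata-text token features.
fusion = CrossModalFusion()
out = fusion(torch.randn(2, 196, 256), torch.randn(2, 32, 256))
print(out.shape)  # torch.Size([2, 196, 256])
```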
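The dual-encoder contrastive scheme referenced in the noisy text supervision entry above is, at its core, a symmetric InfoNCE objective over in-batch image-text pairs: matched pairs are positives, all other pairings are negatives. A toy version, where the temperature value is an illustrative assumption:

```python
# Symmetric in-batch contrastive (InfoNCE) loss for a dual encoder.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(len(img_emb))        # matched pairs on diagonal
    # Average the image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```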