VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding
- URL: http://arxiv.org/abs/2406.12384v2
- Date: Mon, 11 Nov 2024 17:25:20 GMT
- Title: VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding
- Authors: Xiang Li, Jian Ding, Mohamed Elhoseiny
- Abstract summary: We present a Versatile vision-language Benchmark for Remote Sensing image understanding, termed VRSBench.
This benchmark comprises 29,614 images, with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs.
We further evaluated state-of-the-art models on this benchmark for three vision-language tasks: image captioning, visual grounding, and visual question answering.
- Score: 41.74095171149082
- Abstract: We introduce a new benchmark designed to advance the development of general-purpose, large-scale vision-language models for remote sensing images. Although several vision-language datasets in remote sensing have been proposed to pursue this goal, existing datasets are typically tailored to single tasks, lack detailed object information, or suffer from inadequate quality control. Exploring these improvement opportunities, we present a Versatile vision-language Benchmark for Remote Sensing image understanding, termed VRSBench. This benchmark comprises 29,614 images, with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs. It facilitates the training and evaluation of vision-language models across a broad spectrum of remote sensing image understanding tasks. We further evaluated state-of-the-art models on this benchmark for three vision-language tasks: image captioning, visual grounding, and visual question answering. Our work aims to significantly contribute to the development of advanced vision-language models in the field of remote sensing. The data and code can be accessed at https://github.com/lx709/VRSBench.
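For orientation, the sketch below shows how one might iterate over VRSBench-style annotations for the three benchmark tasks (captioning, grounding, VQA). The file names and JSON fields used here (captions.json, references.json, vqa.json, image_id, caption, ref_expr, bbox, question, answer) are illustrative assumptions, not the released schema; the repository linked above documents the actual format.

```python
import json
from pathlib import Path

# Minimal sketch of iterating over VRSBench-style annotations.
# File names and fields below are hypothetical placeholders, not the
# dataset's documented schema; see https://github.com/lx709/VRSBench.

DATA_DIR = Path("VRSBench")  # assumed local copy of the released annotations


def load_json(name: str):
    """Load one hypothetical annotation file from the dataset directory."""
    with open(DATA_DIR / name, encoding="utf-8") as f:
        return json.load(f)


def main():
    captions = load_json("captions.json")      # image captioning
    references = load_json("references.json")  # visual grounding
    qa_pairs = load_json("vqa.json")           # visual question answering

    # Image captioning: one human-verified detailed caption per image.
    for item in captions[:3]:
        print(item["image_id"], item["caption"][:80])

    # Visual grounding: referring expression paired with a bounding box.
    for item in references[:3]:
        print(item["image_id"], item["ref_expr"], item["bbox"])

    # VQA: question-answer pairs grounded in the image.
    for item in qa_pairs[:3]:
        print(item["image_id"], item["question"], "->", item["answer"])


if __name__ == "__main__":
    main()
```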
Related papers
- Aquila: A Hierarchically Aligned Visual-Language Model for Enhanced Remote Sensing Image Comprehension [6.29665399879184]
We present Aquila, an advanced visual language foundation model for remote sensing images.
Aquila enables richer visual feature representation and more precise visual-language feature alignment.
We validate the effectiveness of Aquila through extensive quantitative experiments and qualitative analyses.
arXiv Detail & Related papers (2024-11-09T05:31:56Z)
- Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations [5.065947993017157]
This study introduces an approach to curate vision-language datasets by employing an image decoding machine learning model.
We amassed approximately 9.6 million vision-language pairs from VHR imagery.
The resultant model outperformed counterparts that did not leverage publicly available vision-language datasets.
arXiv Detail & Related papers (2024-09-11T06:36:08Z)
- VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis [48.06425266787859]
This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis.
VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD) and an honest instruction dataset comprising both factual and deceptive questions (HnstD).
In our experiments, VHM significantly outperforms various vision language models on common tasks of scene classification, visual question answering, and visual grounding.
arXiv Detail & Related papers (2024-03-29T14:50:43Z)
- Large Language Models for Captioning and Retrieving Remote Sensing Images [4.499596985198142]
RS-CapRet is a Vision and Language method for remote sensing tasks.
It can generate descriptions for remote sensing images and retrieve images from textual descriptions.
arXiv Detail & Related papers (2024-02-09T15:31:01Z)
- SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing [14.79627534702196]
We construct a vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags.
With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models by a 6.2% average accuracy gain in zero-shot scene classification (a generic zero-shot evaluation sketch follows this entry).
It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval.
arXiv Detail & Related papers (2023-12-20T09:19:48Z)
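The SkyScript entry above reports gains on zero-shot scene classification. As a generic illustration of how such an evaluation works with a CLIP-style model, the sketch below scores one image against class-name prompts. The checkpoint, class labels, prompt template, and image path are placeholders, not the SkyScript-trained model or its label set.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic CLIP-style zero-shot scene classification sketch.
# Checkpoint, labels, and image path are placeholders for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["airport", "harbor", "farmland", "residential area"]  # placeholder labels
prompts = [f"a satellite image of a {c}" for c in classes]
image = Image.open("scene.png")  # placeholder remote sensing image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

print(dict(zip(classes, probs.squeeze(0).tolist())))
```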
- MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images [7.0622873873577054]
We propose a novel metadata-collaborative segmentation network (MetaSegNet) for semantic segmentation of remote sensing images.
Unlike the common model structure that uses only unimodal visual data, we extract key characteristics from freely available remote sensing image metadata.
We construct an image encoder, a text encoder, and a cross-modal attention fusion subnetwork to extract and fuse image and text features (a generic sketch of such a fusion block follows this entry).
arXiv Detail & Related papers (2023-12-20T03:16:34Z)
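The MetaSegNet entry above mentions an image encoder, a text encoder, and a cross-modal attention fusion subnetwork. Below is a minimal, generic sketch of a cross-modal attention fusion block; the dimensions, residual design, and attention direction (image tokens attending to text tokens) are assumptions for illustration, not the paper's published architecture.

```python
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Generic cross-modal attention fusion block: image tokens attend to
    text (metadata) tokens. Illustrative only; not MetaSegNet's design."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, N_img, dim) from an image encoder
        # text_tokens:  (B, N_txt, dim) from a text encoder over metadata
        fused, _ = self.attn(query=image_tokens, key=text_tokens, value=text_tokens)
        return self.norm(image_tokens + fused)  # residual fusion


# Usage with random placeholder features
img = torch.randn(2, 1024, 256)  # e.g. flattened feature-map tokens
txt = torch.randn(2, 16, 256)    # e.g. encoded metadata tokens
out = CrossModalAttentionFusion()(img, txt)
print(out.shape)  # torch.Size([2, 1024, 256])
```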
- Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment [61.769441954135246]
We introduce a method to train vision-language models for remote-sensing images without using any textual annotations.
Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language.
arXiv Detail & Related papers (2023-12-12T03:39:07Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation [57.12508968239015]
This work presents a transformer-based vision-and-language navigation (VLN) agent.
It uses two different visual encoders -- a scene classification network and an object detector.
Scene features contribute high-level contextual information that supports object-level processing.
arXiv Detail & Related papers (2021-10-27T03:29:34Z)
- One-Shot Object Affordance Detection in the Wild [76.46484684007706]
Affordance detection refers to identifying the potential action possibilities of objects in an image.
We devise a One-Shot Affordance Detection Network (OSAD-Net) that estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images.
With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods.
arXiv Detail & Related papers (2021-08-08T14:53:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.