MLRSNet: A Multi-label High Spatial Resolution Remote Sensing Dataset
for Semantic Scene Understanding
- URL: http://arxiv.org/abs/2010.00243v1
- Date: Thu, 1 Oct 2020 08:03:47 GMT
- Title: MLRSNet: A Multi-label High Spatial Resolution Remote Sensing Dataset
for Semantic Scene Understanding
- Authors: Xiaoman Qi, PanPan Zhu, Yuebin Wang, Liqiang Zhang, Junhuan Peng,
Mengfan Wu, Jialong Chen, Xudong Zhao, Ning Zang, P.Takis Mathiopoulos
- Abstract summary: We construct a multi-label high spatial resolution remote sensing dataset named MLRSNet for semantic scene understanding with deep learning.
MLRSNet contains 109,161 samples within 46 scene categories, and each image has at least one of 60 predefined labels.
The experimental results demonstrate that MLRSNet is a significant benchmark for future research.
- Score: 6.880271407391406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To better understand scene images in the field of remote sensing, multi-label
annotation of scene images is necessary. Moreover, to enhance the performance
of deep learning models for dealing with semantic scene understanding tasks, it
is vital to train them on large-scale annotated data. However, most existing
datasets are annotated with a single label, which cannot adequately describe
complex remote sensing images because a scene may contain multiple land cover
classes. Few multi-label high spatial resolution remote sensing datasets have
been developed to train deep learning models for multi-label based tasks, such
as scene classification and image retrieval. To address this issue, in this
paper, we construct a multi-label high spatial resolution remote sensing
dataset named MLRSNet for semantic scene understanding with deep learning from
the overhead perspective. It is composed of high-resolution optical satellite
or aerial images. MLRSNet contains a total of 109,161 samples within 46 scene
categories, and each image has at least one of 60 predefined labels. We have
designed visual recognition tasks, including multi-label based image
classification and image retrieval, in which a wide variety of deep learning
approaches are evaluated with MLRSNet. The experimental results demonstrate
that MLRSNet is a significant benchmark for future research, and that it
complements current widely used datasets such as ImageNet by filling gaps in
multi-label image research. Furthermore, we will continue to expand MLRSNet.
MLRSNet and all related materials have been made publicly available at
https://data.mendeley.com/datasets/7j9bv9vwsx/2 and
https://github.com/cugbrs/MLRSNet.git.
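The multi-label classification task benchmarked on MLRSNet can be illustrated with a short sketch. The following is a minimal setup assuming images paired with 60-dimensional multi-hot label vectors; it is not the authors' exact training configuration, and the image size, annotation format, and hyperparameters are placeholders.

```python
# Minimal multi-label classification sketch in the spirit of the MLRSNet
# benchmark: 60 possible labels per image, sigmoid outputs, binary
# cross-entropy. Data loading and hyperparameters are placeholders.
import torch
import torch.nn as nn
from torchvision import models

NUM_LABELS = 60  # MLRSNet defines 60 predefined labels

# Backbone with a multi-label head (one logit per label).
model = models.resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_LABELS)

criterion = nn.BCEWithLogitsLoss()          # independent per-label loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stand-in batch: in practice, images and multi-hot label vectors would
# come from a DataLoader over the MLRSNet annotations.
images = torch.randn(8, 3, 256, 256)        # placeholder image size
targets = torch.zeros(8, NUM_LABELS)
targets[:, [3, 17]] = 1.0                    # each image has >= 1 label

logits = model(images)
loss = criterion(logits, targets)
loss.backward()
optimizer.step()

# At inference, thresholded sigmoid probabilities give the predicted label set.
predicted = (torch.sigmoid(logits) > 0.5).int()
```

Features from the penultimate layer of the same backbone can also serve the image retrieval task, for example by ranking gallery images with cosine similarity to a query embedding.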
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models contribute most to encoding global image information.
In this study, we find that the intermediate layers of models can encode more global semantic information.
We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z)
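As a rough illustration of the layer-wise probing idea in the entry above, the sketch below fits a linear probe on pooled per-layer features and compares accuracies across layers. The features here are synthetic stand-ins for hidden states extracted from a multimodal LLM; the paper's actual probing protocol is not reproduced.

```python
# Generic layer-wise linear probing sketch: fit a probe per layer and see
# how well each layer predicts a global, image-level label. Synthetic
# features stand in for real multimodal-LLM hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
num_layers, num_images, hidden_dim = 24, 500, 64

# features[l] holds one pooled hidden-state vector per image for layer l.
features = rng.normal(size=(num_layers, num_images, hidden_dim))
labels = rng.integers(0, 10, size=num_images)      # global scene label

for layer in range(num_layers):
    x_train, x_test, y_train, y_test = train_test_split(
        features[layer], labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    acc = probe.score(x_test, y_test)
    print(f"layer {layer:2d}: probe accuracy {acc:.3f}")
# Higher probe accuracy at a layer suggests that the layer encodes more
# global semantic information.
```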
- SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing [14.79627534702196]
We construct a vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags.
With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification.
It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval.
arXiv Detail & Related papers (2023-12-20T09:19:48Z)
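Zero-shot scene classification of the kind evaluated in the entry above can be sketched with a generic CLIP-style model: class names become text prompts, and the image is assigned to the most similar prompt embedding. The sketch uses OpenAI's public clip package and illustrative class names as stand-ins; SkyScript's own pretrained weights are not loaded here.

```python
# Generic CLIP-style zero-shot scene classification sketch (not SkyScript's
# actual model). Class names and the input image are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["airport", "farmland", "forest", "harbor", "residential area"]
prompts = [f"a satellite photo of a {c}" for c in classes]

# Stand-in image; in practice this would be a remote sensing scene.
image = preprocess(Image.new("RGB", (224, 224))).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(classes, probs.squeeze(0).tolist())))
```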
- CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding.
CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches.
With the simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset.
arXiv Detail & Related papers (2023-09-28T18:04:43Z)
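The masked image modeling idea underlying the entry above can be reduced to a bare-bones step: split the image into patches, hide a random subset, and reconstruct the hidden patches. The sketch below is generic MIM with toy encoder and decoder modules; CtxMIM's Siamese, context-enhanced design is not reproduced.

```python
# Bare-bones masked image modeling step: patchify, mask a random subset of
# patches, and reconstruct them, with the loss taken on masked patches only.
import torch
import torch.nn as nn

patch, dim, mask_ratio = 16, 256, 0.6
img = torch.randn(4, 3, 224, 224)                      # toy batch

# Patchify: (B, N, patch*patch*3) with N = (224 / 16)^2 = 196 patches.
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(4, -1, 3 * patch * patch)

encoder = nn.Sequential(nn.Linear(3 * patch * patch, dim), nn.GELU(),
                        nn.Linear(dim, dim))
decoder = nn.Linear(dim, 3 * patch * patch)
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

mask = torch.rand(patches.shape[:2]) < mask_ratio       # True = masked patch
inp = patches.clone()
inp[mask] = 0.0                                          # hide masked patches

recon = decoder(encoder(inp))
loss = ((recon - patches) ** 2)[mask].mean()             # masked patches only
loss.backward()
opt.step()
```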
- BigDatasetGAN: Synthesizing ImageNet with Pixel-wise Annotations [89.42397034542189]
We synthesize a large labeled dataset via a generative adversarial network (GAN).
We take image samples from the class-conditional generative model BigGAN trained on ImageNet, and manually annotate 5 images per class, for all 1k classes.
We create a new ImageNet benchmark by labeling an additional set of 8k real images and evaluate segmentation performance in a variety of settings.
arXiv Detail & Related papers (2022-01-12T20:28:34Z)
- Aerial Scene Parsing: From Tile-level Scene Classification to Pixel-wise Semantic Labeling [48.30060717413166]
Given an aerial image, aerial scene parsing (ASP) aims to interpret the semantic structure of the image content by assigning a semantic label to every pixel of the image.
We present Million-AID, a large-scale scene classification dataset that contains one million aerial images.
We also report benchmarking experiments using classical convolutional neural networks (CNNs) to achieve pixel-wise semantic labeling.
arXiv Detail & Related papers (2022-01-06T07:40:47Z)
- Remote Sensing Images Semantic Segmentation with General Remote Sensing Vision Model via a Self-Supervised Contrastive Learning Method [13.479068312825781]
We propose Global style and Local matching Contrastive Learning Network (GLCNet) for remote sensing semantic segmentation.
Specifically, the global style contrastive module is used to learn an image-level representation better.
The local features matching contrastive module is designed to learn representations of local regions, which benefits semantic segmentation.
arXiv Detail & Related papers (2021-06-20T03:03:40Z)
- MultiScene: A Large-scale Dataset and Benchmark for Multi-scene Recognition in Single Aerial Images [17.797726722637634]
We create a large-scale dataset, called MultiScene, composed of 100,000 high-resolution aerial images.
We visually inspect 14,000 images and correct their scene labels, yielding a subset of cleanly-annotated images, named MultiScene-Clean.
We conduct experiments with extensive baseline models on both MultiScene-Clean and MultiScene to offer benchmarks for multi-scene recognition in single images.
arXiv Detail & Related papers (2021-04-07T01:09:12Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
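The dual-encoder contrastive alignment described in the entry above boils down to a symmetric InfoNCE objective over a batch of paired image and text embeddings. A minimal sketch, with random embeddings standing in for the encoder outputs:

```python
# Minimal symmetric contrastive (InfoNCE) objective for a dual-encoder
# setup: matched image/text pairs are pulled together; all other pairs in
# the batch act as negatives. Random embeddings stand in for encoders.
import torch
import torch.nn.functional as F

batch, dim, temperature = 32, 512, 0.07

image_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # image encoder output
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)    # text encoder output

logits = image_emb @ text_emb.T / temperature              # pairwise similarities
targets = torch.arange(batch)                              # i-th image matches i-th text

# Symmetric loss: image-to-text and text-to-image cross-entropy.
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```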
- X-ModalNet: A Semi-Supervised Deep Cross-Modal Network for Classification of Remote Sensing Data [69.37597254841052]
We propose a novel cross-modal deep-learning framework called X-ModalNet.
X-ModalNet generalizes well, owing to label propagation on an updatable graph constructed from high-level features at the top of the network.
We evaluate X-ModalNet on two multi-modal remote sensing datasets (HSI-MSI and HSI-SAR) and achieve a significant improvement in comparison with several state-of-the-art methods.
arXiv Detail & Related papers (2020-06-24T15:29:41Z)
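The label-propagation idea mentioned in the entry above can be illustrated with a generic graph-based semi-supervised step: build a kNN graph over high-level features and spread the few known labels to unlabeled samples. The sketch uses scikit-learn's LabelSpreading on synthetic features; X-ModalNet's updatable, end-to-end graph is not reproduced.

```python
# Generic graph-based label propagation sketch on high-level features.
# Synthetic data; not X-ModalNet's exact mechanism.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 64))        # stand-in for network features
labels = rng.integers(0, 5, size=300)

# Keep only 10% of the labels; -1 marks unlabeled samples.
y = np.full(300, -1)
known = rng.choice(300, size=30, replace=False)
y[known] = labels[known]

model = LabelSpreading(kernel="knn", n_neighbors=10, alpha=0.2)
model.fit(features, y)

acc = (model.transduction_ == labels).mean()
print(f"transductive accuracy on random data: {acc:.3f}")
```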
- RGB-based Semantic Segmentation Using Self-Supervised Depth Pre-Training [77.62171090230986]
We propose an easily scalable and self-supervised technique that can be used to pre-train any semantic RGB segmentation method.
In particular, our pre-training approach makes use of automatically generated labels that can be obtained using depth sensors.
We show how our proposed self-supervised pre-training with HN-labels can be used to replace ImageNet pre-training.
arXiv Detail & Related papers (2020-02-06T11:16:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.