MEET: A Million-Scale Dataset for Fine-Grained Geospatial Scene Classification with Zoom-Free Remote Sensing Imagery
- URL: http://arxiv.org/abs/2503.11219v1
- Date: Fri, 14 Mar 2025 09:10:45 GMT
- Title: MEET: A Million-Scale Dataset for Fine-Grained Geospatial Scene Classification with Zoom-Free Remote Sensing Imagery
- Authors: Yansheng Li, Yuning Wu, Gong Cheng, Chao Tao, Bo Dang, Yu Wang, Jiahao Zhang, Chuge Zhang, Yiting Liu, Xu Tang, Jiayi Ma, Yongjun Zhang
- Abstract summary: We introduce the Million-scale finE-grained geospatial scEne classification dataseT (MEET). MEET contains over 1.03 million zoom-free remote sensing scene samples, manually annotated into 80 fine-grained categories. To tackle the emerging challenge of scene-in-scene classification, we present the Context-Aware Transformer (CAT).
- Score: 37.588938028708405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate fine-grained geospatial scene classification using remote sensing imagery is essential for a wide range of applications. However, existing approaches often rely on manually zooming remote sensing images at different scales to create typical scene samples, which fails to adequately support the fixed-resolution image interpretation requirements of real-world scenarios. To address this limitation, we introduce the Million-scale finE-grained geospatial scEne classification dataseT (MEET), which contains over 1.03 million zoom-free remote sensing scene samples, manually annotated into 80 fine-grained categories. In MEET, each scene sample follows a scene-in-scene layout, where the central scene serves as the reference and auxiliary scenes provide crucial spatial context for fine-grained classification. Moreover, to tackle the emerging challenge of scene-in-scene classification, we present the Context-Aware Transformer (CAT), a model designed for this task that adaptively fuses spatial context to accurately classify scene samples by learning attentional features that capture the relationships between the center and auxiliary scenes. Based on MEET, we establish a comprehensive benchmark for fine-grained geospatial scene classification, evaluating CAT against 11 competitive baselines. The results demonstrate that CAT significantly outperforms these baselines, achieving 1.88% higher balanced accuracy (BA) with the Swin-Large backbone and a notable 7.87% improvement with the Swin-Huge backbone. Further experiments validate the effectiveness of each module in CAT and show its practical applicability to urban functional zone mapping. The source code and dataset will be publicly available at https://jerrywyn.github.io/project/MEET.html.
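The paper's code is not reproduced here, but the scene-in-scene idea above can be pictured as a center embedding attending to auxiliary-context embeddings. The sketch below is a minimal, hypothetical version of such a fusion head; all module names, dimensions, and the fusion design are assumptions, not the authors' CAT implementation:

```python
# Minimal sketch of scene-in-scene context fusion (not the authors' code).
# Assumes a shared backbone has already embedded the central scene and
# N auxiliary context crops; the center token attends to the auxiliary tokens.
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    def __init__(self, dim=256, heads=4, num_classes=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)  # 80 fine-grained MEET classes

    def forward(self, center_feat, aux_feats):
        # center_feat: (B, D); aux_feats: (B, N, D)
        q = center_feat.unsqueeze(1)                 # (B, 1, D) query token
        ctx, _ = self.attn(q, aux_feats, aux_feats)  # attend to auxiliary scenes
        fused = self.norm(q + ctx).squeeze(1)        # residual fusion
        return self.head(fused)

center = torch.randn(2, 256)        # center scene embedding
aux = torch.randn(2, 8, 256)        # 8 auxiliary scene embeddings
logits = ContextFusion()(center, aux)   # (2, 80)
```

Cross-attention lets the center token weight each auxiliary scene by relevance, which matches the abstract's description of adaptively fused spatial context.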
Related papers
- Fine-grained Recognition with Learnable Semantic Data Augmentation [68.48892326854494]
Fine-grained image recognition is a longstanding computer vision challenge.
We propose diversifying the training data at the feature level to alleviate the discriminative region loss problem.
Our method significantly improves the generalization performance on several popular classification networks.
arXiv Detail & Related papers (2023-09-01T11:15:50Z)
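As a rough illustration of the feature-level diversification described in the entry above, one common recipe (in the spirit of ISDA-style augmentation, not necessarily this paper's exact method) perturbs deep features along class-conditional directions:

```python
# Hedged sketch of feature-level augmentation: perturb deep features along
# class-conditional directions instead of editing pixels. The diagonal
# covariance here is a toy simplification.
import torch

def augment_features(feats, labels, class_var, strength=0.5):
    # feats: (B, D); class_var: (C, D) per-class diagonal feature variance
    var = class_var[labels]                       # (B, D) per-sample variance
    noise = torch.randn_like(feats) * var.sqrt()  # sample semantic directions
    return feats + strength * noise

feats = torch.randn(4, 128)
labels = torch.tensor([0, 1, 2, 0])
class_var = torch.rand(10, 128)      # toy per-class variance estimates
augmented = augment_features(feats, labels, class_var)
```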
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that, by learning global context over the full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
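A minimal sketch of the local/global attention split named in the HLG entry above, assuming tokens are grouped into fixed-size windows (an illustrative simplification, not the HLG code):

```python
# Rough sketch of a local/global attention block: tokens attend within fixed
# windows, then per-window mean tokens attend globally across windows.
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    def __init__(self, dim=192, heads=3, window=4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        B, L, D = x.shape                        # L must divide by window size
        w = x.view(B * L // self.window, self.window, D)
        w, _ = self.local_attn(w, w, w)          # attention within each window
        x = w.view(B, L, D)
        g = x.view(B, L // self.window, self.window, D).mean(2)  # window tokens
        g, _ = self.global_attn(g, g, g)         # attention across windows
        return x + g.repeat_interleave(self.window, dim=1)  # broadcast back

tokens = torch.randn(2, 16, 192)
out = LocalGlobalBlock()(tokens)    # (2, 16, 192)
```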
- A Comprehensive Study of Image Classification Model Sensitivity to Foregrounds, Backgrounds, and Visual Attributes [58.633364000258645]
We introduce RIVAL10, a dataset consisting of roughly $26k$ instances over $10$ classes.
We evaluate the sensitivity of a broad set of models to noise corruptions in foregrounds, backgrounds and attributes.
In our analysis, we consider diverse state-of-the-art architectures (ResNets, Transformers) and training procedures (CLIP, SimCLR, DeiT, Adversarial Training).
arXiv Detail & Related papers (2022-01-26T06:31:28Z)
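The sensitivity analysis in the RIVAL10 entry above can be pictured as corrupting one region while holding the other fixed. A toy sketch, assuming per-image foreground masks are available:

```python
# Illustrative probe (assumed setup, not the paper's exact protocol):
# corrupt only the background via a foreground mask, then compare a model's
# predictions on clean vs. corrupted inputs to measure sensitivity.
import torch

def background_noise(img, fg_mask, sigma=0.3):
    # img: (B, 3, H, W); fg_mask: (B, 1, H, W) with 1 on the foreground
    noise = sigma * torch.randn_like(img)
    return img + noise * (1.0 - fg_mask)   # foreground pixels stay clean

imgs = torch.rand(2, 3, 32, 32)
masks = torch.zeros(2, 1, 32, 32)
masks[..., 8:24, 8:24] = 1.0               # toy foreground box
corrupted = background_noise(imgs, masks)
# Sensitivity ~ accuracy(model, imgs) - accuracy(model, corrupted)
```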
- Aerial Scene Parsing: From Tile-level Scene Classification to Pixel-wise Semantic Labeling [48.30060717413166]
Given an aerial image, aerial scene parsing (ASP) aims to interpret the semantic structure of the image content by assigning a semantic label to every pixel.
We present Million-AID, a large-scale scene classification dataset containing one million aerial images.
We also report benchmarking experiments using classical convolutional neural networks (CNNs) to achieve pixel-wise semantic labeling.
arXiv Detail & Related papers (2022-01-06T07:40:47Z)
- SGMNet: Scene Graph Matching Network for Few-Shot Remote Sensing Scene Classification [14.016637774748677]
Few-Shot Remote Sensing Scene Classification (FSRSSC) is an important task that aims to recognize novel scene classes from only a few examples.
We propose a novel scene graph matching-based meta-learning framework for FSRSSC, called SGMNet.
We conduct extensive experiments on UCMerced LandUse, WHU19, AID, and NWPU-RESISC45 datasets.
arXiv Detail & Related papers (2021-10-09T07:43:40Z)
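One simple way to picture graph-based matching for few-shot classification, as in the SGMNet entry above, is to score a query scene graph against a support scene graph by node-feature similarity (a loose sketch, not SGMNet's actual matching module):

```python
# Toy graph-matching score: for each query node, take its best-matching
# support node by cosine similarity, then average over query nodes.
import torch
import torch.nn.functional as F

def graph_match_score(query_nodes, support_nodes):
    # query_nodes: (Nq, D); support_nodes: (Ns, D)
    q = F.normalize(query_nodes, dim=-1)
    s = F.normalize(support_nodes, dim=-1)
    sim = q @ s.T                        # (Nq, Ns) pairwise cosine similarity
    return sim.max(dim=1).values.mean()  # best match per query node, averaged

q = torch.randn(5, 64)   # 5 nodes in the query scene graph
s = torch.randn(7, 64)   # 7 nodes in a support scene graph
print(graph_match_score(q, s))
```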
- Free Lunch for Co-Saliency Detection: Context Adjustment [14.688461235328306]
We propose a "cost-free" group-cut-paste (GCP) procedure to leverage images from off-the-shelf saliency detection datasets and synthesize new samples.
We collect a novel dataset called Context Adjustment Training; its two variants, CAT and CAT+, consist of 16,750 and 33,500 images, respectively.
arXiv Detail & Related papers (2021-08-04T14:51:37Z)
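At its core, the group-cut-paste idea above amounts to compositing a masked foreground into another image. A toy sketch with simplified masking (GCP's actual grouping and placement logic is more involved):

```python
# Toy cut-paste synthesis: blend a masked object from a source image into a
# destination image to create a new training sample.
import torch

def cut_paste(src, src_mask, dst):
    # src, dst: (3, H, W); src_mask: (1, H, W) with 1 on the pasted object
    return src * src_mask + dst * (1 - src_mask)

src = torch.rand(3, 64, 64)
dst = torch.rand(3, 64, 64)
mask = torch.zeros(1, 64, 64)
mask[:, 20:44, 20:44] = 1.0          # toy object mask
new_sample = cut_paste(src, mask, dst)
```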
- Detecting Cattle and Elk in the Wild from Space [6.810164473908359]
Localizing and counting large ungulates in satellite imagery is an important task for supporting ecological studies.
We propose a baseline method, CowNet, that simultaneously estimates the number of animals in an image (counts) and predicts their location at a pixel level (localizes).
We specifically test the temporal generalization of the resulting models over a large landscape in Point Reyes Seashore, CA.
arXiv Detail & Related papers (2021-06-29T14:35:23Z)
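A common design for joint counting and localization, consistent with the CowNet description above, predicts a non-negative density map whose integral is the count (an assumed sketch, not the released model):

```python
# Density-map head: the map localizes animals at the pixel level, and its
# spatial sum gives the per-image count.
import torch
import torch.nn as nn

class DensityHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.ReLU(),   # non-negative density map
        )

    def forward(self, x):
        density = self.net(x)                 # (B, 1, H, W) localization map
        count = density.sum(dim=(1, 2, 3))    # (B,) estimated counts
        return density, count

density, count = DensityHead()(torch.rand(2, 3, 64, 64))
```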
- CAT: Cross-Attention Transformer for One-Shot Object Detection [32.50786038822194]
Given a single query image of a novel class, one-shot object detection aims to detect all instances of that class in a target image through semantic similarity comparison.
We present a universal Cross-Attention Transformer (CAT) module for accurate and efficient semantic similarity comparison in one-shot object detection.
arXiv Detail & Related papers (2021-04-30T13:18:53Z)
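The cross-attention comparison in the one-shot detection entry above can be sketched as target-image features attending to query-patch features (module and variable names here are illustrative, not the paper's implementation):

```python
# Minimal query/target cross-attention for one-shot matching: target tokens
# query the exemplar tokens and are enriched with class-specific evidence.
import torch
import torch.nn as nn

class CrossAttnMatcher(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, target_tokens, query_tokens):
        # target_tokens: (B, Lt, D) image features; query_tokens: (B, Lq, D)
        out, _ = self.attn(target_tokens, query_tokens, query_tokens)
        return target_tokens + out   # target features enriched by the query

target = torch.randn(1, 400, 256)    # 20x20 target feature map, flattened
query = torch.randn(1, 49, 256)      # 7x7 query-patch features, flattened
enriched = CrossAttnMatcher()(target, query)
```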
- iFAN: Image-Instance Full Alignment Networks for Adaptive Object Detection [48.83883375118966]
iFAN aims to precisely align feature distributions at both the image and instance levels.
It outperforms state-of-the-art methods with a boost of 10%+ AP over the source-only baseline.
arXiv Detail & Related papers (2020-03-09T13:27:06Z)
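One standard way to implement the image-level alignment named in the iFAN entry above is a domain discriminator trained through a gradient-reversal layer; the sketch below shows that general technique, not iFAN's exact architecture:

```python
# Hedged sketch of adversarial domain alignment: a discriminator classifies
# source vs. target features, and a gradient-reversal layer pushes the
# backbone toward domain-invariant features.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad):
        return -grad                 # reverse gradients into the backbone

disc = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))
feats = torch.randn(8, 256, requires_grad=True)   # backbone features
domain_logit = disc(GradReverse.apply(feats))     # source-vs-target logit
loss = nn.functional.binary_cross_entropy_with_logits(
    domain_logit, torch.ones(8, 1))               # toy domain labels
loss.backward()
```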
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.