SATIN: A Multi-Task Metadataset for Classifying Satellite Imagery using Vision-Language Models
- URL: http://arxiv.org/abs/2304.11619v1
- Date: Sun, 23 Apr 2023 11:23:05 GMT
- Title: SATIN: A Multi-Task Metadataset for Classifying Satellite Imagery using Vision-Language Models
- Authors: Jonathan Roberts, Kai Han, Samuel Albanie
- Abstract summary: We introduce SATellite ImageNet (SATIN), a metadataset curated from 27 existing remotely sensed datasets.
We comprehensively evaluate the zero-shot transfer classification capabilities of a broad range of vision-language (VL) models on SATIN.
We find SATIN to be a challenging benchmark: the strongest method we evaluate achieves a classification accuracy of 52.0%.
- Score: 33.814335088752046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interpreting remote sensing imagery enables numerous downstream applications
ranging from land-use planning to deforestation monitoring. Robustly
classifying this data is challenging due to the Earth's geographic diversity.
While many distinct satellite and aerial image classification datasets exist,
there is yet to be a benchmark curated that suitably covers this diversity. In
this work, we introduce SATellite ImageNet (SATIN), a metadataset curated from
27 existing remotely sensed datasets, and comprehensively evaluate the
zero-shot transfer classification capabilities of a broad range of
vision-language (VL) models on SATIN. We find SATIN to be a challenging
benchmark: the strongest method we evaluate achieves a classification accuracy
of 52.0%. We provide a public leaderboard (https://satinbenchmark.github.io)
to guide and track the progress of VL models in this important domain.
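The zero-shot transfer evaluation described above follows the standard CLIP-style protocol: class names are turned into text prompts, and each image is assigned the class whose prompt embedding is most similar to the image embedding. Below is a minimal sketch of that protocol using the Hugging Face transformers CLIP API; the checkpoint, class labels, prompt template, and file path are illustrative assumptions, not SATIN's exact evaluation setup.

```python
# Minimal sketch (not SATIN's official code) of zero-shot scene classification
# with a CLIP-style vision-language model via Hugging Face `transformers`.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical land-use classes; each SATIN task defines its own label set.
classes = ["forest", "river", "residential area", "farmland", "airport"]
prompts = [f"a satellite photo of a {c}" for c in classes]

image = Image.open("scene.png")  # placeholder path for a satellite image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
# logits_per_image holds image-text similarity scores; the highest-scoring
# prompt gives the zero-shot prediction.
probs = outputs.logits_per_image.softmax(dim=-1)
print(classes[probs.argmax(-1).item()])
```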
Related papers
- GeoVision Labeler: Zero-Shot Geospatial Classification with Vision and Language Models [3.5759681393339697]
We introduce GeoVision Labeler (GVL), a strictly zero-shot classification framework.
GVL generates rich, human-readable image descriptions, which are then mapped to user-defined classes.
It achieves up to 93.2% zero-shot accuracy on the binary Buildings vs. No Buildings task on SpaceNet v7.
arXiv Detail & Related papers (2025-05-30T08:32:37Z)
- COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking [52.62149024881728]
We propose a contrastive one-stage transformer fusion framework for vision-language (VL) tracking.
We introduce a contrastive alignment strategy that maximizes mutual information between a video and its corresponding language description.
By leveraging a visual-linguistic transformer, we establish an efficient multi-modal fusion and reasoning mechanism.
arXiv Detail & Related papers (2025-04-02T03:12:38Z)
- Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark [6.693781685584959]
Low-altitude, Multi-view UAV AVL presents greater challenges due to extreme viewpoint changes.
This benchmark reveals the challenges of Low-altitude, Multi-view UAV AVL and provides valuable guidance for future research.
arXiv Detail & Related papers (2025-03-12T03:29:27Z)
- A Recipe for Improving Remote Sensing VLM Zero Shot Generalization [0.4427533728730559]
We present two novel image-caption datasets for training remote sensing foundation models.
The first dataset pairs aerial and satellite imagery with captions generated by Gemini using landmarks extracted from Google Maps.
The second dataset utilizes public web images and their corresponding alt-text, filtered for the remote sensing domain.
arXiv Detail & Related papers (2025-03-10T21:09:02Z)
- GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks.
Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales.
We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z)
- SiamSeg: Self-Training with Contrastive Learning for Unsupervised Domain Adaptation Semantic Segmentation in Remote Sensing [13.549403813487022]
Unsupervised domain adaptation (UDA) enables models to learn from unlabeled target domain data while leveraging labeled source domain data.
We propose integrating contrastive learning into UDA, enhancing the model's ability to capture semantic information in the target domain.
Our method, SiamSeg, outperforms existing approaches, achieving state-of-the-art results.
arXiv Detail & Related papers (2024-10-17T11:59:39Z)
- VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding [41.74095171149082]
We present a Versatile vision-language Benchmark for Remote Sensing image understanding, termed VRSBench.
This benchmark comprises 29,614 images, with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs.
We further evaluated state-of-the-art models on this benchmark for three vision-language tasks: image captioning, visual grounding, and visual question answering.
arXiv Detail & Related papers (2024-06-18T08:15:21Z)
- FRACTAL: An Ultra-Large-Scale Aerial Lidar Dataset for 3D Semantic Segmentation of Diverse Landscapes [0.0]
We present an ultra-large-scale aerial Lidar dataset made of 100,000 dense point clouds with high-quality labels for 7 semantic classes.
We describe the data collection, annotation, and curation process of the dataset.
We provide baseline semantic segmentation results using a state-of-the-art 3D point cloud classification model.
arXiv Detail & Related papers (2024-05-07T19:37:22Z)
- FlightScope: A Deep Comprehensive Review of Aircraft Detection Algorithms in Satellite Imagery [2.9687381456164004]
This paper critically evaluates and compares a suite of advanced object detection algorithms customized for the task of identifying aircraft within satellite imagery.
This research encompasses an array of methodologies including YOLO versions 5 and 8, Faster RCNN, CenterNet, RetinaNet, RTMDet, and DETR, all trained from scratch.
YOLOv5 emerges as a robust solution for aerial object detection, achieving superior mean average precision, recall, and intersection-over-union scores.
arXiv Detail & Related papers (2024-04-03T17:24:27Z)
- Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data.
In this paper, we revisit transformer pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z)
- SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing [14.79627534702196]
We construct a vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags.
With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification.
It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval.
arXiv Detail & Related papers (2023-12-20T09:19:48Z)
- What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation [2.7036595757881323]
We build a benchmark for Multi-domain Evaluation of Semantic Segmentation (MESS).
MESS allows a holistic analysis of performance across a wide range of domain-specific datasets.
We evaluate eight recently published models on the proposed MESS benchmark and analyze characteristics for the performance of zero-shot transfer models.
arXiv Detail & Related papers (2023-06-27T14:47:43Z)
- Large-scale Unsupervised Semantic Segmentation [163.3568726730319]
We propose a new problem of large-scale unsupervised semantic segmentation (LUSS) with a newly created benchmark dataset to track the research progress.
Based on the ImageNet dataset, we propose the ImageNet-S dataset with 1.2 million training images and 40k high-quality semantic segmentation annotations for evaluation.
arXiv Detail & Related papers (2021-06-06T15:02:11Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
- Visual Tracking by TridentAlign and Context Embedding [71.60159881028432]
We propose novel TridentAlign and context embedding modules for Siamese network-based visual tracking methods.
The performance of the proposed tracker is comparable to that of state-of-the-art trackers, while the proposed tracker runs at real-time speed.
arXiv Detail & Related papers (2020-07-14T08:00:26Z)
- X-ModalNet: A Semi-Supervised Deep Cross-Modal Network for Classification of Remote Sensing Data [69.37597254841052]
We propose a novel cross-modal deep-learning framework called X-ModalNet.
X-ModalNet generalizes well, owing to label propagation on an updatable graph constructed from high-level features at the top of the network.
We evaluate X-ModalNet on two multi-modal remote sensing datasets (HSI-MSI and HSI-SAR) and achieve a significant improvement in comparison with several state-of-the-art methods.
arXiv Detail & Related papers (2020-06-24T15:29:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.