ViTs for SITS: Vision Transformers for Satellite Image Time Series
- URL: http://arxiv.org/abs/2301.04944v3
- Date: Fri, 14 Apr 2023 09:56:37 GMT
- Title: ViTs for SITS: Vision Transformers for Satellite Image Time Series
- Authors: Michail Tarasiou, Erik Chavez, Stefanos Zafeiriou
- Abstract summary: We introduce a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT).
TSViT splits a SITS record into non-overlapping patches in space and time which are tokenized and subsequently processed by a factorized temporo-spatial encoder.
- Score: 52.012084080257544
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a
fully-attentional model for general Satellite Image Time Series (SITS)
processing based on the Vision Transformer (ViT). TSViT splits a SITS record
into non-overlapping patches in space and time which are tokenized and
subsequently processed by a factorized temporo-spatial encoder. We argue that,
in contrast to natural images, a temporal-then-spatial factorization is more
intuitive for SITS processing and present experimental evidence for this claim.
Additionally, we enhance the model's discriminative power by introducing two
novel mechanisms for acquisition-time-specific temporal positional encodings
and multiple learnable class tokens. The effect of all novel design choices is
evaluated through an extensive ablation study. Our proposed architecture
achieves state-of-the-art performance, surpassing previous approaches by a
significant margin in three publicly available SITS semantic segmentation and
classification datasets. All model, training and evaluation codes are made
publicly available to facilitate further research.
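As a reading aid, the sketch below illustrates the temporal-then-spatial factorization described in the abstract: space-time patches are tokenized, given acquisition-time-specific positional encodings, passed through a temporal encoder with multiple learnable class tokens, and then through a spatial encoder. All shapes, layer sizes and the day-of-year lookup are illustrative assumptions; the authors' released code remains the reference implementation.
```python
# Rough PyTorch sketch of a temporal-then-spatial factorized encoder (assumed
# shapes and hyperparameters, not the TSViT reference code).
import torch
import torch.nn as nn

class TemporoSpatialEncoder(nn.Module):
    def __init__(self, in_dim, embed_dim=128, depth=4, heads=4,
                 num_classes=19, max_doy=366):
        super().__init__()
        self.tokenize = nn.Linear(in_dim, embed_dim)          # one token per space-time patch
        self.temporal_pos = nn.Embedding(max_doy, embed_dim)  # acquisition-time-specific encoding
        self.cls_tokens = nn.Parameter(torch.randn(num_classes, embed_dim) * 0.02)
        make_layer = lambda: nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(make_layer(), depth)
        self.spatial = nn.TransformerEncoder(make_layer(), depth)

    def forward(self, patches, doy):
        # patches: (B, T, N, in_dim) flattened non-overlapping space-time patches
        # doy:     (B, T) integer acquisition day-of-year for each image in the series
        B, T, N, _ = patches.shape
        x = self.tokenize(patches) + self.temporal_pos(doy)[:, :, None, :]
        # 1) temporal encoder: attend across time independently for each patch location,
        #    with one learnable class token per semantic class prepended to the sequence.
        x = x.permute(0, 2, 1, 3).reshape(B * N, T, -1)
        K = self.cls_tokens.shape[0]
        cls = self.cls_tokens.unsqueeze(0).expand(B * N, K, -1)
        x = self.temporal(torch.cat([cls, x], dim=1))[:, :K]   # keep only the class tokens
        # 2) spatial encoder: attend across patch locations separately for each class token.
        x = x.reshape(B, N, K, -1).permute(0, 2, 1, 3).reshape(B * K, N, -1)
        x = self.spatial(x)
        return x.reshape(B, K, N, -1)  # per-class, per-patch features for a segmentation head
```
Processing time before space reflects the argument above that, for SITS, temporal patterns at a fixed location are more informative to aggregate first than spatial context within a single acquisition.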
Related papers
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
The self-attention mechanism in the Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
- Multi-Modal Vision Transformers for Crop Mapping from Satellite Image Time Series [2.5245269564204653]
Existing state-of-the-art architectures use self-attention mechanisms to process the temporal dimension and convolutions for the spatial dimension of SITS.
Motivated by the success of purely attention-based architectures in crop mapping from single-modal SITS, we introduce several multi-modal multi-temporal transformer-based architectures.
Experimental results demonstrate significant improvements over state-of-the-art architectures with both convolutional and self-attention components.
arXiv Detail & Related papers (2024-06-24T10:40:46Z)
- TimeTuner: Diagnosing Time Representations for Time-Series Forecasting with Counterfactual Explanations [3.8357850372472915]
This paper contributes a novel visual analytics framework, namely TimeTuner, to help analysts understand how model behaviors are associated with localized, stationarity, and correlations of time-series representations.
We show that TimeTuner can help characterize time-series representations and guide the feature engineering processes.
arXiv Detail & Related papers (2023-07-19T11:40:15Z)
- Revisiting the Encoding of Satellite Image Time Series [2.5874041837241304]
Satellite Image Time Series (SITS) temporal learning is complex due to high temporal resolutions and irregular acquisition times.
We develop a novel perspective of SITS processing as a direct set prediction problem, inspired by the recent trend in adopting query-based transformer decoders.
We attain new state-of-the-art (SOTA) results on the PASTIS benchmark dataset.
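To make the query-based set prediction idea concrete, here is a minimal DETR-style sketch in which a fixed set of learned queries cross-attends to encoded SITS tokens; the module names, sizes and the extra "no object" class are assumptions for illustration, not that paper's implementation.
```python
# Illustrative query-based set prediction head: learned queries cross-attend
# to encoded SITS features and each query predicts one element of the output set.
import torch
import torch.nn as nn

class SetPredictionHead(nn.Module):
    def __init__(self, embed_dim=128, num_queries=16, depth=4, heads=4, num_classes=19):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(embed_dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, depth)
        self.classify = nn.Linear(embed_dim, num_classes + 1)  # +1 for a "no object" slot

    def forward(self, memory):
        # memory: (B, L, D) encoded SITS tokens from any temporal/spatial encoder
        q = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        out = self.decoder(q, memory)   # each query attends to the whole encoded sequence
        return self.classify(out)       # (B, num_queries, num_classes + 1)
```
During training such heads are typically paired with a bipartite-matching set loss, as popularized by DETR-style detectors.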
arXiv Detail & Related papers (2023-05-03T12:44:20Z)
- FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It exhibits three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
- SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery [74.82821342249039]
We present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on Masked Autoencoder (MAE).
To leverage temporal information, we include a temporal embedding along with independently masking image patches across time.
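A minimal sketch of that masking strategy follows, assuming per-date temporal embeddings and a fixed mask ratio (both illustrative choices, not SatMAE's exact recipe).
```python
# Hedged sketch: add a temporal embedding per acquisition date, then mask
# patch tokens independently at every timestep before the MAE encoder.
import torch

def mask_patches_per_timestep(tokens, temporal_embed, mask_ratio=0.75):
    # tokens:         (B, T, N, D) patch embeddings for T acquisition dates
    # temporal_embed: (T, D) one embedding per date, broadcast over the N patches
    B, T, N, D = tokens.shape
    x = tokens + temporal_embed[None, :, None, :]
    keep = int(N * (1 - mask_ratio))
    # Independently sample which patches survive at each timestep.
    scores = torch.rand(B, T, N, device=tokens.device)
    keep_idx = scores.argsort(dim=-1)[..., :keep]                       # (B, T, keep)
    visible = torch.gather(x, 2, keep_idx.unsqueeze(-1).expand(-1, -1, -1, D))
    return visible, keep_idx  # encode `visible`; the decoder reconstructs the masked patches
```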
arXiv Detail & Related papers (2022-07-17T01:35:29Z)
- Tampered VAE for Improved Satellite Image Time Series Classification [1.933681537640272]
Pyramid Time-Series Transformer (PTST) operates solely on the temporal dimension.
We propose a classification-friendly VAE framework that introduces clustering mechanisms into latent space.
We hope the proposed framework can serve as a baseline for crop classification with SITS, given its modularity and simplicity.
arXiv Detail & Related papers (2022-03-30T08:48:06Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other few-shot learning frameworks that use ViTs, and is the first to outperform CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
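The long-short idea in the entry above can be illustrated with a generic InfoNCE-style objective between a short clip and its longer temporal context; the function below is an assumption-laden sketch (names and the temperature are illustrative), not that paper's exact loss.
```python
# Generic contrastive objective: align each short-clip embedding with the
# embedding of a longer temporal extent from the same video.
import torch
import torch.nn.functional as F

def long_short_contrastive_loss(short_emb, long_emb, temperature=0.1):
    # short_emb, long_emb: (B, D) clip-level embeddings; row i of each tensor
    # comes from the same video, so the diagonal pairs are the positives.
    z_s = F.normalize(short_emb, dim=-1)
    z_l = F.normalize(long_emb, dim=-1)
    logits = z_s @ z_l.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(z_s.size(0), device=z_s.device)
    return F.cross_entropy(logits, targets)         # InfoNCE over in-batch negatives
```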