FactoFormer: Factorized Hyperspectral Transformers with Self-Supervised
Pretraining
- URL: http://arxiv.org/abs/2309.09431v4
- Date: Thu, 4 Jan 2024 01:05:16 GMT
- Title: FactoFormer: Factorized Hyperspectral Transformers with Self-Supervised
Pretraining
- Authors: Shaheer Mohamed, Maryam Haghighat, Tharindu Fernando, Sridha
Sridharan, Clinton Fookes, Peyman Moghadam
- Abstract summary: Hyperspectral images (HSIs) contain rich spectral and spatial information.
Current state-of-the-art hyperspectral transformers only tokenize the input HSI sample along the spectral dimension.
We propose a novel factorized spectral-spatial transformer that incorporates factorized self-supervised pretraining procedures.
- Score: 36.44039681893334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hyperspectral images (HSIs) contain rich spectral and spatial information.
Motivated by the success of transformers in the fields of natural language
processing and computer vision, where they have shown the ability to learn
long-range dependencies within input data, recent research has focused on using
transformers for HSIs. However, current state-of-the-art hyperspectral
transformers only tokenize the input HSI sample along the spectral dimension,
resulting in the under-utilization of spatial information. Moreover,
transformers are known to be data-hungry and their performance relies heavily
on large-scale pretraining, which is challenging due to limited annotated
hyperspectral data. Therefore, the potential of HSI transformers has not yet
been fully realized. To overcome these limitations, we propose a novel
factorized spectral-spatial transformer that incorporates factorized
self-supervised pretraining procedures, leading to significant improvements in
performance. The factorization of the inputs allows the spectral and spatial
transformers to better capture the interactions within the hyperspectral data
cubes. Inspired by masked image modeling pretraining, we also devise efficient
masking strategies for pretraining each of the spectral and spatial
transformers. We conduct experiments on six publicly available datasets for the
HSI classification task and demonstrate that our model achieves state-of-the-art
performance on all of them. The code for our model will be made available
at https://github.com/csiro-robotics/factoformer.
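To make the factorization concrete, below is a minimal PyTorch sketch of the core idea described in the abstract: the HSI cube is tokenized twice, once along the spectral dimension (one token per band) and once along the spatial dimension (one token per pixel), and each token sequence is processed by its own transformer before fusion. The module names, the band/patch sizes, the mean pooling, and the late-fusion classifier head are illustrative assumptions, not the authors' exact FactoFormer architecture; the official implementation is at the repository linked above.

```python
import torch
import torch.nn as nn


class FactorizedHSIEncoder(nn.Module):
    """Two-branch sketch: spectral tokens (one per band) and spatial tokens (one per pixel)."""

    def __init__(self, bands=200, patch=9, dim=64, depth=2, heads=4, num_classes=16):
        super().__init__()
        # Spectral branch: each band becomes a token embedded from its patch*patch spatial footprint.
        self.spectral_embed = nn.Linear(patch * patch, dim)
        # Spatial branch: each pixel becomes a token embedded from its full spectral vector.
        self.spatial_embed = nn.Linear(bands, dim)
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=depth,
        )
        self.spectral_tf = make_encoder()
        self.spatial_tf = make_encoder()
        # Hypothetical late fusion: concatenate pooled branch features, then classify.
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, x):                             # x: (B, bands, patch, patch)
        spec_tokens = x.flatten(2)                    # (B, bands, patch*patch)
        spat_tokens = spec_tokens.transpose(1, 2)     # (B, patch*patch, bands)
        z_spec = self.spectral_tf(self.spectral_embed(spec_tokens)).mean(dim=1)
        z_spat = self.spatial_tf(self.spatial_embed(spat_tokens)).mean(dim=1)
        return self.head(torch.cat([z_spec, z_spat], dim=-1))


# Dummy usage: a batch of two 200-band, 9x9 hyperspectral patches -> 16-class logits.
model = FactorizedHSIEncoder()
logits = model(torch.randn(2, 200, 9, 9))             # shape (2, 16)
```

The point the abstract emphasizes is that the spatial token sequence gives the model access to spatial context that a spectral-only tokenization under-utilizes.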
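The abstract also mentions masked-image-modeling-style pretraining with efficient masking strategies for each branch. The function below is a generic random-token-masking sketch in that spirit (MAE-like), applicable to either the spectral or the spatial token sequence; the mask ratio, the per-branch masking scheme, and the reconstruction target are assumptions for illustration and may differ from the paper's actual pretraining procedure.

```python
import torch


def random_token_mask(tokens, mask_ratio=0.7):
    """Randomly hide a fraction of tokens, returning the visible subset and the mask.

    tokens: (B, N, D) -- e.g. spectral tokens (N = number of bands) or
    spatial tokens (N = number of pixels).
    Returns (visible_tokens, mask) where mask is True at masked positions.
    """
    b, n, d = tokens.shape
    n_keep = int(n * (1.0 - mask_ratio))
    noise = torch.rand(b, n, device=tokens.device)      # per-sample random scores
    ids_keep = noise.argsort(dim=1)[:, :n_keep]         # lowest scores stay visible
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)                     # 0 marks visible tokens
    return visible, mask.bool()


# During pretraining, only the visible tokens would be encoded and the masked
# ones reconstructed (e.g. with an L2 loss against the original token values).
vis, mask = random_token_mask(torch.randn(2, 200, 81), mask_ratio=0.7)
```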
Related papers
- Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z) - VST++: Efficient and Stronger Visual Saliency Transformer [74.26078624363274]
We develop an efficient and stronger VST++ model to explore global long-range dependencies.
We evaluate our model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets.
arXiv Detail & Related papers (2023-10-18T05:44:49Z) - Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - Transformers For Recognition In Overhead Imagery: A Reality Check [0.0]
We compare the impact of adding transformer structures into state-of-the-art segmentation models for overhead imagery.
Our results suggest that transformers provide consistent, but modest, performance improvements.
arXiv Detail & Related papers (2022-10-23T02:17:31Z) - Towards Data-Efficient Detection Transformers [77.43470797296906]
We show most detection transformers suffer from significant performance drops on small-size datasets.
We empirically analyze the factors that affect data efficiency, through a step-by-step transition from a data-efficient RCNN variant to the representative DETR.
We introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency.
arXiv Detail & Related papers (2022-03-17T17:56:34Z) - Gophormer: Ego-Graph Transformer for Node Classification [27.491500255498845]
In this paper, we propose a novel Gophormer model which applies transformers on ego-graphs instead of full-graphs.
Specifically, a Node2Seq module is proposed to sample ego-graphs as the input of the transformers, which alleviates the challenge of scalability.
In order to handle the uncertainty introduced by the ego-graph sampling, we propose a consistency regularization and a multi-sample inference strategy.
arXiv Detail & Related papers (2021-10-25T16:43:32Z) - Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.