Pyramid Hierarchical Transformer for Hyperspectral Image Classification
- URL: http://arxiv.org/abs/2404.14945v1
- Date: Tue, 23 Apr 2024 11:41:19 GMT
- Title: Pyramid Hierarchical Transformer for Hyperspectral Image Classification
- Authors: Muhammad Ahmad, Muhammad Hassaan Farooq Butt, Manuel Mazzara, Salvatore Distefano
- Abstract summary: We propose a pyramid-based hierarchical transformer (PyFormer).
This innovative approach organizes input data hierarchically into segments, each representing distinct abstraction levels.
Results underscore the superiority of the proposed method over traditional approaches.
- Score: 1.9427851979929982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The traditional Transformer model encounters challenges with variable-length input sequences, particularly in Hyperspectral Image Classification (HSIC), leading to efficiency and scalability concerns. To overcome this, we propose a pyramid-based hierarchical transformer (PyFormer). This innovative approach organizes input data hierarchically into segments, each representing distinct abstraction levels, thereby enhancing processing efficiency for lengthy sequences. At each level, a dedicated transformer module is applied, effectively capturing both local and global context. Spatial and spectral information flow within the hierarchy facilitates communication and abstraction propagation. Integration of outputs from different levels culminates in the final input representation. Experimental results underscore the superiority of the proposed method over traditional approaches. Additionally, the incorporation of disjoint samples augments robustness and reliability, thereby highlighting the potential of our approach in advancing HSIC. The source code is available at https://github.com/mahmad00/PyFormer.
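As a rough illustration of the hierarchy described in the abstract, the sketch below splits a spectral token sequence into segments, encodes each level with its own transformer, and fuses the per-level summaries. Module names, segment lengths, and the mean-pooled fusion are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of the pyramid-hierarchy idea, assuming hypothetical module names
# and segment sizes; it is not the released PyFormer code.
import torch
import torch.nn as nn


class PyramidLevel(nn.Module):
    """One abstraction level: split the band sequence into segments,
    encode each segment with a dedicated transformer, and pool it."""

    def __init__(self, dim: int, segment_len: int, num_heads: int = 4):
        super().__init__()
        self.segment_len = segment_len
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); seq_len is assumed divisible by segment_len.
        b, n, d = x.shape
        segments = x.view(b * (n // self.segment_len), self.segment_len, d)
        encoded = self.encoder(segments)                 # local context within a segment
        pooled = encoded.mean(dim=1)                     # one token per segment
        return pooled.view(b, n // self.segment_len, d)  # shorter, more abstract sequence


class PyFormerSketch(nn.Module):
    """Stack of levels with increasing abstraction; per-level summaries
    are concatenated into the final representation."""

    def __init__(self, dim: int = 64, segment_lens=(5, 4), num_classes: int = 16):
        super().__init__()
        self.levels = nn.ModuleList(PyramidLevel(dim, s) for s in segment_lens)
        self.head = nn.Linear(dim * len(segment_lens), num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        level_summaries = []
        for level in self.levels:
            x = level(x)                        # abstraction propagates upward
            level_summaries.append(x.mean(dim=1))
        return self.head(torch.cat(level_summaries, dim=-1))


# Example: 200 spectral bands embedded to 64-dim tokens, 16 land-cover classes.
tokens = torch.randn(8, 200, 64)
logits = PyFormerSketch()(tokens)   # (8, 16)
```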
Related papers
- MASSFormer: Mobility-Aware Spectrum Sensing using Transformer-Driven Tiered Structure [3.6194127685460553]
We develop a cooperative spectrum sensing method based on a mobility-aware transformer-driven tiered structure (MASSFormer).
Our method considers a dynamic scenario involving mobile primary users (PUs) and secondary users (SUs).
The proposed method is tested under imperfect reporting channel scenarios to show robustness.
arXiv Detail & Related papers (2024-09-26T05:25:25Z) - Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
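The interaction described above can be pictured with the following minimal sketch, in which queries from the current layer attend over keys and values concatenated from the current and the preceding layer. The block layout and names are assumptions, not the paper's exact formulation.

```python
# Rough sketch of skip-layer attention as summarized above; the residual and
# normalization layout are assumed for illustration.
import torch
import torch.nn as nn


class SkipLayerAttentionBlock(nn.Module):
    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, current: torch.Tensor, previous: torch.Tensor) -> torch.Tensor:
        # current, previous: (batch, seq_len, dim) hidden states of layer l and layer l-1.
        kv = torch.cat([current, previous], dim=1)         # keys/values from both layers
        out, _ = self.attn(query=current, key=kv, value=kv)
        return self.norm(current + out)                    # residual connection


x_prev = torch.randn(2, 16, 64)   # hidden states from the preceding layer
x_curr = torch.randn(2, 16, 64)   # hidden states from the current layer
fused = SkipLayerAttentionBlock()(x_curr, x_prev)          # (2, 16, 64)
```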
arXiv Detail & Related papers (2024-06-17T07:24:38Z) - AerialFormer: Multi-resolution Transformer for Aerial Image Segmentation [7.415370401064414]
We propose AerialFormer, which unifies Transformers at the contracting path with lightweight Multi-Dilated Convolutional Neural Networks (MD-CNNs) at the expanding path.
Our AerialFormer is designed as a hierarchical structure, in which the Transformer outputs multi-scale features and the MD-CNN decoder aggregates information across the multiple scales.
We have benchmarked AerialFormer on three common datasets including iSAID, LoveDA, and Potsdam.
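As a hedged illustration of the lightweight MD-CNN component mentioned above, the block below applies parallel 3x3 convolutions with different dilation rates and sums their outputs. The specific rates and the fusion by summation are assumptions, not AerialFormer's actual decoder design.

```python
# Illustrative multi-dilated convolution block; dilation rates are assumed.
import torch
import torch.nn as nn


class MultiDilatedConvBlock(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        # Parallel 3x3 convolutions with different dilation rates enlarge the
        # receptive field while keeping the spatial size unchanged.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Summing the branches preserves both spatial size and channel count.
        return self.act(sum(branch(x) for branch in self.branches))


feat = torch.randn(1, 64, 32, 32)         # a feature map from the encoder
out = MultiDilatedConvBlock(64)(feat)     # (1, 64, 32, 32)
```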
arXiv Detail & Related papers (2023-06-12T03:28:18Z) - A Contrastive Learning Scheme with Transformer Innate Patches [4.588028371034407]
We present Contrastive Transformer, a contrastive learning scheme using the Transformer innate patches.
The scheme performs supervised patch-level contrastive learning, selecting the patches based on the ground truth mask.
The scheme applies to all vision-transformer architectures, is easy to implement, and introduces minimal additional memory footprint.
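The patch-level scheme can be sketched as follows: each patch embedding receives a class label from the ground-truth mask, and patches sharing a label act as positives in a standard supervised contrastive loss. This loss is an assumed stand-in rather than the paper's exact objective.

```python
# Supervised patch-level contrastive loss sketch; the formulation below is a
# generic supervised contrastive loss, assumed for illustration.
import torch
import torch.nn.functional as F


def patch_supervised_contrastive_loss(patch_emb: torch.Tensor,
                                      patch_labels: torch.Tensor,
                                      temperature: float = 0.1) -> torch.Tensor:
    # patch_emb: (num_patches, dim), patch_labels: (num_patches,)
    z = F.normalize(patch_emb, dim=-1)
    sim = z @ z.t() / temperature                                   # pairwise similarities
    mask_pos = (patch_labels[:, None] == patch_labels[None, :]).float()
    mask_pos.fill_diagonal_(0)                                      # exclude self-pairs
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()     # numerical stability
    exp_logits = torch.exp(logits) * (1 - torch.eye(len(z), device=z.device))
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True))
    pos_counts = mask_pos.sum(dim=1).clamp(min=1)                   # patches without positives contribute 0
    loss_per_patch = -(mask_pos * log_prob).sum(dim=1) / pos_counts
    return loss_per_patch.mean()


emb = torch.randn(64, 128)              # 64 patch embeddings of dimension 128
labels = torch.randint(0, 4, (64,))     # per-patch class taken from the ground-truth mask
loss = patch_supervised_contrastive_loss(emb, labels)
```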
arXiv Detail & Related papers (2023-03-26T20:19:28Z) - Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer [13.71165050314854]
We present a new method for end-to-end Video Question Answering (VideoQA).
We achieve this with a pyramidal multimodal transformer (PMT) model, which simply incorporates a learnable word embedding layer.
We demonstrate better or on-par performance with high computational efficiency against state-of-the-art methods on five VideoQA benchmarks.
arXiv Detail & Related papers (2023-02-04T09:14:18Z) - Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection [86.94578023985677]
In this work, we rethink this task from the perspective of global information alignment and transformation.
Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path.
Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods.
arXiv Detail & Related papers (2021-12-04T15:45:34Z) - HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers for image-based person Re-ID with high performance.
This work is the first to take advantage of both CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.