THAT: Token-wise High-frequency Augmentation Transformer for Hyperspectral Pansharpening
- URL: http://arxiv.org/abs/2508.08183v1
- Date: Mon, 11 Aug 2025 17:03:10 GMT
- Title: THAT: Token-wise High-frequency Augmentation Transformer for Hyperspectral Pansharpening
- Authors: Hongkun Jin, Hongcheng Jiang, Zejun Zhang, Yuan Zhang, Jia Fu, Tingfeng Li, Kai Luo,
- Abstract summary: Transformer-based methods have demonstrated strong potential in hyperspectral pansharpening. Their effectiveness is often limited by redundant token representations and a lack of multi-scale feature modeling. We propose the Token-wise High-frequency Augmentation Transformer (THAT), a novel framework designed to enhance hyperspectral pansharpening.
- Score: 11.026691195419453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based methods have demonstrated strong potential in hyperspectral pansharpening by modeling long-range dependencies. However, their effectiveness is often limited by redundant token representations and a lack of multi-scale feature modeling. Hyperspectral images exhibit intrinsic spectral priors (e.g., abundance sparsity) and spatial priors (e.g., non-local similarity), which are critical for accurate reconstruction. From a spectral-spatial perspective, Vision Transformers (ViTs) face two major limitations: they struggle to preserve high-frequency components--such as material edges and texture transitions--and suffer from attention dispersion across redundant tokens. These issues stem from the global self-attention mechanism, which tends to dilute high-frequency signals and overlook localized details. To address these challenges, we propose the Token-wise High-frequency Augmentation Transformer (THAT), a novel framework designed to enhance hyperspectral pansharpening through improved high-frequency feature representation and token selection. Specifically, THAT introduces: (1) Pivotal Token Selective Attention (PTSA) to prioritize informative tokens and suppress redundancy; (2) a Multi-level Variance-aware Feed-forward Network (MVFN) to enhance high-frequency detail learning. Experiments on standard benchmarks show that THAT achieves state-of-the-art performance with improved reconstruction quality and efficiency. The source code is available at https://github.com/kailuo93/THAT.
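The abstract names the two modules but does not spell out their internals. The PyTorch-style sketch below only illustrates the general ideas: top-k token selection inside self-attention (the spirit of PTSA) and a variance-driven gate in the feed-forward block (the spirit of MVFN). All class names, shapes, ratios, and design choices here are assumptions, not the authors' implementation; the actual code is in the linked repository.

```python
# Illustrative sketch only -- NOT the authors' implementation of PTSA/MVFN.
# Assumes PyTorch; all shapes and design choices are guesses from the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKTokenAttention(nn.Module):
    """Self-attention that keeps only the k most informative keys per query,
    a rough analogue of Pivotal Token Selective Attention."""

    def __init__(self, dim, num_heads=4, keep_ratio=0.5):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.keep_ratio = keep_ratio
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)    # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # (B, H, N, N)

        # Keep only the top-k keys per query and mask the rest before softmax,
        # suppressing redundant tokens instead of diluting attention over them.
        k_keep = max(1, int(N * self.keep_ratio))
        topk = attn.topk(k_keep, dim=-1).indices
        mask = torch.full_like(attn, float('-inf')).scatter(-1, topk, 0.0)
        attn = (attn + mask).softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class VarianceAwareFFN(nn.Module):
    """Feed-forward block gated by per-token feature variance, a crude stand-in
    for a 'variance-aware' design: high-variance tokens tend to carry edges and
    texture transitions, i.e. high-frequency detail."""

    def __init__(self, dim, hidden=None):
        super().__init__()
        hidden = hidden or dim * 2
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.gate = nn.Linear(1, dim)           # maps token variance to a channel gate

    def forward(self, x):                       # x: (B, N, C)
        var = x.var(dim=-1, keepdim=True)       # per-token feature variance (B, N, 1)
        gate = torch.sigmoid(self.gate(var))    # emphasize high-variance (detailed) tokens
        return self.fc2(F.gelu(self.fc1(x))) * gate
```

A full block would presumably wrap these two pieces in residual connections as in a standard ViT layer; again, this is a guess at the structure the abstract implies, not the published design.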
Related papers
- Transformer-Based Person Search with High-Frequency Augmentation and Multi-Wave Mixing [18.871765626140782]
We propose a novel High-frequency Augmentation and Multi-Wave mixing (HAMW) method for person search. HAMW is designed to enhance the discriminative feature extraction capabilities of transformers. HAMW achieves state-of-the-art performance on both the CUHK-SYSU and PRW datasets.
arXiv Detail & Related papers (2025-06-29T12:08:26Z) - FADPNet: Frequency-Aware Dual-Path Network for Face Super-Resolution [70.61549422952193]
Face super-resolution (FSR) under limited computational costs remains an open problem. Existing approaches typically treat all facial pixels equally, resulting in suboptimal allocation of computational resources. We propose FADPNet, a Frequency-Aware Dual-Path Network that decomposes facial features into low- and high-frequency components (a minimal sketch of this kind of low/high split follows the related-papers list below).
arXiv Detail & Related papers (2025-06-17T02:33:42Z) - FreEformer: Frequency Enhanced Transformer for Multivariate Time Series Forecasting [17.738942892605234]
This paper presents FreEformer, a simple yet effective model that leverages a Frequency Enhanced Transformer. Experiments demonstrate that FreEformer consistently outperforms state-of-the-art models on eighteen real-world benchmarks.
arXiv Detail & Related papers (2025-01-23T08:53:45Z) - Identity-Preserving Text-to-Video Generation by Frequency Decomposition [52.19475797580653]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human identity consistent in the generated video.
arXiv Detail & Related papers (2024-11-26T13:58:24Z) - Spiking Wavelet Transformer [1.8712213089437697]
Spiking neural networks (SNNs) offer an energy-efficient alternative to conventional deep learning.
Transformers with SNNs have shown promise for accuracy, but struggle to learn high-frequency patterns.
We propose the Spiking Wavelet Transformer (SWformer), an attention-free architecture that effectively learns comprehensive spatial-frequency features in a spike-driven manner.
arXiv Detail & Related papers (2024-03-17T08:41:48Z) - ProtoHPE: Prototype-guided High-frequency Patch Enhancement for Visible-Infrared Person Re-identification [16.634909655008254]
Cross-modal correlated high-frequency components are less affected by variations such as wavelength, pose, and background clutter than holistic images.
We propose Prototype-guided High-frequency Patch Enhancement (ProtoHPE) with two core designs.
arXiv Detail & Related papers (2023-10-11T14:54:40Z) - Exploring Frequency-Inspired Optimization in Transformer for Efficient Single Image Super-Resolution [32.29219284419944]
We propose the cross-refinement adaptive feature modulation transformer (CRAFT). We further introduce a frequency-guided post-training quantization (PTQ) method aimed at enhancing CRAFT's efficiency. Our experimental findings showcase CRAFT's superiority over current state-of-the-art methods.
arXiv Detail & Related papers (2023-08-09T15:38:36Z) - Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning [138.29273453811945]
Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks.
We propose a new Wavelet Vision Transformer (Wave-ViT) that formulates the invertible down-sampling with wavelet transforms and self-attention learning.
arXiv Detail & Related papers (2022-07-11T16:03:51Z) - Inception Transformer [151.939077819196]
Inception Transformer, or iFormer, learns comprehensive features with both high- and low-frequency information in visual data.
We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation.
arXiv Detail & Related papers (2022-05-25T17:59:54Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - TFill: Image Completion via a Transformer-Based Architecture [69.62228639870114]
We propose treating image completion as a directionless sequence-to-sequence prediction task.
We employ a restrictive CNN with small and non-overlapping receptive fields (RF) for token representation.
In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced.
arXiv Detail & Related papers (2021-04-02T01:42:01Z)
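Several of the listed papers (FADPNet, Wave-ViT, Inception Transformer, HAMW) rest on the same basic operation: splitting a feature map into low- and high-frequency parts and treating each differently. The sketch below is a generic, hedged illustration of that split; the average-pool low-pass filter and the dual-path routing comments are illustrative choices, not taken from any of the cited papers.

```python
# Generic low-/high-frequency split sketch (illustrative, not from any cited paper).
import torch
import torch.nn.functional as F


def frequency_split(x, kernel_size=3):
    """Split a feature map x of shape (B, C, H, W) into a smooth low-frequency
    part and a residual high-frequency part via average-pool low-pass filtering."""
    low = F.avg_pool2d(x, kernel_size, stride=1, padding=kernel_size // 2)
    high = x - low                      # edges, texture transitions, fine detail
    return low, high


# Example: route the two bands through different (hypothetical) branches.
x = torch.randn(2, 64, 32, 32)
low, high = frequency_split(x)
# low  -> e.g. a cheap global/attention branch (structure, illumination)
# high -> e.g. a local convolutional branch (edges, material boundaries)
```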
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.