Contextual Learning in Fourier Complex Field for VHR Remote Sensing Images
- URL: http://arxiv.org/abs/2210.15972v1
- Date: Fri, 28 Oct 2022 08:13:33 GMT
- Title: Contextual Learning in Fourier Complex Field for VHR Remote Sensing Images
- Authors: Yan Zhang, Xiyuan Gao, Qingyan Duan, Jiaxu Leng, Xiao Pu, Xinbo Gao
- Abstract summary: Transformer-based models have demonstrated outstanding potential for learning high-order contextual relationships from natural images at general resolutions (224x224 pixels).
We propose a complex self-attention (CSA) mechanism to model high-order contextual information with less than half the computation of naive SA.
By stacking various layers of CSA blocks, we propose the Fourier Complex Transformer (FCT) model to learn global contextual information from VHR aerial images.
- Score: 64.84260544255477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Very high-resolution (VHR) remote sensing (RS) image classification is the
fundamental task for RS image analysis and understanding. Recently,
transformer-based models demonstrated outstanding potential for learning
high-order contextual relationships from natural images with general resolution
(224x224 pixels) and achieved remarkable results on general image
classification tasks. However, the complexity of the naive transformer grows
quadratically with image size, which prevents transformer-based models from
being applied to VHR RS image (500x500 pixels) classification and other
computationally expensive downstream tasks. To this end, we propose to
decompose the expensive self-attention (SA) into real and imaginary parts via
discrete Fourier transform (DFT) and therefore propose an efficient complex
self-attention (CSA) mechanism. Benefiting from the conjugate-symmetric
property of the DFT, CSA can model high-order contextual information with less
than half the computation of naive SA. To overcome gradient explosion in the
Fourier complex field, we replace the Softmax function with a carefully
designed Logmax function that normalizes the attention map of CSA and
stabilizes gradient propagation. By stacking multiple layers of CSA blocks, we
propose the Fourier Complex Transformer (FCT) model to learn global contextual
information from VHR aerial images in a hierarchical manner. Extensive
experiments conducted on commonly used RS classification datasets demonstrate
the effectiveness and efficiency of FCT, especially on very high-resolution RS
images.
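The core idea in the abstract can be sketched as follows: compute attention over the real FFT of the input, whose conjugate symmetry keeps only about half the frequency tokens, and normalize the scores with a log-based function instead of Softmax. This is a minimal illustration, not the authors' implementation; the projection shapes, the Hermitian score product, and the exact Logmax formula (here a hypothetical log-compressed, sum-to-one normalization) are all assumptions.

```python
import numpy as np

def logmax(x, axis=-1, eps=1e-6):
    # Hypothetical stand-in for the paper's Logmax: log-compress the
    # magnitudes, then normalize rows to sum to 1. The paper's exact
    # formulation may differ; the goal is only to avoid the large
    # exponentials of Softmax on complex-valued scores.
    z = np.log1p(np.abs(x)) + eps
    return z / z.sum(axis=axis, keepdims=True)

def complex_self_attention(x, Wq, Wk, Wv):
    """Sketch of Fourier-domain self-attention on a real sequence x (n, d).

    The real FFT keeps only the non-redundant half of the spectrum
    (conjugate symmetry), so attention runs over n//2 + 1 frequency
    tokens instead of n spatial tokens.
    """
    n, d = x.shape
    X = np.fft.rfft(x, axis=0)               # (n//2 + 1, d), complex
    q, k, v = X @ Wq, X @ Wk, X @ Wv         # complex-valued projections
    scores = (q @ k.conj().T) / np.sqrt(d)   # Hermitian inner products
    attn = logmax(scores, axis=-1)           # real-valued attention weights
    out = attn @ v                           # (n//2 + 1, d), complex
    return np.fft.irfft(out, n=n, axis=0)    # back to the spatial domain

rng = np.random.default_rng(0)
n, d = 16, 8
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
y = complex_self_attention(x, Wq, Wk, Wv)
print(y.shape)  # (16, 8)
```

Since the score matrix is (n//2 + 1) x (n//2 + 1) rather than n x n, the attention product costs roughly a quarter of the spatial-domain version, consistent with the "less than half the computation" claim once the FFT overhead is included.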
Related papers
- Effective Diffusion Transformer Architecture for Image Super-Resolution [63.254644431016345]
We design an effective diffusion transformer for image super-resolution (DiT-SR)
In practice, DiT-SR leverages an overall U-shaped architecture, and adopts a uniform isotropic design for all the transformer blocks.
We analyze the limitation of the widely used AdaLN, and present a frequency-adaptive time-step conditioning module.
arXiv Detail & Related papers (2024-09-29T07:14:16Z)
- Multi-Scale Representation Learning for Image Restoration with State-Space Model [13.622411683295686]
We propose a novel Multi-Scale State-Space Model-based (MS-Mamba) for efficient image restoration.
Our proposed method achieves new state-of-the-art performance while maintaining low computational complexity.
arXiv Detail & Related papers (2024-08-19T16:42:58Z)
- Task-Aware Dynamic Transformer for Efficient Arbitrary-Scale Image Super-Resolution [8.78015409192613]
Arbitrary-scale super-resolution (ASSR) aims to learn a single model for image super-resolution at arbitrary magnifying scales.
Existing ASSR networks typically comprise an off-the-shelf scale-agnostic feature extractor and an arbitrary scale upsampler.
We propose a Task-Aware Dynamic Transformer (TADT) as an input-adaptive feature extractor for efficient image ASSR.
arXiv Detail & Related papers (2024-08-16T13:35:52Z)
- Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)
- Misalignment-Robust Frequency Distribution Loss for Image Transformation [51.0462138717502]
This paper aims to address a common challenge in deep learning-based image transformation methods, such as image enhancement and super-resolution.
We introduce a novel and simple Frequency Distribution Loss (FDL) for computing distribution distance within the frequency domain.
Our method is empirically proven effective as a training constraint due to the thoughtful utilization of global information in the frequency domain.
arXiv Detail & Related papers (2024-02-28T09:27:41Z)
- AICT: An Adaptive Image Compression Transformer [18.05997169440533]
We propose a more straightforward yet effective Transformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT).
The proposed ICT can capture both global and local contexts from the latent representations.
We leverage a learnable scaling module with a sandwich ConvNeXt-based pre/post-processor to accurately extract more compact latent representation.
arXiv Detail & Related papers (2023-07-12T11:32:02Z)
- Recursive Generalization Transformer for Image Super-Resolution [108.67898547357127]
We propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images.
We combine the RG-SA with local self-attention to enhance the exploitation of the global context.
Our RGT outperforms recent state-of-the-art methods quantitatively and qualitatively.
arXiv Detail & Related papers (2023-03-11T10:44:44Z)
- Implicit Transformer Network for Screen Content Image Continuous Super-Resolution [27.28782217250359]
High-resolution (HR) screen contents may be downsampled and compressed.
Super-resolution (SR) of low-resolution (LR) screen content images (SCIs) is highly demanded by the HR display or by the users to zoom in for detail observation.
We propose a novel Implicit Transformer Super-Resolution Network (ITSRN) for SCISR.
arXiv Detail & Related papers (2021-12-12T07:39:37Z)
- TFill: Image Completion via a Transformer-Based Architecture [69.62228639870114]
We propose treating image completion as a directionless sequence-to-sequence prediction task.
We employ a restrictive CNN with small, non-overlapping receptive fields (RF) for token representation.
In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced.
arXiv Detail & Related papers (2021-04-02T01:42:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.