Related papers: Analyzing the Training Dynamics of Image Restoration Transformers: A Revisit to Layer Normalization

Analyzing the Training Dynamics of Image Restoration Transformers: A Revisit to Layer Normalization

URL: http://arxiv.org/abs/2504.06629v2
Date: Thu, 26 Jun 2025 01:23:27 GMT
Title: Analyzing the Training Dynamics of Image Restoration Transformers: A Revisit to Layer Normalization
Authors: MinKyu Lee, Sangeek Hyun, Woojin Jun, Hyunjun Kim, Jiwoo Chung, Jae-Pil Heo,
Abstract summary: Conventional LayerNorm leads feature magnitude divergence, up to a million scale, and collapses channel-wise entropy.<n>We introduce Image Restoration Transformer Tailored Layer Normalization(i-LN), a surprisingly simple drop-in replacement for conventional LayerNorm.
Score: 20.67671141789497
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work investigates the internal training dynamics of image restoration~(IR) Transformers and uncovers a critical yet overlooked issue: conventional LayerNorm leads feature magnitude divergence, up to a million scale, and collapses channel-wise entropy. We analyze this phenomenon from the perspective of networks attempting to bypass constraints imposed by conventional LayerNorm due to conflicts against requirements in IR tasks. Accordingly, we address two misalignments between LayerNorm and IR tasks, and later show that addressing these mismatches leads to both stabilized training dynamics and improved IR performance. Specifically, conventional LayerNorm works in a per-token manner, disrupting spatial correlations between tokens, essential in IR tasks. Also, it employs an input-independent normalization that restricts the flexibility of feature scales, required to preserve input-specific statistics. Together, these mismatches significantly hinder IR Transformer's ability to accurately preserve low-level features throughout the network. To this end, we introduce Image Restoration Transformer Tailored Layer Normalization~(i-LN), a surprisingly simple drop-in replacement for conventional LayerNorm. We propose to normalize features in a holistic manner across the entire spatio-channel dimension, preserving spatial relationships among individual tokens. Additionally, we introduce an input-adaptive rescaling strategy that maintains the feature range flexibility required by individual inputs. Together, these modifications effectively contribute to preserving low-level feature statistics of inputs throughout IR Transformers. Experimental results verify that this combined strategy enhances both the stability and performance of IR Transformers across various IR tasks.

Related papers

Generalizing GNNs with Tokenized Mixture of Experts [75.8310720413187]
We show that improving stability requires reducing reliance on shift-sensitive features, leaving an irreducible worst-case generalization floor.<n>We propose STEM-GNN, a pretrain-then-finetune framework with a mixture-of-experts encoder for diverse computation paths.<n>Across nine node, link, and graph benchmarks, STEM-GNN achieves a stronger three-way balance, improving robustness to degree/homophily shifts and to feature/edge corruptions while remaining competitive on clean graphs.
arXiv Detail & Related papers (2026-02-09T22:48:30Z)
Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping [61.459927600301654]
Multi-condition control is bottlenecked by the conventional concatenate-and-attend'' strategy.<n>Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant.<n>We propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies.
arXiv Detail & Related papers (2026-02-06T16:39:10Z)
Rotation Equivariant Arbitrary-scale Image Super-Resolution [62.41329042683779]
The arbitrary-scale image super-resolution (ASISR) aims to achieve arbitrary-scale high-resolution recoveries from a low-resolution input image.<n>We make efforts to construct a rotation equivariant ASISR method in this study.
arXiv Detail & Related papers (2025-08-07T08:51:03Z)
Manifold-aware Representation Learning for Degradation-agnostic Image Restoration [135.90908995927194]
Image Restoration (IR) aims to recover high quality images from degraded inputs affected by various corruptions such as noise, blur, haze, rain, and low light conditions.<n>We present MIRAGE, a unified framework for all in one IR that explicitly decomposes the input feature space into three semantically aligned parallel branches.<n>This modular decomposition significantly improves generalization and efficiency across diverse degradations.
arXiv Detail & Related papers (2025-05-24T12:52:10Z)
Is Attention Required for Transformer Inference? Explore Function-preserving Attention Replacement [13.38679135071682]
We propose a Function-preserving Attention Replacement framework that replaces all attention blocks in pretrained transformers with learnable sequence-to-sequence modules.<n>We validate FAR on the DeiT vision transformer family and demonstrate that it matches the accuracy of the original models on ImageNet and multiple downstream tasks with reduced parameters and latency.
arXiv Detail & Related papers (2025-05-24T02:23:46Z)
Cross Paradigm Representation and Alignment Transformer for Image Deraining [40.66823807648992]
We propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer)<n>Its core idea is the hierarchical representation and alignment, leveraging the strengths of both paradigms to aid image reconstruction.<n>We use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA)
arXiv Detail & Related papers (2025-04-23T06:44:46Z)
RSRWKV: A Linear-Complexity 2D Attention Mechanism for Efficient Remote Sensing Vision Task [20.16344973940904]
High-resolution remote sensing analysis faces challenges due to scene complexity and scale diversity.<n>We propose RSRWKV, featuring a novel 2D-WKV scanning mechanism that bridges sequential processing and 2D spatial reasoning.
arXiv Detail & Related papers (2025-03-26T10:03:46Z)
Transformer Meets Twicing: Harnessing Unattended Residual Information [2.1605931466490795]
Transformer-based deep learning models have achieved state-of-the-art performance across numerous language and vision tasks.<n>While the self-attention mechanism has proven capable of handling complex data patterns, it has been observed that the representational capacity of the attention matrix degrades significantly across transformer layers.<n>We propose the Twicing Attention, a novel attention mechanism that uses kernel twicing procedure in nonparametric regression to alleviate the low-pass behavior of associated NLM smoothing.
arXiv Detail & Related papers (2025-03-02T01:56:35Z)
DINT Transformer [5.990713912057883]
DIFF Transformer addresses the issue of irrelevant context interference by introducing a differential attention mechanism.<n>We propose DINT Transformer, which extends DIFF Transformer by incorporating a differential-integral mechanism.
arXiv Detail & Related papers (2025-01-29T08:53:29Z)
Sharing Key Semantics in Transformer Makes Efficient Image Restoration [148.22790334216117]
Self-attention mechanism, a cornerstone of Vision Transformers (ViTs) tends to encompass all global cues.<n>Small segments of a degraded image, particularly those closely aligned semantically, provide particularly relevant information to aid in the restoration process.<n>We propose boosting IR's performance by sharing the key semantics via Transformer for IR (ie, SemanIR) in this paper.
arXiv Detail & Related papers (2024-05-30T12:45:34Z)
Learning Exhaustive Correlation for Spectral Super-Resolution: Where Spatial-Spectral Attention Meets Linear Dependence [26.1694389791047]
Spectral super-resolution aims to recover hyperspectral image (HSI) from easily obtainable RGB image. Two types of bottlenecks in existing Transformers limit performance improvement and practical applications. We propose a novel Exhaustive Correlation Transformer (ECT) for spectral super-resolution.
arXiv Detail & Related papers (2023-12-20T08:30:07Z)
Dual Aggregation Transformer for Image Super-Resolution [92.41781921611646]
We propose a novel Transformer model, Dual Aggregation Transformer, for image SR. Our DAT aggregates features across spatial and channel dimensions, in the inter-block and intra-block dual manner. Our experiments show that our DAT surpasses current methods.
arXiv Detail & Related papers (2023-08-07T07:39:39Z)
ESSAformer: Efficient Transformer for Hyperspectral Image Super-resolution [76.7408734079706]
Single hyperspectral image super-resolution (single-HSI-SR) aims to restore a high-resolution hyperspectral image from a low-resolution observation. We propose ESSAformer, an ESSA attention-embedded Transformer network for single-HSI-SR with an iterative refining structure.
arXiv Detail & Related papers (2023-07-26T07:45:14Z)
Reciprocal Attention Mixing Transformer for Lightweight Image Restoration [6.3159191692241095]
We propose a lightweight image restoration network, Reciprocal Attention Mixing Transformer (RAMiT) It employs bi-dimensional (spatial and channel) self-attentions in parallel with different numbers of multi-heads. It achieves state-of-the-art performance on multiple lightweight IR tasks, including super-resolution, color denoising, grayscale denoising, low-light enhancement, and deraining.
arXiv Detail & Related papers (2023-05-19T06:55:04Z)
Image Deblurring by Exploring In-depth Properties of Transformer [86.7039249037193]
We leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing the performance measured by the quantitative metrics. By comparing the transformer features between recovered image and target one, the pretrained transformer provides high-resolution blur-sensitive semantic information. One regards the features as vectors and computes the discrepancy between representations extracted from recovered image and target one in Euclidean space.
arXiv Detail & Related papers (2023-03-24T14:14:25Z)
Exploring Invariant Representation for Visible-Infrared Person Re-Identification [77.06940947765406]
Cross-spectral person re-identification, which aims to associate identities to pedestrians across different spectra, faces a main challenge of the modality discrepancy. In this paper, we address the problem from both image-level and feature-level in an end-to-end hybrid learning framework named robust feature mining network (RFM) Experiment results on two standard cross-spectral person re-identification datasets, RegDB and SYSU-MM01, have demonstrated state-of-the-art performance.
arXiv Detail & Related papers (2023-02-02T05:24:50Z)
Score-based Causal Representation Learning with Interventions [54.735484409244386]
This paper studies the causal representation learning problem when latent causal variables are observed indirectly. The objectives are: (i) recovering the unknown linear transformation (up to scaling) and (ii) determining the directed acyclic graph (DAG) underlying the latent variables.
arXiv Detail & Related papers (2023-01-19T18:39:48Z)
Large-scale Global Low-rank Optimization for Computational Compressed Imaging [8.594666859332124]
We present the global low-rank (GLR) optimization technique, realizing highly-efficient large-scale reconstruction with global self-similarity. Inspired by the self-attention mechanism in deep learning, GLR extracts image patches by feature detection instead of conventional uniform selection. We experimentally demonstrate GLR's effectiveness on temporal, frequency, and spectral dimensions.
arXiv Detail & Related papers (2023-01-08T14:12:51Z)
CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning. The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery. The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
Stochastic Layers in Vision Transformers [85.38733795180497]
We introduce fully layers in vision transformers, without causing any severe drop in performance. The additionality boosts the robustness of visual features and strengthens privacy. We use our features for three different applications, namely, adversarial robustness, network calibration, and feature privacy.
arXiv Detail & Related papers (2021-12-30T16:07:59Z)
Video Super-Resolution Transformer [85.11270760456826]
Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem. Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling. In this paper, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information.
arXiv Detail & Related papers (2021-06-12T20:00:32Z)
Learning High-Precision Bounding Box for Rotated Object Detection via Kullback-Leibler Divergence [100.6913091147422]
Existing rotated object detectors are mostly inherited from the horizontal detection paradigm. In this paper, we are motivated to change the design of rotation regression loss from induction paradigm to deduction methodology.
arXiv Detail & Related papers (2021-06-03T14:29:19Z)
Multivariate Functional Regression via Nested Reduced-Rank Regularization [2.730097437607271]
We propose a nested reduced-rank regression (NRRR) approach in fitting regression model with multivariate functional responses and predictors. We show through non-asymptotic analysis that NRRR can achieve at least a comparable error rate to that of the reduced-rank regression. We apply NRRR in an electricity demand problem, to relate the trajectories of the daily electricity consumption with those of the daily temperatures.
arXiv Detail & Related papers (2020-03-10T14:58:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.