Related papers: SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

URL: http://arxiv.org/abs/2602.08064v1
Date: Sun, 08 Feb 2026 17:17:56 GMT
Title: SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm
Authors: Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, Gao Huang,
Abstract summary: Modern Transformers predominantly adopt the Pre-Norm paradigm for its optimization stability, foregoing the superior potential of the unstable Post-Norm architecture.<n>We propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters.<n>This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre-Norm and Post-Norm.
Score: 31.43772956034752
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern Transformers predominantly adopt the Pre-Norm paradigm for its optimization stability, foregoing the superior potential of the unstable Post-Norm architecture. Prior attempts to combine their strengths typically lead to a stability-performance trade-off. We attribute this phenomenon to a structural incompatibility within a single-stream design: Any application of the Post-Norm operation inevitably obstructs the clean identity gradient preserved by Pre-Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters. This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre-Norm and Post-Norm by enabling all residual blocks to receive combined gradients inherited from both paradigms, where one stream secures stability while the other enhances expressivity. Extensive pre-training experiments on 1.3B-parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at https://github.com/Qwen-Applications/SiameseNorm.

Related papers

The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers [9.020853139493239]
We argue that one cause for non-convergence and instability is their non-decaying step-size scheduling.<n>We propose a series of new attack algorithms that enforce Monotonically Decreasing Coordinate-wise Step-sizes within sign-based adversarials.
arXiv Detail & Related papers (2026-02-22T08:37:06Z)
PRISM: Parallel Residual Iterative Sequence Model [52.26239951489612]
We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension.<n>PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form.<n>We prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck.
arXiv Detail & Related papers (2026-02-11T12:39:41Z)
SpanNorm: Reconciling Training Stability and Performance in Deep Transformers [55.100133502295996]
We propose SpanNorm, a novel technique designed to resolve the dilemma by integrating the strengths of both paradigms.<n>We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network.<n> Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios.
arXiv Detail & Related papers (2026-01-30T05:21:57Z)
CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval [54.15776146365823]
Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text.<n>We propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components.
arXiv Detail & Related papers (2026-01-07T09:21:38Z)
Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration [52.82397287366076]
All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework.<n>In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information.<n>Our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections.
arXiv Detail & Related papers (2025-12-11T12:20:31Z)
Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness [9.013874391203453]
Adversarial examples reveal critical vulnerabilities in deep neural networks by exploiting their sensitivity to imperceptible input perturbations.<n>In this work, we investigate an architectural approach to adversarial robustness by embedding group-equivariant convolutions.<n>These layers encode symmetry priors that align model behavior with structured transformations in the input space, promoting smoother decision boundaries.
arXiv Detail & Related papers (2025-10-17T19:26:58Z)
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization [25.87557024380553]
We propose a simple yet effective hybrid normalization strategy that integrates the advantages of Pre-Norm and Post-Norm.<n>In experiments on large-scale transformer models, we show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches.<n>These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models.
arXiv Detail & Related papers (2025-03-06T16:40:48Z)
Stability of Primal-Dual Gradient Flow Dynamics for Multi-Block Convex Optimization Problems [2.66854711376491]
Proposed dynamics are based on the proximal augmented Lagrangian and they provide a viable alternative to ADMM.<n>We leverage various structural properties to establish global (exponential) convergence guarantees for the proposed dynamics.<n>Our assumptions are much weaker than those required to prove (exponential) stability of primal-dual dynamics.
arXiv Detail & Related papers (2024-08-28T17:43:18Z)
Improving Transferability of Adversarial Examples via Bayesian Attacks [68.90574788107442]
adversarial examples allows for the attack on unknown deep neural networks (DNNs)<n>In this paper, we improve the transferability of adversarial examples by incorporating the Bayesian formulation into both the model parameters and model input.<n>Experiments demonstrate that our method achieves a new state-of-the-art in transfer-based attacks.
arXiv Detail & Related papers (2023-07-21T03:43:07Z)
Nonparametric Generative Modeling with Conditional Sliced-Wasserstein Flows [101.31862036510701]
Sliced-Wasserstein Flow (SWF) is a promising approach to nonparametric generative modeling but has not been widely adopted due to its suboptimal generative quality and lack of conditional modeling capabilities. We propose Conditional Sliced-Wasserstein Flow (CSWF), a simple yet effective extension of SWF that enables nonparametric conditional modeling.
arXiv Detail & Related papers (2023-05-03T14:55:43Z)
Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs) GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations. We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.