SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm
- URL: http://arxiv.org/abs/2602.08064v1
- Date: Sun, 08 Feb 2026 17:17:56 GMT
- Title: SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm
- Authors: Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, Gao Huang,
- Abstract summary: Modern Transformers predominantly adopt the Pre-Norm paradigm for its optimization stability, foregoing the superior potential of the unstable Post-Norm architecture.<n>We propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters.<n>This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre-Norm and Post-Norm.
- Score: 31.43772956034752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern Transformers predominantly adopt the Pre-Norm paradigm for its optimization stability, foregoing the superior potential of the unstable Post-Norm architecture. Prior attempts to combine their strengths typically lead to a stability-performance trade-off. We attribute this phenomenon to a structural incompatibility within a single-stream design: Any application of the Post-Norm operation inevitably obstructs the clean identity gradient preserved by Pre-Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters. This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre-Norm and Post-Norm by enabling all residual blocks to receive combined gradients inherited from both paradigms, where one stream secures stability while the other enhances expressivity. Extensive pre-training experiments on 1.3B-parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
Related papers
- The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers [9.020853139493239]
We argue that one cause for non-convergence and instability is their non-decaying step-size scheduling.<n>We propose a series of new attack algorithms that enforce Monotonically Decreasing Coordinate-wise Step-sizes within sign-based adversarials.
arXiv Detail & Related papers (2026-02-22T08:37:06Z) - PRISM: Parallel Residual Iterative Sequence Model [52.26239951489612]
We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension.<n>PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form.<n>We prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck.
arXiv Detail & Related papers (2026-02-11T12:39:41Z) - SpanNorm: Reconciling Training Stability and Performance in Deep Transformers [55.100133502295996]
We propose SpanNorm, a novel technique designed to resolve the dilemma by integrating the strengths of both paradigms.<n>We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network.<n> Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios.
arXiv Detail & Related papers (2026-01-30T05:21:57Z) - CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval [54.15776146365823]
Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text.<n>We propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components.
arXiv Detail & Related papers (2026-01-07T09:21:38Z) - Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration [52.82397287366076]
All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework.<n>In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information.<n>Our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections.
arXiv Detail & Related papers (2025-12-11T12:20:31Z) - Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness [9.013874391203453]
Adversarial examples reveal critical vulnerabilities in deep neural networks by exploiting their sensitivity to imperceptible input perturbations.<n>In this work, we investigate an architectural approach to adversarial robustness by embedding group-equivariant convolutions.<n>These layers encode symmetry priors that align model behavior with structured transformations in the input space, promoting smoother decision boundaries.
arXiv Detail & Related papers (2025-10-17T19:26:58Z) - HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization [25.87557024380553]
We propose a simple yet effective hybrid normalization strategy that integrates the advantages of Pre-Norm and Post-Norm.<n>In experiments on large-scale transformer models, we show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches.<n>These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models.
arXiv Detail & Related papers (2025-03-06T16:40:48Z) - Stability of Primal-Dual Gradient Flow Dynamics for Multi-Block Convex Optimization Problems [2.66854711376491]
Proposed dynamics are based on the proximal augmented Lagrangian and they provide a viable alternative to ADMM.<n>We leverage various structural properties to establish global (exponential) convergence guarantees for the proposed dynamics.<n>Our assumptions are much weaker than those required to prove (exponential) stability of primal-dual dynamics.
arXiv Detail & Related papers (2024-08-28T17:43:18Z) - Improving Transferability of Adversarial Examples via Bayesian Attacks [68.90574788107442]
adversarial examples allows for the attack on unknown deep neural networks (DNNs)<n>In this paper, we improve the transferability of adversarial examples by incorporating the Bayesian formulation into both the model parameters and model input.<n>Experiments demonstrate that our method achieves a new state-of-the-art in transfer-based attacks.
arXiv Detail & Related papers (2023-07-21T03:43:07Z) - Nonparametric Generative Modeling with Conditional Sliced-Wasserstein
Flows [101.31862036510701]
Sliced-Wasserstein Flow (SWF) is a promising approach to nonparametric generative modeling but has not been widely adopted due to its suboptimal generative quality and lack of conditional modeling capabilities.
We propose Conditional Sliced-Wasserstein Flow (CSWF), a simple yet effective extension of SWF that enables nonparametric conditional modeling.
arXiv Detail & Related papers (2023-05-03T14:55:43Z) - Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs)
GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.