Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models
- URL: http://arxiv.org/abs/2601.09719v1
- Date: Fri, 26 Dec 2025 06:22:13 GMT
- Title: Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models
- Authors: Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song
- Abstract summary: We propose Bounded Hyperbolic Tanh (BHyT) as a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It achieves an average of 15.8% faster training and an average of 4.2% higher token generation throughput compared to RMSNorm.
- Score: 20.802982614533615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth: as layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces the second normalization with a lightweight variance approximation. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving an average of 15.8% faster training and an average of 4.2% higher token generation throughput compared to RMSNorm, while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: https://anonymous.4open.science/r/BHyT
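The abstract describes the mechanism in enough detail to sketch its shape: a tanh applied to an input that has been explicitly rescaled by a data-driven statistic, with exact statistics computed once per block and reused through a cheap variance approximation at the block's second normalization site. Below is a minimal PyTorch sketch under those assumptions; the class name `BoundedTanh`, the RMS-based bounding rule, the `bound` target, and the pass-in-`var` reuse pattern are illustrative guesses, not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of a BHyT-style layer, inferred from the abstract only.
# The bounding rule, names, and reuse of a precomputed `var` are assumptions.
import torch
import torch.nn as nn

class BoundedTanh(nn.Module):
    def __init__(self, dim, bound=1.0, eps=1e-6):
        super().__init__()
        self.bound = bound                          # target pre-tanh magnitude
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))  # learnable affine scale
        self.beta = nn.Parameter(torch.zeros(dim))  # learnable affine shift

    def forward(self, x, var=None):
        # "Exact statistics once per block": compute `var` exactly at the
        # first use; the block's second instance can receive a cheap
        # approximation instead of re-reducing over the hidden dimension.
        if var is None:
            var = x.pow(2).mean(dim=-1, keepdim=True)  # exact second moment
        # Data-driven input bounding: rescale so typical pre-activations sit
        # near `bound`, keeping tanh out of its saturating tails.
        z = self.bound * x * torch.rsqrt(var + self.eps)
        # Since |tanh| <= 1, outputs are elementwise bounded by
        # |gamma| + |beta|, so activation magnitude cannot grow with depth.
        return self.gamma * torch.tanh(z) + self.beta

# Used in place of a Pre-LN RMSNorm:
x = torch.randn(2, 16, 512)    # (batch, seq, hidden)
layer = BoundedTanh(512)
print(layer(x).abs().max())    # < 1.0 at initialization (gamma=1, beta=0)
```

For contrast, DyT applies tanh(alpha * x) with a purely learned scalar alpha; the data-driven statistic here adapts the bounding to the actual activation scale, which is how the abstract frames the stability advantage at depth.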
Related papers
- Efficient Inference after Directionally Stable Adaptive Experiments [47.32051320630248]
We study inference on pathwise differentiable targets after adaptive data collection, such as in a bandit. We introduce a novel target-specific condition, directional stability, which is strictly weaker than previously imposed parametric stability conditions.
arXiv Detail & Related papers (2026-02-25T01:09:18Z) - Plug-and-Play Homeostatic Spark: Zero-Cost Acceleration for SNN Training Across Paradigms [40.57310813106791]
Spiking neural networks offer event-driven computation, sparse activation, and hardware efficiency, yet their training often converges slowly and lacks stability. We present Adaptive Homeostatic Spiking Activity Regulation (AHSAR), an extremely simple plug-in method that applies across training paradigms. AHSAR stabilizes optimization and accelerates convergence without changing the model architecture, loss, or gradients.
arXiv Detail & Related papers (2025-12-04T17:26:46Z) - Leave-One-Out Stable Conformal Prediction [5.573524700758741]
We propose a novel method to speed up full conformal prediction using algorithmic stability, without sample splitting. By leveraging leave-one-out stability, our method is much faster at handling a large number of prediction requests; a sketch of the naive full-conformal baseline it accelerates appears after this list. Our method is theoretically justified and demonstrates superior numerical performance on synthetic and real-world data.
arXiv Detail & Related papers (2025-04-16T15:44:24Z) - Large Continual Instruction Assistant [59.585544987096974]
Continual Instruction Tuning (CIT) is adopted to instruct Large Models to follow human intent, data by data. Existing gradient updates heavily degrade performance on previous datasets during the CIT process. We propose a general continual instruction tuning framework to address this challenge.
arXiv Detail & Related papers (2024-10-08T11:24:59Z) - Revisiting Essential and Nonessential Settings of Evidential Deep Learning [70.82728812001807]
Evidential Deep Learning (EDL) is an emerging method for uncertainty estimation.
We propose Re-EDL, a simplified yet more effective variant of EDL.
arXiv Detail & Related papers (2024-10-01T04:27:07Z) - Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding [55.107555305760954]
We propose a conceptually simple yet effective method that attributes forgetting to layer-wise parameter overwriting and the resulting decision boundary distortion.
Our method achieves competitive accuracy while requiring no exemplar buffer and only 1.02x the size of the base model.
arXiv Detail & Related papers (2024-01-17T09:01:29Z) - Non-convex Bayesian Learning via Stochastic Gradient Markov Chain Monte Carlo [4.656426393230839]
The rise of artificial intelligence (AI) hinges on the efficient training of modern deep neural networks (DNNs), which must cope with non-convexity and uncertainty.
In this thesis, we propose tools to address the exploitation problem in Monte Carlo sampling.
We also propose two dynamic importance sampling algorithms for the underlying ordinary differential equation (ODE) system.
arXiv Detail & Related papers (2023-05-30T18:25:11Z) - A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding Networks [65.34977803841007]
Predictive coding networks are models with roots in both Bayesian statistics and neuroscience.
We show how simply changing the temporal scheduling of the synaptic weight update rule leads to an algorithm that is much more efficient and stable than the original one.
arXiv Detail & Related papers (2022-11-16T00:11:04Z) - Stability of Accuracy for the Training of DNNs Via the Uniform Doubling Condition [0.0]
We study the stability of accuracy during the training of deep neural networks (DNNs).
The goal of achieving stability of accuracy is to ensure that if accuracy is high at some initial time, it remains high throughout training.
arXiv Detail & Related papers (2022-10-16T02:42:42Z) - Feedback Gradient Descent: Efficient and Stable Optimization with Orthogonality for DNNs [3.42658286826597]
We propose a novel method, Feedback Gradient Descent (FGD); to our knowledge, this is the first work to achieve high efficiency and stability simultaneously.
In extensive image classification experiments, FGD comprehensively outperforms existing state-of-the-art methods in terms of accuracy, efficiency, and stability.
arXiv Detail & Related papers (2022-05-12T03:47:27Z) - SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z)
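To make the leave-one-out conformal entry above concrete, the sketch below shows the naive full-conformal regression baseline whose cost a leave-one-out stability argument removes: for every test point, the model is refit once per candidate label. The ridge model, candidate grid, and absolute-residual score are illustrative choices, not that paper's setup.

```python
# Naive full conformal regression: the expensive baseline that
# leave-one-out-stability methods accelerate. Model, grid, and score
# are illustrative assumptions.
import numpy as np

def ridge_fit_predict(X, y, lam=1.0):
    # Closed-form ridge regression; returns in-sample predictions.
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return X @ w

def full_conformal_interval(X, y, x_new, y_grid, alpha=0.1):
    n = len(y)
    accepted = []
    for y_cand in y_grid:                     # one full refit per candidate
        Xa = np.vstack([X, x_new])
        ya = np.append(y, y_cand)
        preds = ridge_fit_predict(Xa, ya)     # refit on augmented data
        scores = np.abs(ya - preds)           # nonconformity scores
        # Keep y_cand if the test point's score is not among the largest.
        rank = np.sum(scores <= scores[-1])
        if rank <= np.ceil((1 - alpha) * (n + 1)):
            accepted.append(y_cand)
    return (min(accepted), max(accepted)) if accepted else None

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=50)
print(full_conformal_interval(X, y, rng.normal(size=3), np.linspace(-8, 8, 161)))
```

Answering many prediction requests this way costs one refit per candidate per request; leave-one-out stability replaces those refits with perturbation bounds on a single fitted model, which is where the summarized speedup comes from.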