ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models
- URL: http://arxiv.org/abs/2410.09637v3
- Date: Sat, 16 Nov 2024 17:59:35 GMT
- Title: ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models
- Authors: Nandan Kumar Jha, Brandon Reagen
- Abstract summary: LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization.
This work explores desirable activation functions in normalization-free decoder-only LLMs.
ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement.
- Score: 3.7802450241986945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization. However, it introduces significant challenges in mechanistic interpretability, outlier feature suppression, faithful signal propagation, and computational and communication complexity of private inference. This work explores desirable activation functions in normalization-free decoder-only LLMs. Contrary to the conventional preference for the GELU in transformer-based models, our empirical findings demonstrate an opposite trend -- ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement. We discover a key issue with GELU, where early layers experience entropic overload, leading to the under-utilization of the representational capacity of attention heads. This highlights that smoother activations like GELU are ill-suited for LayerNorm-free architectures, whereas ReLU's geometrical properties -- specialization in input space and intra-class selectivity -- lead to improved learning dynamics and better information retention in the absence of LayerNorm. This study offers key insights for optimizing transformer architectures where LayerNorm introduces significant challenges. The code and implementation are available at https://github.com/Nandan91/relu-revival-normfree
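As a rough illustration of the "entropic overload" measurement described above, the sketch below computes a normalized per-head Shannon entropy of attention weights; a head whose average entropy sits near the maximum log(k_len) is spreading attention almost uniformly and under-using its capacity. This is an assumption-based reconstruction with our own function names, not the authors' released code (see the linked repository for that).

```python
# Minimal sketch (not the authors' code): per-head attention entropy, normalized
# to [0, 1] by its maximum value log(k_len). Values near 1 indicate near-uniform
# attention, the signature of "entropic overload".
import torch

def attention_entropy(attn_probs: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """attn_probs: (batch, heads, q_len, k_len) softmax outputs -> entropy per head."""
    entropy = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)           # (B, H, Q)
    max_entropy = torch.log(torch.tensor(attn_probs.size(-1), dtype=attn_probs.dtype))
    return (entropy / max_entropy).mean(dim=(0, 2))                          # one score per head

# Example: perfectly uniform attention saturates the entropy budget of every head.
uniform = torch.full((1, 2, 4, 4), 0.25)
print(attention_entropy(uniform))  # tensor close to [1.0, 1.0]
```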
Related papers
- EFC++: Elastic Feature Consolidation with Prototype Re-balancing for Cold Start Exemplar-free Incremental Learning [17.815956928177638]
We consider the challenging Cold Start scenario in which insufficient data is available in the first task to learn a high-quality backbone.
This is especially challenging for EFCIL since it requires high plasticity, resulting in feature drift.
We propose an effective approach to consolidate feature representations by regularizing drift in directions highly relevant to previous tasks.
arXiv Detail & Related papers (2025-03-13T15:01:19Z) - ReGLA: Refining Gated Linear Attention [42.97193398172823]
Linear attention has been designed to reduce the quadratic space-time complexity that is inherent in standard transformers.
We developed a feature mapping function to address some crucial issues that previous suggestions overlooked.
We also explored the saturation phenomenon of the gating mechanism and augmented it with a refining module.
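To make the mechanism concrete, here is a generic gated linear-attention recurrence of the kind ReGLA refines; the elu+1 feature map and sigmoid forget gate are illustrative stand-ins rather than the paper's actual feature mapping and gating design.

```python
# Hedged sketch of a gated linear-attention recurrence (generic form, not ReGLA's
# exact design): the quadratic attention matrix is replaced by a running
# (dim x dim) state that is decayed by a gate and updated with phi(k_t) v_t^T.
import torch
import torch.nn.functional as F

def gated_linear_attention(q, k, v, gate):
    """q, k, v, gate: (seq_len, dim) tensors; returns (seq_len, dim) outputs."""
    phi = lambda x: F.elu(x) + 1.0                          # positive feature map (illustrative)
    state = torch.zeros(k.size(-1), v.size(-1))             # running sum of phi(k_t) v_t^T
    outputs = []
    for t in range(q.size(0)):
        g = torch.sigmoid(gate[t]).unsqueeze(-1)            # forget gate in (0, 1), may saturate
        state = g * state + torch.outer(phi(k[t]), v[t])    # decay old state, add new pair
        outputs.append(phi(q[t]) @ state)                   # read out with the query features
    return torch.stack(outputs)
```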
arXiv Detail & Related papers (2025-02-03T18:03:13Z) - How to Alleviate Catastrophic Forgetting in LLMs Finetuning? Hierarchical Layer-Wise and Element-Wise Regularization [15.434072331989878]
Large Language Models (LLMs) exhibit strong general language capabilities.
Fine-tuning these models on domain-specific tasks often leads to catastrophic forgetting, where the model overwrites or loses essential knowledge acquired during pretraining.
We propose a novel approach to compute the element-wise importance of model parameters crucial for preserving general knowledge during fine-tuning.
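The paper's hierarchical layer-wise and element-wise scheme is not spelled out in this summary, so the sketch below shows only the general idea of element-wise parameter importance, using a classic squared-gradient (diagonal-Fisher, EWC-style) estimate as a stand-in.

```python
# Hedged sketch: squared-gradient importance scores and a drift penalty toward the
# pretrained weights. This is a generic EWC-style stand-in for illustration, not
# the paper's hierarchical layer-wise/element-wise regularizer.
import torch

def elementwise_importance(model, loss_fn, data_loader):
    """Accumulate squared gradients as a per-parameter importance score."""
    importance = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for batch in data_loader:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                importance[n] += p.grad.detach() ** 2
    return importance

def forgetting_penalty(model, pretrained_params, importance, strength=1.0):
    """Penalize drift of important parameters away from their pretrained values."""
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (importance[n] * (p - pretrained_params[n]) ** 2).sum()
    return strength * penalty
```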
arXiv Detail & Related papers (2025-01-23T13:54:53Z) - Hysteresis Activation Function for Efficient Inference [3.5223695602582614]
We propose a Hysteresis Rectified Linear Unit (HeLU) to address the "dying ReLU" problem with minimal complexity.
Unlike traditional activation functions with fixed thresholds for training and inference, HeLU employs a variable threshold that refines the gradients used in backpropagation.
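A minimal sketch of the hysteresis idea as summarized above: the forward pass behaves like plain ReLU, while the backward pass admits gradients above a relaxed (negative) threshold so inactive units can recover. The threshold value and exact backward rule here are assumptions for illustration, not the paper's definition.

```python
# Hedged sketch of a hysteresis ReLU: ReLU forward, relaxed threshold in backward.
import torch

class HysteresisReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, backward_threshold=-0.1):
        ctx.save_for_backward(x)
        ctx.backward_threshold = backward_threshold
        return torch.clamp(x, min=0.0)                      # standard ReLU output

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        mask = (x > ctx.backward_threshold).to(grad_output.dtype)
        return grad_output * mask, None                     # no gradient w.r.t. the threshold

x = torch.randn(8, requires_grad=True)
HysteresisReLU.apply(x, -0.1).sum().backward()              # gradients also flow for -0.1 < x <= 0
```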
arXiv Detail & Related papers (2024-11-15T20:46:58Z) - LOCAL: Learning with Orientation Matrix to Infer Causal Structure from Time Series Data [51.47827479376251]
LOCAL is a highly efficient, easy-to-implement, and constraint-free method for recovering dynamic causal structures.
LOCAL combines two modules: Asymptotic Causal Mask Learning (ACML) and Dynamic Graph Parameter Learning (DGPL).
Experiments on synthetic and real-world datasets demonstrate that LOCAL significantly outperforms existing methods.
arXiv Detail & Related papers (2024-10-25T10:48:41Z) - RecurFormer: Not All Transformer Heads Need Self-Attention [14.331807060659902]
Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference.
We propose RecurFormer, a novel architecture that replaces certain attention heads with linear recurrent neural networks.
arXiv Detail & Related papers (2024-10-10T15:24:12Z) - Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning [27.991291785091736]
We address the problem of large language model (LLM) unlearning, aiming to remove unwanted data influences and associated model capabilities.
We propose a simple yet effective unlearning optimization framework, called SimNPO, showing that 'simplicity' in removing the reliance on a reference model benefits unlearning.
arXiv Detail & Related papers (2024-10-09T17:58:12Z) - GLARE: Low Light Image Enhancement via Generative Latent Feature based Codebook Retrieval [80.96706764868898]
We present a new Low-light Image Enhancement (LLIE) network via Generative LAtent feature based codebook REtrieval (GLARE).
We develop a generative Invertible Latent Normalizing Flow (I-LNF) module to align the LL feature distribution to NL latent representations, guaranteeing the correct code retrieval in the codebook.
Experiments confirm the superior performance of GLARE on various benchmark datasets and real-world data.
arXiv Detail & Related papers (2024-07-17T09:40:15Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning.
Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques.
Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
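For reference, the two-point estimator at the core of ZO-SGD-style methods replaces backpropagation with finite differences along random directions; the benchmark studies a broader family of such estimators and optimizers. A minimal sketch:

```python
# Hedged sketch of the standard two-point zeroth-order gradient estimator:
# grad f(theta) ~ E_u[(f(theta + mu*u) - f(theta - mu*u)) / (2*mu) * u].
import numpy as np

def zo_gradient(f, theta, mu=1e-3, num_samples=16, seed=0):
    """Estimate grad f(theta) from function evaluations only (no backpropagation)."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        u = rng.standard_normal(theta.shape)                # random probe direction
        grad += (f(theta + mu * u) - f(theta - mu * u)) / (2.0 * mu) * u
    return grad / num_samples

# Example: f(x) = ||x||^2 has gradient 2x.
theta = np.array([1.0, -2.0, 0.5])
print(zo_gradient(lambda x: float(np.sum(x ** 2)), theta))  # roughly [2, -4, 1]
```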
arXiv Detail & Related papers (2024-02-18T14:08:48Z) - CRaSh: Clustering, Removing, and Sharing Enhance Fine-tuning without Full Large Language Model [22.870512676002463]
This paper focuses on Offsite-Tuning (OFT), a representative technique that transfers transformer blocks between centralized LLMs and downstream emulators.
Inspired by these observations, we propose CRaSh, involving Clustering, Removing, and Sharing, a training-free strategy to derive improved emulators from LLMs.
Our findings demonstrate a linear connectivity among these optima falling over the same basin, thereby highlighting the effectiveness of CRaSh and OFT.
arXiv Detail & Related papers (2023-10-24T03:08:58Z) - Low-Resolution Self-Attention for Semantic Segmentation [93.30597515880079]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost.
Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution.
We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure.
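The core computation can be sketched as follows: queries stay at full resolution while keys and values are pooled to a small fixed grid, so the attention cost stops growing quadratically with image size. The pool size and single-head form are assumptions for illustration, not LRFormer's exact configuration.

```python
# Hedged sketch of low-resolution self-attention: full-resolution queries attend
# over keys/values pooled to a fixed PxP grid.
import torch
import torch.nn.functional as F

def low_resolution_attention(x, pool_size=8):
    """x: (batch, channels, height, width) feature map -> same-shaped output."""
    b, c, h, w = x.shape
    q = x.flatten(2).transpose(1, 2)                                     # (B, H*W, C)
    kv = F.adaptive_avg_pool2d(x, pool_size).flatten(2).transpose(1, 2)  # (B, P*P, C)
    attn = torch.softmax(q @ kv.transpose(1, 2) / c ** 0.5, dim=-1)      # (B, H*W, P*P)
    out = attn @ kv                                                      # (B, H*W, C)
    return out.transpose(1, 2).reshape(b, c, h, w)
```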
arXiv Detail & Related papers (2023-10-08T06:10:09Z) - GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization [52.55628139825667]
Federated Learning (FL) has emerged as a promising distributed machine learning framework to preserve clients' privacy.
Recent studies find that an attacker can invert the shared gradients and recover sensitive data against an FL system by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge.
We propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers.
arXiv Detail & Related papers (2023-08-09T04:34:21Z) - Efficient Semantic Image Synthesis via Class-Adaptive Normalization [116.63715955932174]
Class-adaptive normalization (CLADE) is a lightweight but equally-effective variant that is only adaptive to semantic class.
We introduce intra-class positional map encoding calculated from semantic layouts to modulate the normalization parameters of CLADE.
The proposed CLADE can be generalized to different SPADE-based methods while achieving comparable generation quality compared to SPADE.
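The basic idea can be sketched as a per-class lookup of modulation parameters applied after a parameter-free normalization, rather than predicting them from the layout with convolutions as SPADE does; the normalization choice and the omission of the positional-map term below are simplifications, not the paper's exact formulation.

```python
# Hedged sketch of class-adaptive modulation: gamma/beta are looked up per semantic
# class and applied to normalized features. The intra-class positional-map term
# mentioned above is omitted here.
import torch
import torch.nn as nn

class ClassAdaptiveModulation(nn.Module):
    def __init__(self, num_classes, num_channels):
        super().__init__()
        self.gamma = nn.Embedding(num_classes, num_channels)    # per-class scale
        self.beta = nn.Embedding(num_classes, num_channels)     # per-class shift
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)

    def forward(self, x, label_map):
        """x: (B, C, H, W) features; label_map: (B, H, W) integer class ids."""
        gamma = self.gamma(label_map).permute(0, 3, 1, 2)       # (B, C, H, W)
        beta = self.beta(label_map).permute(0, 3, 1, 2)
        return self.norm(x) * (1 + gamma) + beta
```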
arXiv Detail & Related papers (2020-12-08T18:59:32Z) - Evolving Normalization-Activation Layers [100.82879448303805]
We develop efficient rejection protocols to quickly filter out candidate layers that do not work well.
Our method leads to the discovery of EvoNorms, a set of new normalization-activation layers with novel, and sometimes surprising structures.
Our experiments show that EvoNorms work well on image classification models including ResNets, MobileNets and EfficientNets.
arXiv Detail & Related papers (2020-04-06T19:52:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.