A Theoretical Explanation of Activation Sparsity through Flat Minima and
Adversarial Robustness
- URL: http://arxiv.org/abs/2309.03004v4
- Date: Thu, 26 Oct 2023 15:02:02 GMT
- Title: A Theoretical Explanation of Activation Sparsity through Flat Minima and
Adversarial Robustness
- Authors: Ze Peng, Lei Qi, Yinghuan Shi, Yang Gao
- Abstract summary: A recent empirical observation of activation sparsity in MLP blocks offers an opportunity to drastically reduce computation costs for free.
We propose the notion of gradient sparsity as one source of activation sparsity and a theoretical explanation based on it.
- Score: 29.87592869483743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recent empirical observation (Li et al., 2022b) of activation sparsity in
MLP blocks offers an opportunity to drastically reduce computation costs for
free. Although existing theoretical explanations attribute activation sparsity
to training dynamics, they are restricted to shallow networks, small numbers of
training steps, and special training regimes, despite the phenomenon's emergence
in deep models standardly trained for a large number of steps. To fill these
gaps, we propose the notion of gradient sparsity as one source of activation
sparsity and a theoretical explanation based on it that sees sparsity as a
necessary step toward adversarial robustness w.r.t. hidden features and
parameters, which is approximately the flatness of minima for well-learned
models. The theory
applies to standardly trained LayerNorm-ed MLPs, and further to Transformers or
other architectures trained with weight noises. Eliminating other sources of
flatness except for sparsity, we discover the phenomenon that the ratio between
the largest and smallest non-zero singular values of weight matrices is small.
When discussing the emergence of this spectral concentration, we use random
matrix theory (RMT) as a powerful tool to analyze stochastic gradient noises.
Validation experiments are conducted to verify our gradient-sparsity-based
explanation. We propose two plug-and-play modules for both training and
finetuning to promote sparsity. Experiments on ImageNet-1k and C4 demonstrate
50% improvements in sparsity, indicating further potential cost reduction in both
training and inference.
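
As a concrete illustration of the two quantities the abstract emphasizes, the
sketch below (not the authors' code; the block dimensions, the ReLU choice, and
the singular-value tolerance are assumptions) measures the activation sparsity
of a LayerNorm-ed MLP block and the ratio between the largest and smallest
non-zero singular values of its first weight matrix.

```python
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """A generic pre-LayerNorm MLP block, as found in Transformer encoders."""
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        h = self.act(self.fc1(self.norm(x)))   # hidden activations of interest
        return x + self.fc2(h), h

def activation_sparsity(h, eps=1e-8):
    """Fraction of (near-)zero entries in the hidden activations."""
    return (h.abs() <= eps).float().mean().item()

def spectral_concentration(weight, tol=1e-6):
    """Ratio of the largest to the smallest *non-zero* singular value."""
    s = torch.linalg.svdvals(weight)
    s_nonzero = s[s > tol * s.max()]
    return (s_nonzero.max() / s_nonzero.min()).item()

block = MLPBlock()
x = torch.randn(32, 128, 256)            # (batch, tokens, d_model)
_, h = block(x)
print("activation sparsity:", activation_sparsity(h))
print("sigma_max / sigma_min (fc1):", spectral_concentration(block.fc1.weight))
```

On a randomly initialized block these statistics are unremarkable; the paper's
claims concern their values after standard training, where high activation
sparsity and a small singular-value ratio are expected to emerge.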
Related papers
- R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs.
Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z) - Understanding Flatness in Generative Models: Its Role and Benefits [9.775257597631244]
We investigate the role of loss surface flatness in generative models, both theoretically and empirically.
We establish a theoretical claim that flatter minima improve robustness against perturbations in target prior distributions.
We demonstrate that flat minima in diffusion models improve not only generative performance but also robustness.
arXiv Detail & Related papers (2025-03-14T04:38:53Z) - Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective [55.90119819642064]
We address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective.
A key obstacle is the cumulative effect of reconstruction errors throughout the sparsification process.
We derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue.
arXiv Detail & Related papers (2025-02-20T17:51:10Z) - Learning Neural Networks with Sparse Activations [42.88109060676769]
In transformer networks, the activations in the hidden layer of the MLP block tend to be extremely sparse on any given input.
Unlike traditional forms of sparsity, where neurons or weights can be deleted from the network, this form of activation sparsity appears to be harder to exploit.
We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts.
arXiv Detail & Related papers (2024-06-26T00:11:13Z) - Towards Training Without Depth Limits: Batch Normalization Without
Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch normalization that provably has bounded gradients at any depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
arXiv Detail & Related papers (2023-10-03T12:35:02Z) - The Emergence of Essential Sparsity in Large Pre-trained Models: The
Weights that Matter [113.35761858962522]
This paper studies induced sparse patterns across multiple large pre-trained vision and language transformers.
We propose the existence of essential sparsity, defined by a sharp dropping point beyond which performance declines much faster.
We also find essential sparsity to hold valid for N:M sparsity patterns as well as on modern-scale large language models.
arXiv Detail & Related papers (2023-06-06T15:49:09Z) - Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of "student" models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z) - Compact Model Training by Low-Rank Projection with Energy Transfer [13.446719541044663]
Low-rankness plays an important role in traditional machine learning, but is not so popular in deep learning.
Previous low-rank network compression methods compress networks by approximating pre-trained models and re-training.
We devise a new training method, low-rank projection with energy transfer, that trains low-rank compressed networks from scratch and achieves competitive performance.
arXiv Detail & Related papers (2022-04-12T06:53:25Z) - Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight reparameterisation for neural networks that leads to inherently sparse models (a sketch of the reparameterisation idea appears after this list).
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z) - Autocalibration and Tweedie-dominance for Insurance Pricing with Machine
Learning [0.0]
It is shown that minimizing deviance involves a trade-off between the integral of weighted differences of lower partial moments and the bias measured on a specific scale.
This new method to correct for bias adds an extra local GLM step to the analysis.
The convex order appears to be the natural tool to compare competing models.
arXiv Detail & Related papers (2021-03-05T12:40:30Z) - Reintroducing Straight-Through Estimators as Principled Methods for
Stochastic Binary Networks [85.94999581306827]
Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and difficulty of optimization over discrete weights.
Many successful experimental results have been achieved with empirical straight-through (ST) approaches (the basic ST trick is sketched after this list).
At the same time, ST methods can be truly derived as estimators in the stochastic binary network (SBN) model with Bernoulli weights.
arXiv Detail & Related papers (2020-06-11T23:58:18Z) - A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient
Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.