A Theoretical Explanation of Activation Sparsity through Flat Minima and
Adversarial Robustness
- URL: http://arxiv.org/abs/2309.03004v4
- Date: Thu, 26 Oct 2023 15:02:02 GMT
- Title: A Theoretical Explanation of Activation Sparsity through Flat Minima and
Adversarial Robustness
- Authors: Ze Peng, Lei Qi, Yinghuan Shi, Yang Gao
- Abstract summary: A recent empirical observation of activation sparsity in MLP blocks offers an opportunity to drastically reduce computation costs for free.
We propose the notion of gradient sparsity as one source of activation sparsity and a theoretical explanation based on it.
- Score: 29.87592869483743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recent empirical observation (Li et al., 2022b) of activation sparsity in
MLP blocks offers an opportunity to drastically reduce computation costs for
free. Although having attributed it to training dynamics, existing theoretical
explanations of activation sparsity are restricted to shallow networks, small
training steps and special training, despite its emergence in deep models
standardly trained for a large number of steps. To fill these gaps, we propose
the notion of gradient sparsity as one source of activation sparsity and a
theoretical explanation based on it that sees sparsity as a necessary step toward
adversarial robustness w.r.t. hidden features and parameters, which is
approximately the flatness of minima for well-learned models. The theory
applies to standardly trained LayerNorm-ed MLPs, and further to Transformers or
other architectures trained with weight noises. Eliminating other sources of
flatness except for sparsity, we discover the phenomenon that the ratio between
the largest and smallest non-zero singular values of weight matrices is small.
When discussing the emergence of this spectral concentration, we use random
matrix theory (RMT) as a powerful tool to analyze stochastic gradient noises.
Validation experiments are conducted to verify our gradient-sparsity-based
explanation. We also propose two plug-and-play modules for both training and
finetuning to promote sparsity. Experiments on ImageNet-1k and C4 demonstrate
50% sparsity improvements, indicating further potential cost reduction in both
training and inference.
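To make these quantities concrete, below is a minimal PyTorch sketch (not the authors' code; the module and tensor shapes are illustrative) that measures activation sparsity in a LayerNorm-ed MLP block and the spectral concentration discussed above, i.e. the ratio between the largest and smallest non-zero singular values of a weight matrix. A freshly initialized block will show little sparsity; the point is the measurement itself, which would be applied to trained checkpoints.
```python
# Sketch: measure activation sparsity and spectral concentration (illustrative).
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """LayerNorm-ed MLP block, as in standard Transformer encoders."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        h = torch.relu(self.fc1(self.norm(x)))   # hidden activations of interest
        return x + self.fc2(h), h

def activation_sparsity(h, eps=1e-6):
    """Fraction of hidden activations that are (numerically) zero."""
    return (h.abs() <= eps).float().mean().item()

def spectral_concentration(weight, eps=1e-6):
    """Ratio of the largest to the smallest non-zero singular value."""
    s = torch.linalg.svdvals(weight)
    s = s[s > eps * s.max()]                      # keep non-zero singular values
    return (s.max() / s.min()).item()

block = MLPBlock()
x = torch.randn(8, 128, 512)                      # (batch, tokens, d_model)
_, h = block(x)
print("activation sparsity:", activation_sparsity(h))
print("fc1 spectral concentration:", spectral_concentration(block.fc1.weight))
```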
Related papers
- Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective [55.90119819642064]
We address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective.
The key issue is the cumulative effect of reconstruction errors throughout the sparsification process.
We derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue.
arXiv Detail & Related papers (2025-02-20T17:51:10Z)
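As a rough illustration of what layer-wise sparsity allocation means in practice, the sketch below spreads a global sparsity budget across layers with a simple linear schedule. The schedule is an assumption made for illustration only, not the allocation rule derived in the paper above.
```python
# Illustrative only: allocate per-layer pruning ratios under a global budget.
import numpy as np

def allocate_layerwise_sparsity(num_layers, target_sparsity=0.5, spread=0.2):
    """Return one pruning ratio per layer, averaging to `target_sparsity`.

    Later layers get higher sparsity here (an assumed schedule), on the rough
    intuition that errors introduced early compound through later layers.
    """
    offsets = np.linspace(-spread, spread, num_layers)
    return np.clip(target_sparsity + offsets, 0.0, 1.0)

print(allocate_layerwise_sparsity(num_layers=6))
# e.g. [0.3, 0.38, 0.46, 0.54, 0.62, 0.7], averaging 0.5
```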
- Learning Neural Networks with Sparse Activations [42.88109060676769]
In transformer networks, the activations in the hidden layer of the MLP block tend to be extremely sparse on any given input.
Unlike traditional forms of sparsity, where there are neurons/weights which can be deleted from the network, this form of activation sparsity appears to be harder to exploit.
We present a variety of results showing that classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts.
arXiv Detail & Related papers (2024-06-26T00:11:13Z)
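The sketch below shows the arithmetic identity that makes activation sparsity exploitable in principle: when most hidden units are zero, the second matmul of an MLP block only needs the columns corresponding to active units. Shapes and names are illustrative; real speedups additionally require sparsity-aware kernels.
```python
# Sketch: skip zeroed hidden units in the second matmul of an MLP block.
import torch

def sparse_mlp_forward(x, W1, b1, W2, b2):
    """Single-example MLP forward that only touches active hidden units.

    x: (d_in,), W1: (d_hidden, d_in), W2: (d_out, d_hidden)
    """
    h = torch.relu(W1 @ x + b1)                   # hidden activations, mostly zero
    idx = torch.nonzero(h, as_tuple=True)[0]      # indices of active units
    return W2[:, idx] @ h[idx] + b2               # only active columns of W2 used

d_in, d_hidden, d_out = 64, 256, 64
W1, b1 = torch.randn(d_hidden, d_in), torch.randn(d_hidden)
W2, b2 = torch.randn(d_out, d_hidden), torch.randn(d_out)
x = torch.randn(d_in)
dense = W2 @ torch.relu(W1 @ x + b1) + b2
assert torch.allclose(sparse_mlp_forward(x, W1, b1, W2, b2), dense, atol=1e-5)
```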
- Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch-normalization that provably has bounded gradients at any depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
arXiv Detail & Related papers (2023-10-03T12:35:02Z)
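One quick way to probe the gradient-explosion question empirically is to measure input-gradient norms of a linear MLP with batch normalization at increasing depth, as in the sketch below (an illustrative probe, not the paper's construction or its activation shaping scheme).
```python
# Sketch: does the input-gradient norm blow up as depth grows?
import torch
import torch.nn as nn

def grad_norm_at_depth(depth, width=256, batch=128):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width, bias=False), nn.BatchNorm1d(width)]
    net = nn.Sequential(*layers)                  # linear activations + BN
    x = torch.randn(batch, width, requires_grad=True)
    net(x).pow(2).mean().backward()               # scalar loss for the probe
    return x.grad.norm().item()

for depth in (2, 8, 32):
    print(depth, grad_norm_at_depth(depth))
```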
- The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter [113.35761858962522]
This paper studies induced sparse patterns across multiple large pre-trained vision and language transformers.
We propose the existence of essential sparsity, defined by a sharp dropping point beyond which the performance declines much faster.
We also find that essential sparsity holds for N:M sparsity patterns as well as on modern-scale large language models.
arXiv Detail & Related papers (2023-06-06T15:49:09Z)
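A hedged sketch of the kind of experiment that exposes such a dropping point: one-shot global magnitude pruning at increasing sparsity ratios, evaluating after each step. Here `model` and `evaluate` are placeholders for a pre-trained network and its evaluation routine; the paper's exact protocol may differ.
```python
# Sketch: sweep one-shot global magnitude pruning and record performance.
import copy
import torch

def magnitude_prune(model, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weight entries."""
    pruned = copy.deepcopy(model)
    mags = torch.cat([p.detach().abs().flatten()
                      for p in pruned.parameters() if p.dim() > 1])
    k = max(1, int(sparsity * mags.numel()))
    threshold = mags.kthvalue(k).values           # k-th smallest magnitude
    with torch.no_grad():
        for p in pruned.parameters():
            if p.dim() > 1:
                p.mul_((p.abs() > threshold).float())
    return pruned

def sparsity_sweep(model, evaluate, ratios=(0.1, 0.3, 0.5, 0.7, 0.9)):
    # `evaluate` is a placeholder returning e.g. validation accuracy.
    return {s: evaluate(magnitude_prune(model, s)) for s in ratios}
```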
- Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of "student" models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z)
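For reference, the standard soft-label distillation objective that such analyses start from is sketched below (generic vanilla KD, not the paper's variance-reduction machinery; the temperature and mixing weight are illustrative defaults).
```python
# Sketch: vanilla knowledge-distillation loss (cross-entropy + softened KL).
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL to the temperature-softened teacher."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)  # T^2 keeps the gradient scale
    return alpha * ce + (1.0 - alpha) * kl
```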
- Compact Model Training by Low-Rank Projection with Energy Transfer [13.446719541044663]
Low-rankness plays an important role in traditional machine learning, but is not so popular in deep learning.
Previous low-rank network compression methods compress networks by approximating pre-trained models and re-training.
We devise a new training method, low-rank projection with energy transfer, that trains low-rank compressed networks from scratch and achieves competitive performance.
arXiv Detail & Related papers (2022-04-12T06:53:25Z)
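The basic low-rank projection step underlying such methods can be sketched with a truncated SVD, as below; the paper's energy-transfer mechanism during training is not reproduced here.
```python
# Sketch: best rank-r approximation of a weight matrix via truncated SVD.
import torch

def project_to_rank(weight, rank):
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]

W = torch.randn(512, 2048)
W_low = project_to_rank(W, rank=64)
print("relative error:", (W - W_low).norm().item() / W.norm().item())
```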
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight reparameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a weight distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
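A minimal sketch of the reparameterisation idea as described in the abstract: each effective weight is a sign-preserving power of a raw parameter, so small weights receive smaller updates and cluster near zero. The exact functional form below, w = theta * |theta|**(alpha - 1), should be treated as an assumption rather than the paper's precise definition.
```python
# Sketch of a Powerpropagation-style linear layer (functional form assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerpropLinear(nn.Module):
    def __init__(self, in_features, out_features, alpha=2.0):
        super().__init__()
        self.raw = nn.Parameter(torch.randn(out_features, in_features) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.alpha = alpha

    def forward(self, x):
        # Effective weight: sign-preserving power of the raw parameter.
        w = self.raw * self.raw.abs().pow(self.alpha - 1.0)
        return F.linear(x, w, self.bias)
```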
- Autocalibration and Tweedie-dominance for Insurance Pricing with Machine Learning [0.0]
It is shown that minimizing deviance involves a trade-off between the integral of weighted differences of lower partial moments and the bias measured on a specific scale.
This new method to correct for bias adds an extra local GLM step to the analysis.
The convex order appears to be the natural tool to compare competing models.
arXiv Detail & Related papers (2021-03-05T12:40:30Z)
- A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)
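A toy 1-D illustration of the qualitative claim (not the paper's derivation): if minibatch noise near a minimum is modeled as curvature-scaled, then a sharp basin with the same barrier height is escaped far sooner than a flat one. All constants below are chosen only to make the contrast visible.
```python
# Toy illustration: noisy gradient descent escaping sharp vs. flat basins.
import numpy as np

def escape_steps(curvature, barrier=0.2, lr=0.1, batch=16, max_steps=100000,
                 seed=0):
    rng = np.random.default_rng(seed)
    half_width = np.sqrt(2.0 * barrier / curvature)   # distance to the barrier
    noise_std = np.sqrt(curvature / batch)            # assumed curvature-scaled noise
    x = 0.0
    for t in range(1, max_steps + 1):
        x -= lr * (curvature * x + noise_std * rng.standard_normal())
        if abs(x) > half_width:
            return t
    return max_steps                                   # never escaped

print("sharp (c=10): ", escape_steps(10.0))   # typically escapes within hundreds of steps
print("flat  (c=0.5):", escape_steps(0.5))    # typically hits the step cap
```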
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.