A Theoretical Explanation of Activation Sparsity through Flat Minima and
Adversarial Robustness
- URL: http://arxiv.org/abs/2309.03004v4
- Date: Thu, 26 Oct 2023 15:02:02 GMT
- Title: A Theoretical Explanation of Activation Sparsity through Flat Minima and
Adversarial Robustness
- Authors: Ze Peng, Lei Qi, Yinghuan Shi, Yang Gao
- Abstract summary: A recent empirical observation of activation sparsity in MLP blocks offers an opportunity to drastically reduce computation costs for free.
We propose the notion of gradient sparsity as one source of activation sparsity and a theoretical explanation based on it.
- Score: 29.87592869483743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recent empirical observation (Li et al., 2022b) of activation sparsity in
MLP blocks offers an opportunity to drastically reduce computation costs for
free. Although having attributed it to training dynamics, existing theoretical
explanations of activation sparsity are restricted to shallow networks, small
training steps and special training, despite its emergence in deep models
standardly trained for a large number of steps. To fill these gaps, we propose
the notion of gradient sparsity as one source of activation sparsity and a
theoretical explanation based on it that sees sparsity as a necessary step toward
adversarial robustness w.r.t. hidden features and parameters, which is
approximately the flatness of minima for well-learned models. The theory
applies to standardly trained LayerNorm-ed MLPs, and further to Transformers or
other architectures trained with weight noises. Eliminating other sources of
flatness except for sparsity, we discover the phenomenon that the ratio between
the largest and smallest non-zero singular values of weight matrices is small.
When discussing the emergence of this spectral concentration, we use random
matrix theory (RMT) as a powerful tool to analyze stochastic gradient noises.
Validation experiments are conducted to verify our gradient-sparsity-based
explanation. We also propose two plug-and-play modules for both training and
finetuning to promote sparsity. Experiments on ImageNet-1k and C4 demonstrate
50% sparsity improvements, indicating further potential cost reduction in both
training and inference.
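To make these quantities concrete, below is a minimal PyTorch sketch (not the authors' code; the module and tensor shapes are illustrative) that measures activation sparsity in a LayerNorm-ed MLP block and the spectral concentration discussed above, i.e. the ratio between the largest and smallest non-zero singular values of a weight matrix. A freshly initialized block will show little sparsity; the point is the measurement itself, which would be applied to trained checkpoints.
```python
# Sketch: measure activation sparsity and spectral concentration (illustrative).
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """LayerNorm-ed MLP block, as in standard Transformer encoders."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        h = torch.relu(self.fc1(self.norm(x)))   # hidden activations of interest
        return x + self.fc2(h), h

def activation_sparsity(h, eps=1e-6):
    """Fraction of hidden activations that are (numerically) zero."""
    return (h.abs() <= eps).float().mean().item()

def spectral_concentration(weight, eps=1e-6):
    """Ratio of the largest to the smallest non-zero singular value."""
    s = torch.linalg.svdvals(weight)
    s = s[s > eps * s.max()]                      # keep non-zero singular values
    return (s.max() / s.min()).item()

block = MLPBlock()
x = torch.randn(8, 128, 512)                      # (batch, tokens, d_model)
_, h = block(x)
print("activation sparsity:", activation_sparsity(h))
print("fc1 spectral concentration:", spectral_concentration(block.fc1.weight))
```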
Related papers
- Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective [55.90119819642064]
We address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective.
The key issue is the cumulative effect of reconstruction errors throughout the sparsification process.
We derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue.
arXiv Detail & Related papers (2025-02-20T17:51:10Z)
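As a rough illustration of what layer-wise sparsity allocation means in practice, the sketch below spreads a global sparsity budget across layers with a simple linear schedule. The schedule is an assumption made for illustration only, not the allocation rule derived in the paper above.
```python
# Illustrative only: allocate per-layer pruning ratios under a global budget.
import numpy as np

def allocate_layerwise_sparsity(num_layers, target_sparsity=0.5, spread=0.2):
    """Return one pruning ratio per layer, averaging to `target_sparsity`.

    Later layers get higher sparsity here (an assumed schedule), on the rough
    intuition that errors introduced early compound through later layers.
    """
    offsets = np.linspace(-spread, spread, num_layers)
    return np.clip(target_sparsity + offsets, 0.0, 1.0)

print(allocate_layerwise_sparsity(num_layers=6))
# e.g. [0.3, 0.38, 0.46, 0.54, 0.62, 0.7], averaging 0.5
```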
- Learning Neural Networks with Sparse Activations [42.88109060676769]
In transformer networks, the activations in the hidden layer of the MLP block tend to be extremely sparse on any given input.
Unlike traditional forms of sparsity, where there are neurons/weights which can be deleted from the network, this form of activation sparsity appears to be harder to exploit.
We present a variety of results showing that classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts.
arXiv Detail & Related papers (2024-06-26T00:11:13Z)
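The sketch below shows the arithmetic identity that makes activation sparsity exploitable in principle: when most hidden units are zero, the second matmul of an MLP block only needs the columns corresponding to active units. Shapes and names are illustrative; real speedups additionally require sparsity-aware kernels.
```python
# Sketch: skip zeroed hidden units in the second matmul of an MLP block.
import torch

def sparse_mlp_forward(x, W1, b1, W2, b2):
    """Single-example MLP forward that only touches active hidden units.

    x: (d_in,), W1: (d_hidden, d_in), W2: (d_out, d_hidden)
    """
    h = torch.relu(W1 @ x + b1)                   # hidden activations, mostly zero
    idx = torch.nonzero(h, as_tuple=True)[0]      # indices of active units
    return W2[:, idx] @ h[idx] + b2               # only active columns of W2 used

d_in, d_hidden, d_out = 64, 256, 64
W1, b1 = torch.randn(d_hidden, d_in), torch.randn(d_hidden)
W2, b2 = torch.randn(d_out, d_hidden), torch.randn(d_out)
x = torch.randn(d_in)
dense = W2 @ torch.relu(W1 @ x + b1) + b2
assert torch.allclose(sparse_mlp_forward(x, W1, b1, W2, b2), dense, atol=1e-5)
```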
- Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch-normalization that provably has bounded gradients at any depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
arXiv Detail & Related papers (2023-10-03T12:35:02Z)
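One quick way to probe the gradient-explosion question empirically is to measure input-gradient norms of a linear MLP with batch normalization at increasing depth, as in the sketch below (an illustrative probe, not the paper's construction or its activation shaping scheme).
```python
# Sketch: does the input-gradient norm blow up as depth grows?
import torch
import torch.nn as nn

def grad_norm_at_depth(depth, width=256, batch=128):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width, bias=False), nn.BatchNorm1d(width)]
    net = nn.Sequential(*layers)                  # linear activations + BN
    x = torch.randn(batch, width, requires_grad=True)
    net(x).pow(2).mean().backward()               # scalar loss for the probe
    return x.grad.norm().item()

for depth in (2, 8, 32):
    print(depth, grad_norm_at_depth(depth))
```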
- The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter [113.35761858962522]
This paper studies induced sparse patterns across multiple large pre-trained vision and language transformers.
We propose the existence of essential sparsity, defined by a sharp dropping point beyond which the performance declines much faster.
We also find that essential sparsity holds for N:M sparsity patterns as well as on modern-scale large language models.
arXiv Detail & Related papers (2023-06-06T15:49:09Z)
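A hedged sketch of the kind of experiment that exposes such a dropping point: one-shot global magnitude pruning at increasing sparsity ratios, evaluating after each step. Here `model` and `evaluate` are placeholders for a pre-trained network and its evaluation routine; the paper's exact protocol may differ.
```python
# Sketch: sweep one-shot global magnitude pruning and record performance.
import copy
import torch

def magnitude_prune(model, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weight entries."""
    pruned = copy.deepcopy(model)
    mags = torch.cat([p.detach().abs().flatten()
                      for p in pruned.parameters() if p.dim() > 1])
    k = max(1, int(sparsity * mags.numel()))
    threshold = mags.kthvalue(k).values           # k-th smallest magnitude
    with torch.no_grad():
        for p in pruned.parameters():
            if p.dim() > 1:
                p.mul_((p.abs() > threshold).float())
    return pruned

def sparsity_sweep(model, evaluate, ratios=(0.1, 0.3, 0.5, 0.7, 0.9)):
    # `evaluate` is a placeholder returning e.g. validation accuracy.
    return {s: evaluate(magnitude_prune(model, s)) for s in ratios}
```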
- Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of "student" models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z)
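For reference, the standard soft-label distillation objective that such analyses start from is sketched below (generic vanilla KD, not the paper's variance-reduction machinery; the temperature and mixing weight are illustrative defaults).
```python
# Sketch: vanilla knowledge-distillation loss (cross-entropy + softened KL).
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL to the temperature-softened teacher."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)  # T^2 keeps the gradient scale
    return alpha * ce + (1.0 - alpha) * kl
```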
- Compact Model Training by Low-Rank Projection with Energy Transfer [13.446719541044663]
Low-rankness plays an important role in traditional machine learning, but is not so popular in deep learning.
Previous low-rank network compression methods compress networks by approximating pre-trained models and re-training.
We devise a new training method, low-rank projection with energy transfer, that trains low-rank compressed networks from scratch and achieves competitive performance.
arXiv Detail & Related papers (2022-04-12T06:53:25Z)
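The basic low-rank projection step underlying such methods can be sketched with a truncated SVD, as below; the paper's energy-transfer mechanism during training is not reproduced here.
```python
# Sketch: best rank-r approximation of a weight matrix via truncated SVD.
import torch

def project_to_rank(weight, rank):
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]

W = torch.randn(512, 2048)
W_low = project_to_rank(W, rank=64)
print("relative error:", (W - W_low).norm().item() / W.norm().item())
```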
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight reparameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a weight distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
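A minimal sketch of the reparameterisation idea as described in the abstract: each effective weight is a sign-preserving power of a raw parameter, so small weights receive smaller updates and cluster near zero. The exact functional form below, w = theta * |theta|**(alpha - 1), should be treated as an assumption rather than the paper's precise definition.
```python
# Sketch of a Powerpropagation-style linear layer (functional form assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerpropLinear(nn.Module):
    def __init__(self, in_features, out_features, alpha=2.0):
        super().__init__()
        self.raw = nn.Parameter(torch.randn(out_features, in_features) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.alpha = alpha

    def forward(self, x):
        # Effective weight: sign-preserving power of the raw parameter.
        w = self.raw * self.raw.abs().pow(self.alpha - 1.0)
        return F.linear(x, w, self.bias)
```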
- Autocalibration and Tweedie-dominance for Insurance Pricing with Machine Learning [0.0]
It is shown that minimizing deviance involves a trade-off between the integral of weighted differences of lower partial moments and the bias measured on a specific scale.
This new method to correct for bias adds an extra local GLM step to the analysis.
The convex order appears to be the natural tool to compare competing models.
arXiv Detail & Related papers (2021-03-05T12:40:30Z)
- A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)
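A toy 1-D illustration of the qualitative claim (not the paper's derivation): if minibatch noise near a minimum is modeled as curvature-scaled, then a sharp basin with the same barrier height is escaped far sooner than a flat one. All constants below are chosen only to make the contrast visible.
```python
# Toy illustration: noisy gradient descent escaping sharp vs. flat basins.
import numpy as np

def escape_steps(curvature, barrier=0.2, lr=0.1, batch=16, max_steps=100000,
                 seed=0):
    rng = np.random.default_rng(seed)
    half_width = np.sqrt(2.0 * barrier / curvature)   # distance to the barrier
    noise_std = np.sqrt(curvature / batch)            # assumed curvature-scaled noise
    x = 0.0
    for t in range(1, max_steps + 1):
        x -= lr * (curvature * x + noise_std * rng.standard_normal())
        if abs(x) > half_width:
            return t
    return max_steps                                   # never escaped

print("sharp (c=10): ", escape_steps(10.0))   # typically escapes within hundreds of steps
print("flat  (c=0.5):", escape_steps(0.5))    # typically hits the step cap
```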
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.