Dep-$L_0$: Improving $L_0$-based Network Sparsification via Dependency
Modeling
- URL: http://arxiv.org/abs/2107.00070v1
- Date: Wed, 30 Jun 2021 19:33:35 GMT
- Title: Dep-$L_0$: Improving $L_0$-based Network Sparsification via Dependency
Modeling
- Authors: Yang Li and Shihao Ji
- Abstract summary: Training deep neural networks with an $L_0$ regularization is one of the prominent approaches for network pruning or sparsification.
Recent work of Gale et al. shows that this method performs inconsistently on large-scale learning tasks, such as ResNet50 on ImageNet.
We propose to model the dependency among binary gates, which can be captured effectively by a multi-layer perceptron (MLP).
- Score: 6.081082481356211
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training deep neural networks with an $L_0$ regularization is one of the
prominent approaches for network pruning or sparsification. The method prunes
the network during training by encouraging weights to become exactly zero.
However, recent work of Gale et al. reveals that although this method yields
high compression rates on smaller datasets, it performs inconsistently on
large-scale learning tasks, such as ResNet50 on ImageNet. We analyze this
phenomenon through the lens of variational inference and find that it is likely
due to the independent modeling of binary gates, i.e., the mean-field approximation,
which is known in Bayesian statistics to be a crude approximation that often
performs poorly. To mitigate this deficiency, we propose to model the dependency
among binary gates, which can be captured effectively by a multi-layer perceptron
(MLP). We term our algorithm Dep-$L_0$ as it prunes networks via a
dependency-enabled $L_0$ regularization. Extensive experiments on CIFAR10,
CIFAR100 and ImageNet with VGG16, ResNet50, ResNet56 show that our Dep-$L_0$
outperforms the original $L_0$-HC algorithm of Louizos et al. by a significant
margin, especially on ImageNet. Compared with state-of-the-art network
sparsification algorithms, our dependency modeling makes $L_0$-based
sparsification once again very competitive on large-scale learning tasks. Our
source code is available at https://github.com/leo-yangli/dep-l0.
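To make the dependency idea concrete, below is a minimal sketch (not the authors' implementation) combining an MLP gate generator with the hard concrete relaxation used in Louizos et al.'s $L_0$-HC: each layer's gate logits are conditioned on the previous layer's sampled gates rather than being independent mean-field parameters. The generator architecture, the conditioning scheme, the hidden width, and all constants are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation) of
# dependency-modeled L_0 gates: an MLP produces each layer's gate logits
# conditioned on the previous layer's sampled gates, and the hard concrete
# relaxation (Louizos et al.) gives a differentiable L_0 penalty.
# All architecture choices and constants below are assumptions.
import math
import torch
import torch.nn as nn

GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0   # standard hard-concrete constants


def hard_concrete_sample(log_alpha):
    """Sample relaxed binary gates z in [0, 1] from the hard concrete distribution."""
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / BETA)
    return (s * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)


def expected_l0(log_alpha):
    """Expected number of active gates, i.e. the differentiable L_0 penalty."""
    return torch.sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA)).sum()


class DependentGateGenerator(nn.Module):
    """MLP that outputs each layer's gate logits conditioned on the previous
    layer's sampled gates, instead of independent (mean-field) logits."""

    def __init__(self, layer_widths, hidden=64):
        super().__init__()
        self.heads = nn.ModuleList()
        prev = 1                             # dummy context size for the first layer
        for width in layer_widths:
            self.heads.append(nn.Sequential(
                nn.Linear(prev, hidden), nn.ReLU(), nn.Linear(hidden, width)))
            prev = width

    def forward(self):
        gates, l0 = [], 0.0
        context = torch.ones(1, 1)           # fixed seed input for the first head
        for head in self.heads:
            log_alpha = head(context).squeeze(0)
            z = hard_concrete_sample(log_alpha)
            gates.append(z)                   # multiply this layer's channels by z
            l0 = l0 + expected_l0(log_alpha)
            context = z.unsqueeze(0)          # dependency: next logits see these gates
        return gates, l0                      # add lambda * l0 to the task loss
```

In such a setup, one would multiply each layer's output channels by the corresponding gate vector during training and add the weighted expected-$L_0$ term to the task loss; at test time the gates are thresholded to exact zeros and ones, which yields the pruned architecture.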
Related papers
- On the Convergence of Federated Averaging under Partial Participation for Over-parameterized Neural Networks [13.2844023993979]
Federated learning (FL) is a widely employed distributed paradigm for collaboratively training machine learning models across multiple clients without sharing local data.
In this paper, we show that FedAvg converges to a global minimum.
arXiv Detail & Related papers (2023-10-09T07:56:56Z) - Optimization Guarantees of Unfolded ISTA and ADMM Networks With Smooth
Soft-Thresholding [57.71603937699949]
We study optimization guarantees, i.e., achieving near-zero training loss with the increase in the number of learning epochs.
We show that the threshold on the number of training samples increases with the increase in the network width.
arXiv Detail & Related papers (2023-09-12T13:03:47Z) - The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich
Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets, on the order of $P^* \sim \sqrt{N}$, for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Algorithms for Efficiently Learning Low-Rank Neural Networks [12.916132936159713]
We study algorithms for learning low-rank neural networks.
We present a provably efficient algorithm which learns an optimal low-rank approximation to a single-hidden-layer ReLU network.
We propose a novel low-rank framework for training low-rank deep networks.
arXiv Detail & Related papers (2022-02-02T01:08:29Z) - Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z) - Manifold Regularized Dynamic Network Pruning [102.24146031250034]
This paper proposes a new paradigm that dynamically removes redundant filters by embedding the manifold information of all instances into the space of pruned networks.
The effectiveness of the proposed method is verified on several benchmarks, which shows better performance in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2021-03-10T03:59:03Z) - HALO: Learning to Prune Neural Networks with Shrinkage [5.283963846188862]
Deep neural networks achieve state-of-the-art performance in a variety of tasks by extracting a rich set of features from unstructured data.
Modern techniques for inducing sparsity and reducing model size are (1) network pruning, (2) training with a sparsity inducing penalty, and (3) training a binary mask jointly with the weights of the network.
We present a novel penalty called Hierarchical Adaptive Lasso which learns to adaptively sparsify weights of a given network via trainable parameters.
arXiv Detail & Related papers (2020-08-24T04:08:48Z) - A Revision of Neural Tangent Kernel-based Approaches for Neural Networks [34.75076385561115]
We use the neural tangent kernel to show that networks can fit any finite training sample perfectly.
A simple and analytic kernel function was derived that is indeed equivalent to a fully-trained network.
Our tighter analysis resolves the scaling problem and enables the validation of the original NTK-based results.
arXiv Detail & Related papers (2020-07-02T05:07:55Z) - Fitting the Search Space of Weight-sharing NAS with Graph Convolutional
Networks [100.14670789581811]
We train a graph convolutional network to fit the performance of sampled sub-networks.
With this strategy, we achieve a higher rank correlation coefficient in the selected set of candidates.
arXiv Detail & Related papers (2020-04-17T19:12:39Z) - Towards Deep Learning Models Resistant to Large Perturbations [0.0]
Adversarial robustness has proven to be a required property of machine learning algorithms.
We show that the well-established algorithm called "adversarial training" fails to train a deep neural network given a large, but reasonable, perturbation magnitude.
arXiv Detail & Related papers (2020-03-30T12:03:09Z)