PHEW: Constructing Sparse Networks that Learn Fast and Generalize Well without Training Data
- URL: http://arxiv.org/abs/2010.11354v2
- Date: Wed, 23 Jun 2021 13:34:45 GMT
- Title: PHEW: Constructing Sparse Networks that Learn Fast and Generalize Well without Training Data
- Authors: Shreyas Malakarjun Patil, Constantine Dovrolis
- Abstract summary: Prior work has shown how to design sparse neural networks for faster convergence, without any training data, using the Synflow-L2 algorithm.
We propose a new method to construct sparse networks, without any training data, referred to as Paths with Higher-Edge Weights (PHEW).
- Score: 10.01323660393278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Methods that sparsify a network at initialization are important in practice
because they greatly improve the efficiency of both learning and inference. Our
work is based on a recently proposed decomposition of the Neural Tangent Kernel
(NTK) that has decoupled the dynamics of the training process into a
data-dependent component and an architecture-dependent kernel - the latter
referred to as Path Kernel. That work has shown how to design sparse neural
networks for faster convergence, without any training data, using the
Synflow-L2 algorithm. We first show that even though Synflow-L2 is optimal in
terms of convergence, for a given network density, it results in sub-networks
with "bottleneck" (narrow) layers - leading to poor performance as compared to
other data-agnostic methods that use the same number of parameters. Then we
propose a new method to construct sparse networks, without any training data,
referred to as Paths with Higher-Edge Weights (PHEW). PHEW is a probabilistic
network formation method based on biased random walks that only depends on the
initial weights. It has similar path kernel properties as Synflow-L2 but it
generates much wider layers, resulting in better generalization and
performance. PHEW achieves significant improvements over the data-independent
SynFlow and SynFlow-L2 methods at a wide range of network densities.
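Below is a minimal sketch of the biased-random-walk construction described in the abstract, assuming a fully-connected network given as a list of weight matrices of shape (n_out, n_in): walks start at input units, move layer by layer with probabilities proportional to the magnitudes of the initial weights, and every traversed connection is kept until the target density is reached. How walk-start units are chosen, and other details of the full PHEW procedure, are simplified here.
```python
# Sketch of PHEW-style mask construction for an MLP (assumptions: layers are
# weight matrices W[l] of shape (n_out, n_in); walks start at uniformly chosen
# input units; intended for the moderate densities used at initialization).
import numpy as np

def phew_masks(weights, density, rng=np.random.default_rng(0)):
    masks = [np.zeros_like(W, dtype=bool) for W in weights]
    target = int(density * sum(W.size for W in weights))
    while sum(m.sum() for m in masks) < target:
        unit = rng.integers(weights[0].shape[1])       # start at a random input unit
        for W, M in zip(weights, masks):
            p = np.abs(W[:, unit])                     # bias the walk by |initial weight|
            p = p / p.sum() if p.sum() > 0 else np.full(W.shape[0], 1.0 / W.shape[0])
            nxt = rng.choice(W.shape[0], p=p)          # sample the next unit
            M[nxt, unit] = True                        # keep every edge the walk uses
            unit = nxt
    return masks

W = [np.random.randn(64, 10), np.random.randn(32, 64), np.random.randn(1, 32)]
masks = phew_masks(W, density=0.2)                     # boolean masks over the initial weights
```
Applying the returned masks to the initial weights yields the sparse sub-network, which is then trained normally.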
Related papers
- Pushing the Efficiency Limit Using Structured Sparse Convolutions [82.31130122200578]
We propose Structured Sparse Convolution (SSC), which leverages the inherent structure in images to reduce the parameters in the convolutional filter.
We show that SSC is a generalization of commonly used layers (depthwise, groupwise and pointwise convolution) in efficient architectures.
Architectures based on SSC achieve state-of-the-art performance compared to baselines on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet classification benchmarks.
arXiv Detail & Related papers (2022-10-23T18:37:22Z)
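The SSC layer itself is not reproduced here; as context for the entry above, the sketch below (with illustrative channel and kernel sizes, not values from the paper) only compares a dense convolution against the standard depthwise + pointwise factorization that SSC is described as generalizing.
```python
# Parameter counts: dense convolution vs. the depthwise + pointwise factorization
# that structured-sparse layers such as SSC generalize (sizes are illustrative).
import torch.nn as nn

c_in, c_out, k = 64, 128, 3
dense = nn.Conv2d(c_in, c_out, k, padding=1)
factorized = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in),  # depthwise: one k x k filter per channel
    nn.Conv2d(c_in, c_out, 1),                         # pointwise: 1 x 1 channel mixing
)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(factorized))                 # 73856 vs. 8960 parameters
```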
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations.
This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z)
- Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction of the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions that achieve comparable error bounds, both in theory and in practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z)
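The paper's random-feature construction is not reproduced here; the sketch below only illustrates the underlying object, the empirical NTK of a small fully-connected ReLU network, whose entries are inner products of per-example parameter gradients (the network sizes are illustrative).
```python
# Empirical (finite-width) NTK entry: NTK(x, x') = <grad_theta f(x), grad_theta f(x')>.
import torch, torch.nn as nn

net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

def ntk_features(x):
    net.zero_grad()
    net(x.unsqueeze(0)).sum().backward()               # fill p.grad with d f(x) / d theta
    return torch.cat([p.grad.flatten() for p in net.parameters()])

x1, x2 = torch.randn(10), torch.randn(10)
print(torch.dot(ntk_features(x1), ntk_features(x2)))   # one empirical NTK kernel entry
```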
- Classifying high-dimensional Gaussian mixtures: Where kernel methods fail and neural networks succeed [27.38015169185521]
We show theoretically that two-layer neural networks (2LNN) with only a few hidden neurons can beat the performance of kernel learning.
We show how over-parametrising the neural network leads to faster convergence, but does not improve its final performance.
arXiv Detail & Related papers (2021-02-23T15:10:15Z)
- Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments.
In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network.
arXiv Detail & Related papers (2021-02-08T05:55:47Z)
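A minimal sketch of the N:M constraint itself (not the paper's training-from-scratch method): within every group of M consecutive weights, only the N of largest magnitude are kept, shown here for the common 2:4 pattern with illustrative sizes.
```python
# N:M fine-grained structured sparsity: in every group of M consecutive weights,
# keep only the N of largest magnitude (2:4 shown; array sizes are illustrative).
import numpy as np

def nm_prune(w, n=2, m=4):
    groups = w.reshape(-1, m)                             # assumes w.size % m == 0
    idx = np.argsort(np.abs(groups), axis=1)[:, :m - n]   # smallest-magnitude positions
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, idx, False, axis=1)           # zero out the smallest m - n
    return (groups * mask).reshape(w.shape)

w = np.random.randn(8, 8)
print(nm_prune(w))                                        # exactly 2 nonzeros per group of 4
```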
- Learning Sparse Filters in Deep Convolutional Neural Networks with a l1/l2 Pseudo-Norm [5.3791844634527495]
Deep neural networks (DNNs) have proven to be efficient for numerous tasks, but come at a high memory and computation cost.
Recent research has shown that their structure can be more compact without compromising their performance.
We present a sparsity-inducing regularization term based on the ratio l1/l2 pseudo-norm defined on the filter coefficients.
arXiv Detail & Related papers (2020-07-20T11:56:12Z)
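A generic sketch of an l1/l2 ratio penalty on filter coefficients follows; the filter sizes and penalty weight are illustrative, and the paper's exact formulation may differ.
```python
# Sparsity-inducing l1/l2 ratio penalty summed over convolutional filters.
import torch

def l1_over_l2(conv_weight, eps=1e-8):
    filters = conv_weight.flatten(start_dim=1)            # one row per output filter
    return (filters.abs().sum(dim=1) / (filters.norm(dim=1) + eps)).sum()

w = torch.randn(16, 3, 3, 3, requires_grad=True)          # (out_ch, in_ch, k, k) filter bank
task_loss = w.pow(2).mean()                               # stand-in for the real training loss
(task_loss + 1e-3 * l1_over_l2(w)).backward()             # penalty added to the objective
```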
- A Neural Network Approach for Online Nonlinear Neyman-Pearson Classification [3.6144103736375857]
We propose a novel Neyman-Pearson (NP) classifier that is, for the first time in the literature, both online and nonlinear.
The proposed classifier operates on a binary labeled data stream in an online manner, and maximizes the detection power subject to a user-specified and controllable false positive rate.
Our algorithm is appropriate for large-scale data applications and provides good false positive rate controllability with real-time processing.
arXiv Detail & Related papers (2020-06-14T20:00:25Z)
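A generic online Neyman-Pearson-style sketch, not the paper's algorithm: logistic updates push up the score on positives (detection power) while a Lagrange multiplier, adapted from observed false positives, keeps the false positive rate near a user-specified target alpha. All names and step sizes below are hypothetical.
```python
# Online NP-style logistic classifier sketch: constrain FPR via a dual variable lam.
import numpy as np

def online_np(stream, dim, alpha=0.05, lr=0.1, lr_lam=0.01):
    w, lam = np.zeros(dim + 1), 1.0
    for x, y in stream:                                   # y in {0, 1}
        z = np.append(x, 1.0)                             # append a bias term
        p = 1.0 / (1.0 + np.exp(-w @ z))
        if y == 1:
            w -= lr * (p - 1.0) * z                       # logistic loss gradient on positives
        else:
            w -= lr * lam * p * z                         # negatives weighted by lam
            lam = max(0.0, lam + lr_lam * (float(p > 0.5) - alpha))  # instantaneous FPR estimate
    return w, lam

w, lam = online_np(((np.random.randn(5), np.random.randint(2)) for _ in range(1000)), dim=5)
```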
- Pruning neural networks without any data by iteratively conserving synaptic flow [27.849332212178847]
Pruning the parameters of deep neural networks has generated intense interest due to potential savings in time, memory and energy.
Recent works have identified, through an expensive sequence of training and pruning cycles, the existence of winning lottery tickets or sparse trainable subnetworks.
We provide an affirmative answer to the question of whether such subnetworks can be identified at initialization, without ever training, through theory-driven algorithm design.
arXiv Detail & Related papers (2020-06-09T19:21:57Z)
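A minimal sketch of one SynFlow-style scoring pass for the entry above, assuming a small bias-free MLP: weights are replaced by their absolute values, an all-ones input is propagated, and each weight is scored by |dR/dw * w|. The full method prunes iteratively over many such passes with an increasing sparsity schedule, which is omitted here.
```python
# One data-free SynFlow-style scoring pass (iterative pruning schedule omitted).
import torch, torch.nn as nn

net = nn.Sequential(nn.Linear(10, 64, bias=False), nn.ReLU(), nn.Linear(64, 1, bias=False))

def synflow_scores(net):
    signs = [p.data.sign() for p in net.parameters()]
    for p in net.parameters():
        p.data.abs_()                                     # linearize: all-positive weights
    R = net(torch.ones(1, 10)).sum()                      # synaptic flow objective
    grads = torch.autograd.grad(R, list(net.parameters()))
    for p, s in zip(net.parameters(), signs):
        p.data *= s                                       # restore original signs
    return [(g * p.data).abs() for g, p in zip(grads, net.parameters())]

print([s.shape for s in synflow_scores(net)])             # one score per weight
```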
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization for large-scale data with a deep neural network as the model.
Our method requires far fewer communication rounds while retaining its theoretical guarantees.
Our experiments on several datasets demonstrate the effectiveness of our method and confirm the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.