A Theoretical View on Sparsely Activated Networks
- URL: http://arxiv.org/abs/2208.04461v1
- Date: Mon, 8 Aug 2022 23:14:48 GMT
- Title: A Theoretical View on Sparsely Activated Networks
- Authors: Cenk Baykal, Nishanth Dikkala, Rina Panigrahy, Cyrus Rashtchian, Xin Wang
- Abstract summary: We present a formal model of data-dependent sparse networks that captures salient aspects of popular architectures.
We then introduce a routing function based on locality sensitive hashing (LSH) that enables us to reason about how well sparse networks approximate target functions.
We prove that sparse networks can match the approximation power of dense networks on Lipschitz functions.
- Score: 21.156069843782017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep and wide neural networks successfully fit very complex functions today,
but dense models are starting to be prohibitively expensive for inference. To
mitigate this, one promising direction is networks that activate a sparse
subgraph of the network. The subgraph is chosen by a data-dependent routing
function, enforcing a fixed mapping of inputs to subnetworks (e.g., the Mixture
of Experts (MoE) paradigm in Switch Transformers). However, prior work is
largely empirical, and while existing routing functions work well in practice,
they do not lead to theoretical guarantees on approximation ability. We aim to
provide a theoretical explanation for the power of sparse networks. As our
first contribution, we present a formal model of data-dependent sparse networks
that captures salient aspects of popular architectures. We then introduce a
routing function based on locality sensitive hashing (LSH) that enables us to
reason about how well sparse networks approximate target functions. After
representing LSH-based sparse networks with our model, we prove that sparse
networks can match the approximation power of dense networks on Lipschitz
functions. Applying LSH on the input vectors means that the experts interpolate
the target function in different subregions of the input space. To support our
theory, we define various datasets based on Lipschitz target functions, and we
show that sparse networks give a favorable trade-off between the number of active
units and approximation quality.
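The routing idea can be made concrete with a small sketch. The following is a minimal illustration (not the paper's exact construction): inputs are hashed with random hyperplanes, each hash bucket owns one expert, and every expert only has to fit the target function on its own subregion of the input space. All class names, the toy constant-valued experts, and the hyperparameters are invented for this example.

```python
# Minimal sketch of LSH-based routing to sparse experts (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

class LSHRouter:
    """Signs of k random hyperplanes map an input to one of 2**k buckets."""
    def __init__(self, dim, num_bits):
        self.planes = rng.normal(size=(num_bits, dim))

    def bucket(self, x):
        bits = (self.planes @ x > 0).astype(int)
        return int("".join(map(str, bits)), 2)

class ConstantExpert:
    """Toy expert: predicts the mean target value seen in its bucket,
    i.e. a piecewise-constant interpolation of a Lipschitz target."""
    def __init__(self):
        self.values = []

    def fit(self, y):
        self.values.append(y)

    def predict(self):
        return float(np.mean(self.values)) if self.values else 0.0

def target(x):
    # Example Lipschitz target function.
    return np.linalg.norm(x) / np.sqrt(len(x))

dim, num_bits, n_train = 8, 6, 5000
router = LSHRouter(dim, num_bits)
experts = [ConstantExpert() for _ in range(2 ** num_bits)]

# "Training": each expert only ever sees the inputs hashed to its bucket.
for _ in range(n_train):
    x = rng.uniform(-1, 1, size=dim)
    experts[router.bucket(x)].fit(target(x))

# Inference activates a single expert per input (sparse activation).
x_test = rng.uniform(-1, 1, size=dim)
print("prediction:", experts[router.bucket(x_test)].predict(),
      "target:", target(x_test))
```

Increasing the number of hash bits refines the partition of the input space, which trades more (but still sparsely activated) experts for lower approximation error on a Lipschitz target.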
Related papers
- ReLU Neural Networks with Linear Layers are Biased Towards Single- and Multi-Index Models [9.96121040675476]
This manuscript explores how properties of functions learned by neural networks of depth greater than two layers affect predictions.
Our framework considers a family of networks of varying depths that all have the same capacity but different representation costs.
arXiv Detail & Related papers (2023-05-24T22:10:12Z) - Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z) - Probabilistic Verification of ReLU Neural Networks via Characteristic
Functions [11.489187712465325]
We use ideas from probability theory in the frequency domain to provide probabilistic verification guarantees for ReLU neural networks.
We interpret a (deep) feedforward neural network as a discrete dynamical system over a finite horizon.
We obtain the corresponding cumulative distribution function of the output set, which can be used to check if the network is performing as expected.
arXiv Detail & Related papers (2022-12-03T05:53:57Z) - Globally Gated Deep Linear Networks [3.04585143845864]
We introduce Globally Gated Deep Linear Networks (GGDLNs) where gating units are shared among all processing units in each layer.
We derive exact equations for the generalization properties in these networks in the finite-width thermodynamic limit.
Our work is the first exact theoretical solution of learning in a family of nonlinear networks with finite width.
arXiv Detail & Related papers (2022-10-31T16:21:56Z) - Robust Training and Verification of Implicit Neural Networks: A
Non-Euclidean Contractive Approach [64.23331120621118]
This paper proposes a theoretical and computational framework for training and robustness verification of implicit neural networks.
We introduce a related embedded network and show that the embedded network can be used to provide an $\ell_\infty$-norm box over-approximation of the reachable sets of the original network.
We apply our algorithms to train implicit neural networks on the MNIST dataset and compare the robustness of our models with the models trained via existing approaches in the literature.
arXiv Detail & Related papers (2022-08-08T03:13:24Z) - Benefits of Overparameterized Convolutional Residual Networks: Function
Approximation under Smoothness Constraint [48.25573695787407]
We prove that large ConvResNets can not only approximate a target function in terms of function value, but also exhibit sufficient first-order smoothness.
Our theory partially justifies the benefits of using deep and wide networks in practice.
arXiv Detail & Related papers (2022-06-09T15:35:22Z) - The Connection Between Approximation, Depth Separation and Learnability
in Neural Networks [70.55686685872008]
We study the connection between learnability and approximation capacity.
We show that learnability with deep networks of a target function depends on the ability of simpler classes to approximate the target.
arXiv Detail & Related papers (2021-01-31T11:32:30Z) - Faster Convergence in Deep-Predictive-Coding Networks to Learn Deeper
Representations [12.716429755564821]
Deep-predictive-coding networks (DPCNs) are hierarchical, generative models that rely on feed-forward and feed-back connections.
A crucial element of DPCNs is a forward-backward inference procedure to uncover sparse states of a dynamic model.
We propose an optimization strategy, with better empirical and theoretical convergence, based on accelerated proximal gradients.
arXiv Detail & Related papers (2021-01-18T02:30:13Z) - Network Adjustment: Channel Search Guided by FLOPs Utilization Ratio [101.84651388520584]
This paper presents a new framework named network adjustment, which considers network accuracy as a function of FLOPs.
Experiments on standard image classification datasets and a wide range of base networks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-04-06T15:51:00Z) - ReActNet: Towards Precise Binary Neural Network with Generalized
Activation Functions [76.05981545084738]
We propose several ideas for enhancing a binary network to close its accuracy gap from real-valued networks without incurring any additional computational cost.
We first construct a baseline network by modifying and binarizing a compact real-valued network with parameter-free shortcuts.
We show that the proposed ReActNet outperforms all state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2020-03-07T02:12:02Z)