On the Modularity of Hypernetworks
- URL: http://arxiv.org/abs/2002.10006v2
- Date: Mon, 2 Nov 2020 12:22:00 GMT
- Title: On the Modularity of Hypernetworks
- Authors: Tomer Galanti, Lior Wolf
- Abstract summary: We show that for a structured target function, the overall number of trainable parameters in a hypernetwork is smaller by orders of magnitude than the number of trainable parameters of a standard neural network and an embedding method.
- Score: 103.1147622394852
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the context of learning to map an input $I$ to a function
$h_I:\mathcal{X}\to \mathbb{R}$, two alternative methods are compared: (i) an
embedding-based method, which learns a fixed function in which $I$ is encoded
as a conditioning signal $e(I)$ and the learned function takes the form $h_I(x)
= q(x,e(I))$, and (ii) hypernetworks, in which the weights $\theta_I$ of the
function $h_I(x) = g(x;\theta_I)$ are given by a hypernetwork $f$ as
$\theta_I=f(I)$. In this paper, we define the property of modularity as the
ability to effectively learn a different function for each input instance $I$.
For this purpose, we adopt an expressivity perspective of this property and
extend the theory of DeVore et al. (1996) and provide a lower bound on the
complexity (number of trainable parameters) of neural networks as function
approximators, by eliminating the requirements for the approximation method to
be robust. Our results are then used to compare the complexities of $q$ and
$g$, showing that under certain conditions and when letting the functions $e$
and $f$ be as large as we wish, $g$ can be smaller than $q$ by orders of
magnitude. This sheds light on the modularity of hypernetworks in comparison
with the embedding-based method. In addition, we show that for a structured target
function, the overall number of trainable parameters in a hypernetwork is
smaller by orders of magnitude than the number of trainable parameters of a
standard neural network and an embedding method.
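
The following is a minimal sketch, not the paper's implementation, contrasting the two alternatives from the abstract in PyTorch: an embedding-based model that computes $q(x, e(I))$, and a hypernetwork $f$ that emits the weights $\theta_I$ of a small primary network $g$. The layer widths, class names (EmbeddingMethod, HyperMethod), and the flat layout of $\theta_I$ are illustrative assumptions. Note that $f$ (like $e$) can be made as large as we wish while $g$ stays small, which is the regime the paper's complexity comparison addresses.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not taken from the paper):
# D_I = dim of the input instance I, D_X = dim of x,
# D_E = dim of the conditioning signal e(I), D_G = hidden width of g.
D_I, D_X, D_E, D_G = 64, 8, 32, 16

# (i) Embedding-based method: h_I(x) = q(x, e(I)) with a single fixed q.
class EmbeddingMethod(nn.Module):
    def __init__(self):
        super().__init__()
        self.e = nn.Sequential(nn.Linear(D_I, D_E), nn.ReLU())       # conditioning signal e(I)
        self.q = nn.Sequential(nn.Linear(D_X + D_E, 128), nn.ReLU(),
                               nn.Linear(128, 1))                    # fixed function q(x, e(I))

    def forward(self, I, x):
        return self.q(torch.cat([x, self.e(I)], dim=-1))

# (ii) Hypernetwork: f(I) outputs the weights theta_I of a small primary network g.
class HyperMethod(nn.Module):
    def __init__(self):
        super().__init__()
        # theta_I parameterizes a one-hidden-layer g: D_X -> D_G -> 1
        self.n_theta = (D_X * D_G + D_G) + (D_G + 1)
        self.f = nn.Sequential(nn.Linear(D_I, 256), nn.ReLU(),
                               nn.Linear(256, self.n_theta))         # hypernetwork f(I) = theta_I

    def forward(self, I, x):
        w1, b1, w2, b2 = torch.split(self.f(I), [D_X * D_G, D_G, D_G, 1], dim=-1)
        h = torch.relu(torch.einsum('bi,bji->bj', x, w1.view(-1, D_G, D_X)) + b1)
        return torch.einsum('bi,bji->bj', h, w2.view(-1, 1, D_G)) + b2   # g(x; theta_I)

I, x = torch.randn(4, D_I), torch.randn(4, D_X)
print(EmbeddingMethod()(I, x).shape, HyperMethod()(I, x).shape)  # both torch.Size([4, 1])
```

In this sketch only g's weight count (n_theta) is tied to the primary network, so f can be widened freely without growing g; the paper's bounds compare the sizes of q and g in exactly this setting.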
Related papers
- Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations [40.77319247558742]
We study the computational complexity of learning a target function $f_*:\mathbb{R}^d\to\mathbb{R}$ with additive structure.
We prove that a large subset of $f_*$ can be efficiently learned by gradient training of a two-layer neural network.
arXiv Detail & Related papers (2024-06-17T17:59:17Z)
- Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\!\left(\langle \boldsymbol{x}, \boldsymbol{\theta}\rangle\right)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with an arbitrary link function, with a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d\,\mathrm{polylog}(d)$.
arXiv Detail & Related papers (2024-06-03T17:56:58Z)
- Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks.
For a large subclass of degree-$k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error.
This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
arXiv Detail & Related papers (2023-11-23T02:19:32Z)
- Neural Networks can Learn Representations with Gradient Descent [68.95262816363288]
In specific regimes, neural networks trained by gradient descent behave like kernel methods.
In practice, it is known that neural networks strongly outperform their associated kernels.
arXiv Detail & Related papers (2022-06-30T09:24:02Z)
- On the Provable Generalization of Recurrent Neural Networks [7.115768009778412]
We analyze the training and generalization of Recurrent Neural Networks (RNNs).
We prove a generalization error bound for learning functions without normalized conditions.
We also prove a novel result for learning $N$-variable functions of the input sequence.
arXiv Detail & Related papers (2021-09-29T02:06:33Z)
- On minimal representations of shallow ReLU networks [0.0]
We show that the minimal representation for $f$ uses either $n$, $n+1$ or $n+2$ neurons.
In particular, when the input is one-dimensional, minimal representations always use at most $n+1$ neurons, but in all higher-dimensional settings there are functions for which $n+2$ neurons are needed.
arXiv Detail & Related papers (2021-08-12T10:22:24Z)
- On the Power of Multitask Representation Learning in Linear MDP [61.58929164172968]
This paper presents analyses of the statistical benefit of multitask representation learning in linear Markov Decision Processes (MDPs).
We first discover a Least-Activated-Feature-Abundance (LAFA) criterion, denoted as $\kappa$, with which we prove that a straightforward least-squares algorithm learns a policy that is $\tilde{O}\!\left(H^2\sqrt{\frac{\kappa\,\mathcal{C}(\Phi)^2 \kappa d}{NT}+\frac{\kappa d}{n}}\right)$ suboptimal.
arXiv Detail & Related papers (2021-06-15T11:21:06Z)
- Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation [30.137884459159107]
We consider the question of learning the $Q$-function in a sample-efficient manner for reinforcement learning with continuous state and action spaces.
We develop a simple, iterative learning algorithm that finds an $\epsilon$-Schmidt $Q$-function with sample complexity of $\widetilde{O}\!\left(\frac{1}{\epsilon^{\max(d_1, d_2)+2}}\right)$ when the optimal $Q$-function has low rank $r$ and the discounting factor is below a certain threshold.
arXiv Detail & Related papers (2020-06-11T00:55:35Z)
- Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
arXiv Detail & Related papers (2020-01-13T17:28:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.