AlgebraNets
- URL: http://arxiv.org/abs/2006.07360v2
- Date: Tue, 16 Jun 2020 16:10:13 GMT
- Title: AlgebraNets
- Authors: Jordan Hoffmann, Simon Schmitt, Simon Osindero, Karen Simonyan, Erich Elsen
- Abstract summary: We study alternative algebras as number representations using the enwiki8 and WikiText-103 datasets.
We consider $\mathbb{C}$, $\mathbb{H}$, $M_2(\mathbb{R})$, $M_3(\mathbb{R})$ and $M_4(\mathbb{R})$.
Multiplication in these algebras has higher compute density than real multiplication.
- Score: 35.311476442807766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks have historically been built layerwise from the set of
functions in $\{f: \mathbb{R}^n \to \mathbb{R}^m\}$, i.e. with activations and
weights/parameters represented by real numbers, $\mathbb{R}$. Our work
considers a richer set of objects for activations and weights, and undertakes a
comprehensive study of alternative algebras as number representations by
studying their performance on two challenging problems: large-scale image
classification using the ImageNet dataset and language modeling using the
enwiki8 and WikiText-103 datasets. We denote this broader class of models as
AlgebraNets. Our findings indicate that the conclusions of prior work, which
explored neural networks constructed from $\mathbb{C}$ (complex numbers) and
$\mathbb{H}$ (quaternions) on smaller datasets, do not always transfer to these
challenging settings. However, our results demonstrate that there are
alternative algebras which deliver better parameter and computational
efficiency compared with $\mathbb{R}$. We consider $\mathbb{C}$, $\mathbb{H}$,
$M_{2}(\mathbb{R})$ (the set of $2\times2$ real-valued matrices),
$M_{2}(\mathbb{C})$, $M_{3}(\mathbb{R})$ and $M_{4}(\mathbb{R})$. Additionally,
we note that multiplication in these algebras has higher compute density than
real multiplication, a useful property in situations with inherently limited
parameter reuse such as auto-regressive inference and sparse neural networks.
We therefore investigate how to induce sparsity within AlgebraNets. We hope
that our strong results on large-scale, practical benchmarks will spur further
exploration of these unconventional architectures which challenge the default
choice of using real numbers for neural network weights and activations.
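As an illustration of the core idea, here is a minimal NumPy sketch (not from the paper; the function name and tensor shapes are illustrative) of a linear layer over $M_2(\mathbb{R})$, where every weight and activation is a $2\times2$ real matrix and scalar multiplication becomes $2\times2$ matrix multiplication:

```python
import numpy as np

def m2_linear(x, W):
    """Linear layer over M_2(R): each 'scalar' activation and weight is a
    2x2 real matrix, and scalar multiply becomes a 2x2 matrix multiply.

    x: (n_in, 2, 2)        -- n_in activations, each a 2x2 real matrix
    W: (n_out, n_in, 2, 2) -- weight "matrix" of 2x2 blocks
    returns: (n_out, 2, 2)
    """
    # Contract over n_in while performing W[o, i] @ x[i] for each entry.
    return np.einsum('oiab,ibc->oac', W, x)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 2, 2))
W = rng.standard_normal((5, 3, 2, 2))
y = m2_linear(x, W)
assert y.shape == (5, 2, 2)

# Compute density: one M_2(R) product performs 8 multiplies on 8 loaded
# reals (4 weight + 4 activation), vs 1 multiply per 2 loaded reals over R.
# That is 2x the arithmetic per value loaded, which helps when memory
# bandwidth rather than FLOPs is the bottleneck, as in auto-regressive
# inference or sparse networks.
```

The same layer structure applies to the other algebras in the paper by swapping the per-element product (e.g. quaternion multiplication for $\mathbb{H}$).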
Related papers
- Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations [40.77319247558742]
We study the computational complexity of learning a target function $f_*:\mathbb{R}^d\to\mathbb{R}$ with additive structure.
We prove that a large subset of $f_*$ can be efficiently learned by gradient training of a two-layer neural network.
arXiv Detail & Related papers (2024-06-17T17:59:17Z) - Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ of arbitrary link function with a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d$.
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks.
For a large subclass of degree-$k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error.
This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
arXiv Detail & Related papers (2023-11-23T02:19:32Z) - A Unified Scheme of ResNet and Softmax [8.556540804058203]
We provide a theoretical analysis of the regression problem: $\| \langle \exp(Ax) + Ax, \mathbf{1}_n \rangle^{-1} ( \exp(Ax) + Ax )$
This regression problem is a unified scheme that combines softmax regression and ResNet, which has never been done before.
arXiv Detail & Related papers (2023-09-23T21:41:01Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
We study a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w})=C\,\epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Training Fully Connected Neural Networks is $\exists\mathbb{R}$-Complete [4.170994234169557]
We consider the problem of finding weights and biases for a two-layer fully connected neural network to fit a given set of data points as well as possible, also known as Empirical Risk Minimization.
We prove that algebraic numbers of arbitrarily large degree are required as weights to be able to train some instances to optimality, even if all data points are rational.
A consequence of this is that a search algorithm like the one by Basu, Mianjy and Mukherjee [ICLR 2018] is impossible for networks with more than one output dimension, unless $\mathsf{NP}=\exists\mathbb{R}$.
arXiv Detail & Related papers (2022-04-04T10:28:11Z) - Deep Learning in High Dimension: Neural Network Approximation of Analytic Functions in $L^2(\mathbb{R}^d,\gamma_d)$ [0.0]
We prove expression rates for analytic functions $f:\mathbb{R}^d\to\mathbb{R}$ in the norm of $L^2(\mathbb{R}^d,\gamma_d)$.
We consider in particular ReLU and ReLU$^k$ activations for integer $k\geq 2$.
As an application, we prove expression rate bounds of deep ReLU-NNs for response surfaces of elliptic PDEs with log-Gaussian random field inputs.
arXiv Detail & Related papers (2021-11-13T09:54:32Z) - Efficient Algorithms for Learning Depth-2 Neural Networks with General ReLU Activations [27.244958998196623]
We present time and sample efficient algorithms for learning an unknown depth-2 feedforward neural network with general ReLU activations.
In particular, we consider learning an unknown network of the form $f(x) = a^{\mathsf{T}}\sigma(W^{\mathsf{T}}x+b)$, where $x$ is drawn from the Gaussian distribution, and $\sigma(t) := \max(t,0)$ is the ReLU activation.
arXiv Detail & Related papers (2021-07-21T17:06:03Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably learn the ground truth via gradient descent with polynomially many samples, beyond the Neural Tangent Kernel regime.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - Deep Polynomial Neural Networks [77.70761658507507]
$\Pi$Nets are a new class of function approximators based on polynomial expansions.
$\Pi$Nets produce state-of-the-art results in three challenging tasks, i.e. image generation, face verification and 3D mesh representation learning.
arXiv Detail & Related papers (2020-06-20T16:23:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.