Self-Attention Networks Can Process Bounded Hierarchical Languages
- URL: http://arxiv.org/abs/2105.11115v1
- Date: Mon, 24 May 2021 06:42:58 GMT
- Title: Self-Attention Networks Can Process Bounded Hierarchical Languages
- Authors: Shunyu Yao, Binghui Peng, Christos Papadimitriou, Karthik Narasimhan
- Abstract summary: We prove that self-attention networks can process $\mathsf{Dyck}_{k, D}$, the subset of $\mathsf{Dyck}_k$ with depth bounded by $D$.
Experiments show that self-attention networks trained on $\mathsf{Dyck}_{k, D}$ generalize to longer inputs with near-perfect accuracy.
- Score: 24.75432474021856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite their impressive performance in NLP, self-attention networks were
recently proved to be limited for processing formal languages with hierarchical
structure, such as $\mathsf{Dyck}_k$, the language consisting of well-nested
parentheses of $k$ types. This suggested that natural language can be
approximated well with models that are too weak for formal languages, or that
the role of hierarchy and recursion in natural language might be limited. We
qualify this implication by proving that self-attention networks can process
$\mathsf{Dyck}_{k, D}$, the subset of $\mathsf{Dyck}_{k}$ with depth bounded by
$D$, which arguably better captures the bounded hierarchical structure of
natural language. Specifically, we construct a hard-attention network with
$D+1$ layers and $O(\log k)$ memory size (per token per layer) that recognizes
$\mathsf{Dyck}_{k, D}$, and a soft-attention network with two layers and
$O(\log k)$ memory size that generates $\mathsf{Dyck}_{k, D}$. Experiments show
that self-attention networks trained on $\mathsf{Dyck}_{k, D}$ generalize to
longer inputs with near-perfect accuracy, and also verify the theoretical
memory advantage of self-attention networks over recurrent networks.
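To make the object of study concrete, below is a minimal Python sketch (not from the paper) of a stack-based recognizer and sampler for $\mathsf{Dyck}_{k, D}$. The token encoding `(i` / `)i` for bracket type $i$ is a hypothetical choice for illustration, and the stack machine merely stands in for, rather than reproduces, the paper's attention-based constructions.

```python
import random

def in_dyck_kD(tokens, k, D):
    """Return True iff `tokens` (in the '(i' / ')i' format) is well-nested
    over k bracket types with nesting depth at most D."""
    stack = []
    for tok in tokens:
        kind, idx = tok[0], int(tok[1:])
        if not 0 <= idx < k:
            return False
        if kind == "(":
            stack.append(idx)
            if len(stack) > D:          # depth bound D violated
                return False
        elif kind == ")":
            if not stack or stack.pop() != idx:
                return False            # unbalanced or mismatched bracket type
        else:
            return False
    return not stack                    # every bracket must be closed

def sample_dyck_kD(k, D, length, p_open=0.5, seed=None):
    """Sample a token sequence in Dyck_{k,D} of the given even length
    (assumes D >= 1)."""
    rng = random.Random(seed)
    out, stack = [], []
    while len(out) < length:
        remaining = length - len(out)
        # Must close if the stack is at the depth bound, or if all the
        # remaining positions are needed to close currently open brackets.
        must_close = len(stack) == D or remaining == len(stack)
        if stack and (must_close or rng.random() >= p_open):
            out.append(f"){stack.pop()}")
        else:
            t = rng.randrange(k)
            stack.append(t)
            out.append(f"({t}")
    return out

toks = sample_dyck_kD(k=3, D=4, length=20, seed=0)
assert in_dyck_kD(toks, k=3, D=4)
```

Such a generator/checker pair is the kind of harness one could use to produce training and evaluation data for the length-generalization experiments described above.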
Related papers
- Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
Formal language theory pertains specifically to recognizers.
It is common to instead use proxy tasks that are similar in only an informal sense.
We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings.
arXiv Detail & Related papers (2024-11-11T16:33:25Z) - Neural Networks Generalize on Low Complexity Data [5.678271181959529]
We show that feedforward neural networks with ReLU activation generalize on low complexity data.
We define a simple programming language that generates such data, along with a notion of description length of such networks.
We provide examples on basic computational tasks, such as checking primality of a natural number.
arXiv Detail & Related papers (2024-09-19T03:54:49Z) - Deep Neural Networks: Multi-Classification and Universal Approximation [0.0]
We demonstrate that a ReLU deep neural network with a width of $2$ and a depth of $2N+4M-1$ layers can achieve finite sample memorization for any dataset comprising $N$ elements.
We also provide depth estimates for approximating $W^{1,p}$ functions and width estimates for approximating $L^p(\Omega;\mathbb{R}^m)$ for $m\geq 1$.
arXiv Detail & Related papers (2024-09-10T14:31:21Z) - Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks.
For a large subclass of degree-$k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error.
This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
arXiv Detail & Related papers (2023-11-23T02:19:32Z) - Most Neural Networks Are Almost Learnable [52.40331776572531]
We show that for any fixed $\epsilon>0$ and depth $i$, there is a poly-time algorithm that learns random Xavier networks of depth $i$.
The algorithm runs in time and sample complexity of $(\bar{d})^{\mathrm{poly}(\epsilon^{-1})}$, where $\bar{d}$ is the size of the network.
For some cases of sigmoid and ReLU-like activations the bound can be improved to $(\bar{d})^{\mathrm{polylog}(\epsilon^{-1})}$.
arXiv Detail & Related papers (2023-05-25T22:27:42Z) - Distributed Saddle-Point Problems Under Similarity [173.19083235638104]
We show that a given suboptimality $\epsilon>0$ is achieved over master/workers networks in $\Omega(\cdot)$ rounds of communication.
We then propose algorithms matching the lower bounds for either type of network (up to log factors).
We assess the effectiveness of the proposed algorithms on a robust logistic regression problem.
arXiv Detail & Related papers (2021-07-22T14:25:16Z) - An Exponential Improvement on the Memorization Capacity of Deep
Threshold Networks [40.489350374378645]
We prove that $\widetilde{\mathcal{O}}(e^{1/\delta^2}+\sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(\frac{d}{\delta}+n)$ weights are sufficient.
We also prove new lower bounds by connecting memorization in neural networks to the purely geometric problem of separating $n$ points on a sphere using hyperplanes.
arXiv Detail & Related papers (2021-06-14T19:42:32Z) - RNNs can generate bounded hierarchical languages with optimal memory [113.73133308478612]
We show that RNNs can efficiently generate bounded hierarchical languages that reflect the scaffolding of natural language syntax.
We introduce Dyck-($k$,$m$), the language of well-nested brackets (of $k$ types) and $m$-bounded nesting depth.
We prove that an RNN with $O(m \log k)$ hidden units suffices, an exponential reduction in memory, by an explicit construction (a bit-counting sketch of this memory bound follows this list).
arXiv Detail & Related papers (2020-10-15T04:42:29Z) - Sharp Representation Theorems for ReLU Networks with Precise Dependence
on Depth [26.87238691716307]
We prove sharp, dimension-free representation results for neural networks with $D$ ReLU layers under square loss.
Our results confirm the prevailing hypothesis that deeper networks are better at representing less smooth functions.
arXiv Detail & Related papers (2020-06-07T05:25:06Z) - Backward Feature Correction: How Deep Learning Performs Deep
(Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
arXiv Detail & Related papers (2020-01-13T17:28:29Z)