Exact Learning of Permutations for Nonzero Binary Inputs with Logarithmic Training Size and Quadratic Ensemble Complexity
- URL: http://arxiv.org/abs/2502.16763v1
- Date: Mon, 24 Feb 2025 00:50:02 GMT
- Title: Exact Learning of Permutations for Nonzero Binary Inputs with Logarithmic Training Size and Quadratic Ensemble Complexity
- Authors: George Giapitzakis, Artur Back de Luca, Kimon Fountoulakis
- Abstract summary: This paper focuses on two-layer fully connected feed-forward neural networks and the task of learning permutations on nonzero binary inputs. We show that in the infinite width Neural Tangent Kernel (NTK) regime, an ensemble of such networks independently trained with gradient descent on only the $k$ standard basis vectors out of $2^k - 1$ possible inputs successfully learns any fixed permutation of length $k$ with arbitrarily high probability.
- Score: 5.3800094588915375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability of an architecture to realize permutations is quite fundamental. For example, Large Language Models need to be able to correctly copy (and perhaps rearrange) parts of the input prompt into the output. Classical universal approximation theorems guarantee the existence of parameter configurations that solve this task but offer no insights into whether gradient-based algorithms can find them. In this paper, we address this gap by focusing on two-layer fully connected feed-forward neural networks and the task of learning permutations on nonzero binary inputs. We show that in the infinite width Neural Tangent Kernel (NTK) regime, an ensemble of such networks independently trained with gradient descent on only the $k$ standard basis vectors out of $2^k - 1$ possible inputs successfully learns any fixed permutation of length $k$ with arbitrarily high probability. By analyzing the exact training dynamics, we prove that the network's output converges to a Gaussian process whose mean captures the ground truth permutation via sign-based features. We then demonstrate how averaging these runs (an "ensemble" method) and applying a simple rounding step yields an arbitrarily accurate prediction on any possible input unseen during training. Notably, the number of models needed to achieve exact learning with high probability (which we refer to as ensemble complexity) exhibits a linearithmic dependence on the input size $k$ for a single test input and a quadratic dependence when considering all test inputs simultaneously.
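As a reading aid (not part of the paper), the recipe described in the abstract can be simulated at finite width: independently train several two-layer ReLU networks with gradient descent on only the $k$ standard basis vectors, average their outputs on an unseen nonzero binary input, and round. The NumPy sketch below does exactly this; the width, learning rate, step count, and ensemble size are illustrative assumptions, and the paper's exact-learning guarantee holds only in the infinite-width NTK limit with sufficiently many models.

```python
# Minimal finite-width simulation of the ensemble-average-and-round recipe
# from the abstract (not the authors' code). Hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
k, width, n_models, steps, lr = 6, 1024, 50, 400, 0.5

perm = rng.permutation(k)
P = np.eye(k)[perm]          # permutation matrix; ground truth is y = P @ x
X = np.eye(k)                # training inputs: only the k standard basis vectors
Y = X @ P.T                  # row i is the permuted basis vector P @ e_i

def train_one(seed):
    """One wide two-layer ReLU network with NTK-style 1/sqrt(width) scaling."""
    r = np.random.default_rng(seed)
    W1 = r.standard_normal((width, k))
    W2 = r.standard_normal((k, width))
    s = 1.0 / np.sqrt(width)
    for _ in range(steps):
        H = np.maximum(X @ W1.T, 0.0)      # (k, width) hidden activations
        out = s * H @ W2.T                 # (k, k) predictions on training set
        G = (out - Y) / k                  # MSE residuals
        gW2 = s * G.T @ H                  # gradient w.r.t. output weights
        gH = s * (G @ W2) * (H > 0)        # backprop through the ReLU
        gW1 = gH.T @ X                     # gradient w.r.t. input weights
        W1 -= lr * gW1
        W2 -= lr * gW2
    return W1, W2, s

def predict(params, x):
    W1, W2, s = params
    return s * np.maximum(W1 @ x, 0.0) @ W2.T

# An unseen nonzero binary test input (not a basis vector).
x_test = np.zeros(k)
x_test[rng.choice(k, size=3, replace=False)] = 1.0

# Average independently trained models, then apply a simple rounding step.
avg = np.mean([predict(train_one(seed), x_test) for seed in range(n_models)], axis=0)
print("ensemble mean :", np.round(avg, 2))
print("rounded       :", np.round(avg).astype(int))
print("ground truth  :", (P @ x_test).astype(int))
```

Increasing n_models damps the random-initialization noise in the average, which is the finite-width analogue of the paper's linearithmic-in-$k$ ensemble complexity for a single test input.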
Related papers
- Automated Sizing and Training of Efficient Deep Autoencoders using Second Order Algorithms [0.46040036610482665]
We propose a multi-step training method for generalized linear classifiers.
Validation error is minimized by pruning unnecessary inputs.
Desired outputs are improved via a method similar to the Ho-Kashyap rule.
arXiv Detail & Related papers (2023-08-11T16:48:31Z)
- Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks [69.38572074372392]
We present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks.
Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks.
arXiv Detail & Related papers (2023-07-13T16:39:08Z)
- Towards Better Out-of-Distribution Generalization of Neural Algorithmic Reasoning Tasks [51.8723187709964]
We study the OOD generalization of neural algorithmic reasoning tasks.
The goal is to learn an algorithm from input-output pairs using deep neural networks.
arXiv Detail & Related papers (2022-11-01T18:33:20Z)
- Quantum Extremal Learning [0.8937790536664091]
We propose a quantum algorithm for "extremal learning", which is the process of finding the input to a hidden function that extremizes the function output.
The algorithm, called quantum extremal learning (QEL), consists of a parametric quantum circuit that is variationally trained to model data input-output relationships.
arXiv Detail & Related papers (2022-05-05T17:37:26Z)
- The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU-network with standard Gaussian weights and uniformly distributed biases can solve this problem with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
arXiv Detail & Related papers (2021-07-31T10:25:26Z)
- Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers [50.85524803885483]
This work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical learnability.
We study SM approximation for two function classes: circuits and Turing machines.
arXiv Detail & Related papers (2021-07-28T04:28:55Z)
- A quantum algorithm for training wide and deep classical neural networks [72.2614468437919]
We show that conditions amenable to classical trainability via gradient descent coincide with those necessary for efficiently solving quantum linear systems.
We numerically demonstrate that the MNIST image dataset satisfies such conditions.
We provide empirical evidence for $O(\log n)$ training of a convolutional neural network with pooling.
arXiv Detail & Related papers (2021-07-19T23:41:03Z)
- Alleviate Exposure Bias in Sequence Prediction with Recurrent Neural Networks [47.52214243454995]
A popular strategy to train recurrent neural networks (RNNs) is to take the ground truth as input at each time step.
We propose a fully differentiable training algorithm for RNNs to better capture long-term dependencies.
arXiv Detail & Related papers (2021-03-22T06:15:22Z)
- Neural networks behave as hash encoders: An empirical study [79.38436088982283]
The input space of a neural network with ReLU-like activations is partitioned into multiple linear regions.
We demonstrate that this partition exhibits encoding properties across a variety of deep learning models.
Simple algorithms, such as $K$-Means, $K$-NN, and logistic regression, applied to the resulting region codes can achieve fairly good performance on both training and test data.
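To make this "hash encoder" picture concrete, here is a small illustrative sketch (my reading of the summary above, not the authors' code): encode each point by the binary ReLU activation pattern of a random untrained layer, then classify with a 1-NN rule in Hamming distance on those codes. The toy two-blob dataset, layer width, and distance rule are assumptions for illustration.

```python
# Hedged illustration of classifying on ReLU activation codes (assumed setup).
import numpy as np

rng = np.random.default_rng(0)
d, width, n = 2, 64, 400

# Toy binary classification task: two Gaussian blobs.
X = np.vstack([rng.normal(-1, 0.7, (n // 2, d)), rng.normal(1, 0.7, (n // 2, d))])
y = np.repeat([0, 1], n // 2)

W = rng.standard_normal((d, width))
b = rng.standard_normal(width)
codes = (X @ W + b > 0)                    # binary activation patterns ("hash codes")

# Leave-one-out 1-NN using Hamming distance between codes.
dist = (codes[:, None, :] ^ codes[None, :, :]).sum(-1)
np.fill_diagonal(dist, dist.max() + 1)     # exclude each point from its own neighbours
pred = y[dist.argmin(1)]
print("1-NN accuracy on activation codes:", (pred == y).mean())
```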
arXiv Detail & Related papers (2021-01-14T07:50:40Z)
- A case where a spindly two-layer linear network whips any neural network with a fully connected input layer [24.132345589750592]
We show that a sparse input layer is needed to sample-efficiently learn sparse targets with gradient descent.
Surprisingly, the same type of problem can be solved drastically more efficiently by a simple two-layer linear neural network.
arXiv Detail & Related papers (2020-10-16T20:49:58Z)
- Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets? [33.51250867983687]
We exhibit a natural task on which a provable sample complexity gap can be shown for standard training algorithms.
We demonstrate a single target function, learning which on all possible distributions leads to an $O(1)$ vs. $\Omega(d^2/\varepsilon)$ gap.
Similar results are achieved for $\ell_2$ regression and adaptive training algorithms, e.g. Adam and AdaGrad.
arXiv Detail & Related papers (2020-10-16T17:15:39Z)
- Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
arXiv Detail & Related papers (2020-01-13T17:28:29Z)