FACT: the Features At Convergence Theorem for neural networks
- URL: http://arxiv.org/abs/2507.05644v1
- Date: Tue, 08 Jul 2025 03:52:48 GMT
- Title: FACT: the Features At Convergence Theorem for neural networks
- Authors: Enric Boix-Adsera, Neil Mallinar, James B. Simon, Mikhail Belkin
- Abstract summary: We prove the Features at Convergence Theorem (FACT), which gives a self-consistency equation that neural network weights satisfy at convergence when trained with nonzero weight decay. We validate this relation empirically, showing that neural features indeed satisfy the FACT at convergence. By modifying the "Recursive Feature Machines" of Radhakrishnan et al. 2024 so that they obey the FACT, we arrive at a new learning algorithm, FACT-RFM.
- Score: 11.578851413831247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A central challenge in deep learning theory is to understand how neural networks learn and represent features. To this end, we prove the Features at Convergence Theorem (FACT), which gives a self-consistency equation that neural network weights satisfy at convergence when trained with nonzero weight decay. For each weight matrix $W$, this equation relates the "feature matrix" $W^\top W$ to the set of input vectors passed into the matrix during forward propagation and the loss gradients passed through it during backpropagation. We validate this relation empirically, showing that neural features indeed satisfy the FACT at convergence. Furthermore, by modifying the "Recursive Feature Machines" of Radhakrishnan et al. 2024 so that they obey the FACT, we arrive at a new learning algorithm, FACT-RFM. FACT-RFM achieves high performance on tabular data and captures various feature learning behaviors that occur in neural network training, including grokking in modular arithmetic and phase transitions in learning sparse parities.
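The first-order condition behind this self-consistency relation can be checked numerically in a few lines. The sketch below is a hypothetical toy setup (a two-layer ReLU MLP on random regression data, with invented names and hyperparameters): at a stationary point of the weight-decay-regularized loss, $\lambda W = -\nabla_W \mathcal{L}$, hence $\lambda\, W^\top W = -W^\top \nabla_W \mathcal{L}$, relating the feature matrix to the layer's forward inputs and backpropagated gradients. The paper's precise FACT statement may differ; this only illustrates the underlying stationarity identity.

```python
import torch

torch.manual_seed(0)
n, d, h, lam = 256, 8, 16, 1e-2          # samples, input dim, hidden dim, weight decay
X, y = torch.randn(n, d), torch.randn(n, 1)

# hypothetical toy model: two-layer ReLU MLP with weight matrices W1, W2
W1 = (torch.randn(h, d) / d**0.5).requires_grad_()
W2 = (torch.randn(1, h) / h**0.5).requires_grad_()
opt = torch.optim.SGD([W1, W2], lr=0.05, weight_decay=lam)

def data_loss():
    return ((torch.relu(X @ W1.T) @ W2.T - y) ** 2).mean()

for _ in range(20_000):                   # train (approximately) to convergence
    opt.zero_grad()
    data_loss().backward()
    opt.step()

# At a stationary point of data_loss + (lam/2) * ||W||_F^2 we expect
#   lam * W ≈ -dL/dW,   hence   lam * W^T W ≈ -W^T dL/dW,
# so the feature matrix W^T W is pinned down by the layer's forward inputs and
# backpropagated gradients (dL/dW is a sum of outer products g_i x_i^T).
opt.zero_grad()
data_loss().backward()
for name, W in (("W1", W1), ("W2", W2)):
    lhs = lam * (W.T @ W)
    rhs = -(W.T @ W.grad)
    rel = ((lhs - rhs).norm() / lhs.norm()).item()
    print(f"{name}: relative deviation from the stationarity relation = {rel:.2e}")
```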
Related papers
- Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization [66.03821840425539]
In this paper, we investigate the training dynamics of $L$-layer neural networks using the tensor program (TP) framework. We show that stochastic gradient descent (SGD) enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum.
arXiv Detail & Related papers (2025-03-12T17:33:13Z)
- Fourier Circuits in Neural Networks and Transformers: A Case Study of Modular Arithmetic with Multiple Inputs [35.212818841550835]
One-hidden-layer neural networks and one-layer Transformers are studied. One-hidden-layer neural networks attain a maximum $L_{2,k+1}$-margin on a dataset. We observe similar computational mechanisms in the attention of one-layer Transformers.
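To make the Fourier mechanism concrete, the sketch below hand-builds a Fourier "circuit" for modular addition: by the angle-addition identity, products of per-input cosines and sines reproduce $\cos(2\pi k(a+b)/p)$ and $\sin(2\pi k(a+b)/p)$, and averaging over all frequencies recovers the indicator of $(a+b) \bmod p$. This is an illustrative hand-constructed circuit, not the solution learned by the networks studied in the paper.

```python
import numpy as np

p = 13
a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")   # all input pairs (a, b)
ks = np.arange(p)[:, None, None]                                # frequencies k = 0..p-1

# per-input Fourier features (one cos/sin pair per frequency and per input)
cos_a, sin_a = np.cos(2 * np.pi * ks * a / p), np.sin(2 * np.pi * ks * a / p)
cos_b, sin_b = np.cos(2 * np.pi * ks * b / p), np.sin(2 * np.pi * ks * b / p)

# angle-addition identities: features of the sum from products of per-input features
cos_sum = cos_a * cos_b - sin_a * sin_b      # = cos(2*pi*k*(a+b)/p)
sin_sum = sin_a * cos_b + cos_a * sin_b      # = sin(2*pi*k*(a+b)/p)

# averaging over frequencies yields the indicator of (a + b) ≡ c (mod p)
logits = np.empty((p, p, p))
for c in range(p):
    cos_c, sin_c = np.cos(2 * np.pi * ks * c / p), np.sin(2 * np.pi * ks * c / p)
    logits[:, :, c] = (cos_sum * cos_c + sin_sum * sin_c).mean(axis=0)

assert np.array_equal(logits.argmax(axis=-1), (a + b) % p)
print("hand-built Fourier circuit computes (a + b) mod p exactly")
```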
arXiv Detail & Related papers (2024-02-12T05:52:06Z)
- On the hardness of learning under symmetries [31.961154082757798]
We study the problem of learning equivariant neural networks via gradient descent.
Despite the inductive bias from symmetry, learning the complete classes of functions represented by equivariant neural networks via gradient descent remains hard.
arXiv Detail & Related papers (2024-01-03T18:24:18Z)
- Deep neural networks have an inbuilt Occam's razor [4.330196787806085]
We show that structured data, combined with an intrinsic Occam's-razor-like inductive bias towards (Kolmogorov) simple functions that is strong enough to counteract the exponential growth of functions with complexity, is a key to the success of DNNs.
arXiv Detail & Related papers (2023-04-13T16:58:21Z)
- Permutation Equivariant Neural Functionals [92.0667671999604]
This work studies the design of neural networks that can process the weights or gradients of other neural networks.
We focus on the permutation symmetries that arise in the weights of deep feedforward networks because hidden layer neurons have no inherent order.
In our experiments, we find that permutation equivariant neural functionals are effective on a diverse set of tasks.
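For intuition about the symmetry involved, the sketch below implements the classic permutation-equivariant linear map over neuron-indexed features (a DeepSets-style layer, used here purely as an illustrative building block; the paper's weight-space layers are richer) and checks that it commutes with permutations of the hidden neurons.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 3, 4                 # n hidden neurons, per-neuron feature dims
A = rng.normal(size=(d_in, d_out))       # applied to each neuron's own features
B = rng.normal(size=(d_in, d_out))       # applied to the mean-pooled features

def equivariant_layer(X):
    """X: (n, d_in) array, one row of features per hidden neuron."""
    return X @ A + X.mean(axis=0, keepdims=True) @ B

X = rng.normal(size=(n, d_in))
perm = rng.permutation(n)
# permuting neurons then applying the layer == applying the layer then permuting
assert np.allclose(equivariant_layer(X[perm]), equivariant_layer(X)[perm])
print("layer commutes with permutations of the hidden neurons")
```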
arXiv Detail & Related papers (2023-02-27T18:52:38Z)
- Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
Only later during training do they exploit higher-order statistics.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
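A common way to probe such a bias is to compare a model's behavior on real data against "Gaussian clones" that match only the first two moments of each class. The sketch below builds such a clone; the dataset and names are illustrative, not the paper's exact protocol.

```python
import numpy as np

def gaussian_clone(X, y, rng):
    """Resample each class from a Gaussian matching that class's mean and covariance."""
    X_clone = np.empty_like(X)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        mu, cov = X[idx].mean(axis=0), np.cov(X[idx], rowvar=False)
        X_clone[idx] = rng.multivariate_normal(mu, cov, size=len(idx))
    return X_clone

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) ** 3                 # toy non-Gaussian data
y = (X[:, 0] > 0).astype(int)
X_clone = gaussian_clone(X, y, rng)

# A model relying only on lower-order statistics should behave similarly on X and
# X_clone; here we just confirm the clone reproduces class means up to sampling noise.
for c in (0, 1):
    m = y == c
    gap = np.abs(X[m].mean(axis=0) - X_clone[m].mean(axis=0)).max()
    print(f"class {c}: max class-mean gap between data and clone = {gap:.3f}")
```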
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- Optimal Approximation with Sparse Neural Networks and Applications [0.0]
We use deep sparsely connected neural networks to measure the complexity of a function class in $L^2(\mathbb{R}^d)$.
We also introduce a representation system, a countable collection of functions used to guide neural networks.
We then analyse the complexity of a class called $\beta$ cartoon-like functions using rate-distortion theory and the wedgelet construction.
arXiv Detail & Related papers (2021-08-14T05:14:13Z)
- Connecting Weighted Automata, Tensor Networks and Recurrent Neural Networks through Spectral Learning [58.14930566993063]
We present connections between three models used in different research fields: weighted finite automata (WFA) from formal languages and linguistics, recurrent neural networks used in machine learning, and tensor networks.
We introduce the first provable learning algorithm for linear 2-RNNs defined over sequences of continuous input vectors.
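For reference, the sketch below spells out the linear 2-RNN computation: the hidden state is updated bilinearly in (previous state, current input) via a transition tensor, and an output vector reads out the final state. With one-hot inputs this reduces to a weighted finite automaton, which is the connection the paper formalizes. Shapes and variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 4, 3                              # hidden dim, input dim
alpha = rng.normal(size=k)               # initial hidden state
A = rng.normal(size=(k, d, k))           # bilinear transition tensor
omega = rng.normal(size=k)               # output (read-out) vector

def linear_2rnn(xs):
    """xs: sequence of continuous input vectors, each of shape (d,)."""
    h = alpha
    for x in xs:
        # bilinear update: h_new[j] = sum_{i,l} h[i] * A[i, l, j] * x[l]
        h = np.einsum("i,ilj,l->j", h, A, x)
    return omega @ h

# with one-hot inputs x, each step just selects a transition matrix A[:, l, :],
# recovering the weighted-automaton computation alpha^T A_{x1} ... A_{xT} omega
xs = [rng.normal(size=d) for _ in range(5)]
print("2-RNN output on a random sequence:", linear_2rnn(xs))
```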
arXiv Detail & Related papers (2020-10-19T15:28:00Z)
- Learning compositional functions via multiplicative weight updates [97.9457834009578]
We show that multiplicative weight updates satisfy a descent lemma tailored to compositional functions.
We show that Madam can train state-of-the-art neural network architectures without learning rate tuning.
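To make "multiplicative weight update" concrete, the sketch below applies a schematic rule of that flavor, $w \leftarrow w \odot \exp(-\eta\,\mathrm{sign}(w) \odot g)$, to a toy quadratic objective: to first order each coordinate still moves against its gradient, but the step size scales with $|w|$. This is an illustrative rule, not necessarily the paper's exact Madam update.

```python
import numpy as np

def multiplicative_step(w, g, lr):
    """Rescale each weight: w * exp(-lr * sign(w) * g) ≈ w - lr * |w| * g."""
    return w * np.exp(-lr * np.sign(w) * g)

rng = np.random.default_rng(0)
target = rng.normal(size=5)
w = np.sign(target) * 2.0                # multiplicative updates preserve each sign
for _ in range(2000):
    g = w - target                       # gradient of the toy loss 0.5 * ||w - target||^2
    w = multiplicative_step(w, g, lr=0.05)
print("final max |w - target|:", np.abs(w - target).max())
```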
arXiv Detail & Related papers (2020-06-25T17:05:19Z)