Emergence in non-neural models: grokking modular arithmetic via average gradient outer product
- URL: http://arxiv.org/abs/2407.20199v2
- Date: Fri, 18 Oct 2024 04:13:15 GMT
- Title: Emergence in non-neural models: grokking modular arithmetic via average gradient outer product
- Authors: Neil Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin
- Abstract summary: We show that grokking is not specific to neural networks nor to gradient descent-based optimization.
We show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines.
Our results demonstrate that emergence can result purely from learning task-relevant features.
- Score: 16.911836722312152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where the test accuracy starts improving long after the model achieves 100% training accuracy in the training process. It is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that the phenomenon of grokking is not specific to neural networks nor to gradient descent-based optimization. Specifically, we show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM results in a fast transition from random, near zero, test accuracy to perfect test accuracy. This transition cannot be predicted from the training loss, which is identically zero, nor from the test loss, which remains constant in initial iterations. Instead, as we show, the transition is completely determined by feature learning: RFM gradually learns block-circulant features to solve modular arithmetic. Paralleling the results for RFM, we show that neural networks that solve modular arithmetic also learn block-circulant features. Furthermore, we present theoretical evidence that RFM uses such block-circulant features to implement the Fourier Multiplication Algorithm, which prior work posited as the generalizing solution neural networks learn on these tasks. Our results demonstrate that emergence can result purely from learning task-relevant features and is not specific to neural architectures nor gradient descent-based optimization methods. Furthermore, our work provides more evidence for AGOP as a key mechanism for feature learning in neural networks.
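The abstract describes RFM as alternating between fitting a kernel machine and recomputing the AGOP of the fitted predictor. The sketch below illustrates one such iteration on modular addition. It is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel with a Mahalanobis-style matrix M, the concatenated one-hot encoding, the `bandwidth` and `ridge` values, and the function names `one_hot_pairs`, `gaussian_kernel`, and `rfm_step` are all hypothetical choices made for this example.

```python
import numpy as np

def one_hot_pairs(p):
    """All pairs (a, b) mod p as concatenated one-hot inputs, with one-hot labels for (a + b) mod p."""
    eye = np.eye(p)
    pairs = [(a, b) for a in range(p) for b in range(p)]
    X = np.stack([np.concatenate([eye[a], eye[b]]) for a, b in pairs])
    Y = np.stack([eye[(a + b) % p] for a, b in pairs])
    return X, Y

def gaussian_kernel(X, Z, M, bandwidth):
    """Assumed kernel: k_M(x, z) = exp(-(x - z)^T M (x - z) / (2 * bandwidth^2)), M symmetric PSD."""
    XM = X @ M
    sq = (XM * X).sum(1)[:, None] - 2 * XM @ Z.T + ((Z @ M) * Z).sum(1)[None, :]
    return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth**2))

def rfm_step(X, Y, M, bandwidth=2.0, ridge=1e-6):
    """One RFM-style iteration: fit the kernel machine, then return its AGOP on the training inputs."""
    n, d = X.shape
    K = gaussian_kernel(X, X, M, bandwidth)
    # Kernel regression coefficients; the small ridge term is an assumption for numerical stability.
    alpha = np.linalg.solve(K + ridge * np.eye(n), Y)
    agop = np.zeros((d, d))
    for i in range(n):
        # Row j holds M (x_i - x_j), using the symmetry of M.
        diffs = (X[i] - X) @ M
        # Row j holds the gradient of k_M(x, x_j) with respect to x, evaluated at x = x_i.
        grads = -(K[i][:, None] * diffs) / bandwidth**2
        # Jacobian of the predictor f(x) = sum_j alpha_j k_M(x, x_j) at x_i, shape (outputs, d).
        J = alpha.T @ grads
        agop += J.T @ J
    return alpha, agop / n
```

Starting from `M = np.eye(2 * p)` and feeding the returned AGOP back in as the next `M` gives the iterative loop the abstract refers to. In practice the matrix may need rescaling between iterations to keep the kernel well conditioned; that detail is an assumption here and is not taken from the abstract.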
Related papers
- Linearization Turns Neural Operators into Function-Valued Gaussian Processes [23.85470417458593]
We introduce a new framework for approximate Bayesian uncertainty quantification in neural operators.
Our approach can be interpreted as a probabilistic analogue of the concept of currying from functional programming.
We showcase the efficacy of our approach through applications to different types of partial differential equations.
arXiv Detail & Related papers (2024-06-07T16:43:54Z)
- Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization [12.812942188697326]
Diffusion models have emerged as a powerful tool rivaling GANs in generating high-quality samples with improved fidelity, flexibility, and robustness.
A key component of these models is to learn the score function through score matching.
Despite empirical success on various tasks, it remains unclear whether gradient-based algorithms can learn the score function with a provable accuracy.
arXiv Detail & Related papers (2024-01-28T08:13:56Z)
- Linear Recursive Feature Machines provably recover low-rank matrices [17.530511273384786]
We develop the first theoretical guarantees for how RFM performs dimensionality reduction.
We generalize the Iteratively Reweighted Least Squares (IRLS) algorithm.
Our results shed light on the connection between feature learning in neural networks and classical sparse recovery algorithms.
arXiv Detail & Related papers (2024-01-09T13:44:12Z)
- Nonlinear functional regression by functional deep neural network with kernel embedding [20.306390874610635]
We propose a functional deep neural network with an efficient and fully data-dependent dimension reduction method.
The architecture of our functional net consists of a kernel embedding step, a projection step, and a deep ReLU neural network for the prediction.
The utilization of smooth kernel embedding enables our functional net to be discretization invariant, efficient, and robust to noisy observations.
arXiv Detail & Related papers (2024-01-05T16:43:39Z)
- Diffusion-Model-Assisted Supervised Learning of Generative Models for Density Estimation [10.793646707711442]
We present a framework for training generative models for density estimation.
We use the score-based diffusion model to generate labeled data.
Once the labeled data are generated, we can train a simple fully connected neural network to learn the generative model in the supervised manner.
arXiv Detail & Related papers (2023-10-22T23:56:19Z)
- Permutation Equivariant Neural Functionals [92.0667671999604]
This work studies the design of neural networks that can process the weights or gradients of other neural networks.
We focus on the permutation symmetries that arise in the weights of deep feedforward networks because hidden layer neurons have no inherent order.
In our experiments, we find that permutation equivariant neural functionals are effective on a diverse set of tasks.
arXiv Detail & Related papers (2023-02-27T18:52:38Z)
- A Recursively Recurrent Neural Network (R2N2) Architecture for Learning Iterative Algorithms [64.3064050603721]
We generalize the Runge-Kutta neural network to a recursively recurrent neural network (R2N2) superstructure for the design of customized iterative algorithms.
We demonstrate that regular training of the weight parameters inside the proposed superstructure on input/output data of various computational problem classes yields similar iterations to Krylov solvers for linear equation systems, Newton-Krylov solvers for nonlinear equation systems, and Runge-Kutta solvers for ordinary differential equations.
arXiv Detail & Related papers (2022-11-22T16:30:33Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Inducing Gaussian Process Networks [80.40892394020797]
We propose inducing Gaussian process networks (IGN), a simple framework for simultaneously learning the feature space as well as the inducing points.
The inducing points, in particular, are learned directly in the feature space, enabling a seamless representation of complex structured domains.
We report on experimental results for real-world data sets showing that IGNs provide significant advances over state-of-the-art methods.
arXiv Detail & Related papers (2022-04-21T05:27:09Z)
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- Training Feedback Spiking Neural Networks by Implicit Differentiation on the Equilibrium State [66.2457134675891]
Spiking neural networks (SNNs) are brain-inspired models that enable energy-efficient implementation on neuromorphic hardware.
Most existing methods imitate the backpropagation framework and feedforward architectures for artificial neural networks.
We propose a novel training method that does not rely on the exact reverse of the forward computation.
arXiv Detail & Related papers (2021-09-29T07:46:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.