Controlling Grokking with Nonlinearity and Data Symmetry
- URL: http://arxiv.org/abs/2411.05353v1
- Date: Fri, 08 Nov 2024 06:19:29 GMT
- Title: Controlling Grokking with Nonlinearity and Data Symmetry
- Authors: Ahmed Salah, David Yevick
- Abstract summary: Plotting the even PCA projections of the weights of the last NN layer against their odd projections yields patterns which become significantly more uniform when the nonlinearity is increased.
A metric for the generalization ability of the network is inferred from the entropy of the layer weights, while the degree of nonlinearity is related to correlations between the local entropy of the weights of the neurons in the final layer.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper demonstrates that grokking behavior in a neural network trained on modular arithmetic with modulus P can be controlled by modifying the profile of the activation function as well as the depth and width of the model. Plotting the even PCA projections of the weights of the last NN layer against their odd projections further yields patterns which become significantly more uniform when the nonlinearity is increased by incrementing the number of layers. These patterns can be employed to factor P when P is nonprime. Finally, a metric for the generalization ability of the network is inferred from the entropy of the layer weights, while the degree of nonlinearity is related to correlations between the local entropy of the weights of the neurons in the final layer.
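The weight-space diagnostics described in the abstract are straightforward to prototype. Below is a minimal sketch, assuming only a trained model's final-layer weight matrix (random values stand in here); the even/odd split of the leading PCA projections and the entropy-based generalization metric follow the abstract's description, not any released code.

```python
import numpy as np

def pca_even_odd_projections(W, k=6):
    """Project the rows of a final-layer weight matrix W onto its leading
    k principal components, returning the even- and odd-indexed projections
    (the quantities plotted against each other in the abstract)."""
    Wc = W - W.mean(axis=0)                       # centre the weights
    _, _, Vt = np.linalg.svd(Wc, full_matrices=False)
    proj = Wc @ Vt[:k].T                          # leading-k PCA projections
    return proj[:, 0::2], proj[:, 1::2]           # even vs. odd components

def weight_entropy(W, bins=64):
    """Shannon entropy of the pooled weight distribution, used here as a
    stand-in for the paper's generalization metric (an assumption)."""
    hist, _ = np.histogram(W.ravel(), bins=bins)
    p = hist[hist > 0] / hist.sum()
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
W = rng.normal(size=(97, 128))                    # e.g. modulus P = 97 outputs
even, odd = pca_even_odd_projections(W)
print(even.shape, odd.shape, weight_entropy(W))
```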
Related papers
- TGPT-PINN: Nonlinear model reduction with transformed GPT-PINNs [1.6093211760643649]
We introduce the Transformed Generative Pre-Trained Physics-Informed Neural Networks (TGPT-PINN)
TGPT-PINN is a network-of-networks design achieving snapshot-based model reduction.
We demonstrate this new capability for nonlinear model reduction in the PINNs framework by several non-trivial partial differential equations.
arXiv Detail & Related papers (2024-03-06T04:49:18Z)
- Layered Models can "Automatically" Regularize and Discover Low-Dimensional Structures via Feature Learning [6.109362130047454]
We study a two-layer nonparametric regression model where the input undergoes a linear transformation followed by a nonlinear mapping to predict the output.
We show that the two-layer model can "automatically" induce regularization and facilitate feature learning.
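As a sketch of what such a model looks like (the exact form is an assumption; names are illustrative), the input is first linearly transformed and then passed through a nonlinear map:

```python
import numpy as np

def two_layer_model(x, A, w, nonlinearity=np.tanh):
    """Predict y from x via a learned linear transformation A followed
    by a fixed nonlinearity and a linear readout w."""
    z = A @ x                       # linear transformation of the input
    return w @ nonlinearity(z)      # nonlinear mapping to the output

rng = np.random.default_rng(1)
x = rng.normal(size=8)
A = rng.normal(size=(4, 8))         # projects onto a low-dimensional structure
w = rng.normal(size=4)
print(two_layer_model(x, A, w))
```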
arXiv Detail & Related papers (2023-10-18T06:15:35Z)
- Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z)
- Nonlinear proper orthogonal decomposition for convection-dominated flows [0.0]
We propose an end-to-end Galerkin-free model combining autoencoders with long short-term memory networks for dynamics.
Our approach not only improves the accuracy, but also significantly reduces the computational cost of training and testing.
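A hedged PyTorch sketch of this kind of autoencoder-plus-LSTM reduced-order model (layer sizes and names are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class AE_LSTM_ROM(nn.Module):
    """An autoencoder compresses flow snapshots to a latent vector and an
    LSTM advances the latent dynamics in time, replacing a Galerkin ROM."""
    def __init__(self, n_state=1024, n_latent=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_state, 64), nn.ReLU(),
                                     nn.Linear(64, n_latent))
        self.dynamics = nn.LSTM(n_latent, n_latent, batch_first=True)
        self.decoder = nn.Sequential(nn.Linear(n_latent, 64), nn.ReLU(),
                                     nn.Linear(64, n_state))

    def forward(self, snapshots):            # (batch, time, n_state)
        z = self.encoder(snapshots)          # latent trajectories
        z_next, _ = self.dynamics(z)         # predicted latent dynamics
        return self.decoder(z_next)          # reconstructed states

model = AE_LSTM_ROM()
x = torch.randn(2, 16, 1024)                 # 2 trajectories, 16 time steps
print(model(x).shape)                        # torch.Size([2, 16, 1024])
```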
arXiv Detail & Related papers (2021-10-15T18:05:34Z)
- Learning Nonlinear Waves in Plasmon-induced Transparency [0.0]
We consider a recurrent neural network (RNN) approach to predict the complex propagation of nonlinear solitons in plasmon-induced transparency metamaterial systems.
We prove the prominent agreement of results in simulation and prediction by long short-term memory (LSTM) artificial neural networks.
arXiv Detail & Related papers (2021-07-31T21:21:44Z)
- Lower Bounds on the Generalization Error of Nonlinear Learning Models [2.1030878979833467]
In this paper, we study lower bounds for the generalization error of models derived from multi-layer neural networks, in the regime where the size of the layers is commensurate with the number of training samples.
We show that unbiased estimators have unacceptable performance for such nonlinear networks in this regime.
We derive explicit generalization lower bounds for general biased estimators, in the cases of linear regression and of two-layered networks.
arXiv Detail & Related papers (2021-03-26T20:37:54Z)
- Going beyond p-convolutions to learn grayscale morphological operators [64.38361575778237]
We present two new morphological layers based on the same principle as the p-convolutional layer.
arXiv Detail & Related papers (2021-02-19T17:22:16Z)
- Non-intrusive reduced order modeling of poroelasticity of heterogeneous media based on a discontinuous Galerkin approximation [0.0]
We present a non-intrusive model reduction framework for linear poroelasticity problems in heterogeneous porous media.
We utilize the interior penalty discontinuous Galerkin (DG) method as a full order solver to handle discontinuity.
We show that our framework provides reasonable approximations of the DG solution, but it is significantly faster.
arXiv Detail & Related papers (2021-01-28T04:21:06Z)
- From deep to Shallow: Equivalent Forms of Deep Networks in Reproducing Kernel Krein Space and Indefinite Support Vector Machines [63.011641517977644]
We take a deep network and convert it to an equivalent (indefinite) kernel machine.
We then investigate the implications of this transformation for capacity control and uniform convergence.
Finally, we analyse the sparsity properties of the flat representation, showing that the flat weights are (effectively) Lp-"norm" regularised with 0 &lt; p ≤ 1.
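Concretely, the bridge penalty referred to here is R_p(w) = Σᵢ |wᵢ|^p with 0 &lt; p ≤ 1, which for p &lt; 1 is non-convex and not a true norm (hence the quotes). A tiny illustration (function name ours):

```python
import numpy as np

def lp_penalty(w, p):
    """Bridge ("Lp") regulariser: sum_i |w_i|**p; a true norm only at p = 1."""
    return np.sum(np.abs(w) ** p)

w = np.array([0.0, 0.5, -2.0])
for p in (0.25, 0.5, 1.0):
    # As p -> 0 the penalty approaches a count of nonzero weights (L0),
    # which is why small p promotes sparsity.
    print(p, lp_penalty(w, p))
```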
arXiv Detail & Related papers (2020-07-15T03:21:35Z)
- Exponentially Weighted l_2 Regularization Strategy in Constructing Reinforced Second-order Fuzzy Rule-based Model [72.57056258027336]
In the conventional Takagi-Sugeno-Kang (TSK)-type fuzzy models, constant or linear functions are usually utilized as the consequent parts of the fuzzy rules.
We introduce an exponential weight approach inspired by the weight function theory encountered in harmonic analysis.
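The snippet does not spell out the weighting scheme; one plausible reading, offered purely as a hedged sketch, is an l2 penalty whose weight grows exponentially with the order index k of each consequent coefficient:

```python
import numpy as np

def exp_weighted_l2(w, lam=1e-2, alpha=1.0):
    """Hypothetical exponentially weighted l2 penalty: the coefficient
    w[k] is penalised by a factor exp(alpha * k), so higher-order terms
    are discouraged more strongly."""
    k = np.arange(len(w))
    return lam * np.sum(np.exp(alpha * k) * w**2)

print(exp_weighted_l2(np.array([1.0, 0.5, 0.25])))
```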
arXiv Detail & Related papers (2020-07-02T15:42:15Z)
- Multipole Graph Neural Operator for Parametric Partial Differential Equations [57.90284928158383]
One of the main challenges in using deep learning-based methods for simulating physical systems is formulating physics-based data in the desired structure.
We propose a novel multi-level graph neural network framework that captures interactions at all ranges with only linear complexity.
Experiments confirm our multi-graph network learns discretization-invariant solution operators to PDEs and can be evaluated in linear time.
arXiv Detail & Related papers (2020-06-16T21:56:22Z)
- Supervised Learning for Non-Sequential Data: A Canonical Polyadic Decomposition Approach [85.12934750565971]
Efficient modelling of feature interactions underpins supervised learning for non-sequential tasks. Enumerating these interactions explicitly, however, requires a number of parameters that grows exponentially with the interaction order. To alleviate this issue, it has been proposed to implicitly represent the model parameters as a tensor.
For enhanced expressiveness, we generalize the framework to allow feature mapping to arbitrarily high-dimensional feature vectors.
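A minimal sketch of the implicit-tensor idea (dimensions and names are illustrative assumptions): a third-order interaction tensor is never materialised but stored as a rank-R canonical polyadic (CP) decomposition, shrinking d³ parameters to 3·d·R:

```python
import numpy as np

d, R = 10, 4                                    # feature dim, CP rank
rng = np.random.default_rng(2)
A, B, C = (rng.normal(size=(d, R)) for _ in range(3))

def cp_score(x):
    """Evaluate sum_{ijk} W[i,j,k] * x_i * x_j * x_k where
    W = sum_r A[:, r] (outer) B[:, r] (outer) C[:, r], without
    ever forming the d x d x d tensor W."""
    return np.sum((A.T @ x) * (B.T @ x) * (C.T @ x))

x = rng.normal(size=d)
print(cp_score(x))
```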
arXiv Detail & Related papers (2020-01-27T22:38:40Z)