Dynamical versus Bayesian Phase Transitions in a Toy Model of
Superposition
- URL: http://arxiv.org/abs/2310.06301v1
- Date: Tue, 10 Oct 2023 04:26:04 GMT
- Title: Dynamical versus Bayesian Phase Transitions in a Toy Model of
Superposition
- Authors: Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, Daniel Murfet
- Abstract summary: We investigate phase transitions in a Toy Model of Superposition (TMS) using Singular Learning Theory (SLT)
We present supporting theory indicating that the local learning coefficient determines phase transitions in the Bayesian posterior as a function of training sample size.
The picture that emerges adds evidence to the conjecture that the SGD learning trajectory is subject to a sequential learning mechanism.
- Score: 2.3249139042158853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate phase transitions in a Toy Model of Superposition (TMS) using
Singular Learning Theory (SLT). We derive a closed formula for the theoretical
loss and, in the case of two hidden dimensions, discover that regular $k$-gons
are critical points. We present supporting theory indicating that the local
learning coefficient (a geometric invariant) of these $k$-gons determines phase
transitions in the Bayesian posterior as a function of training sample size. We
then show empirically that the same $k$-gon critical points also determine the
behavior of SGD training. The picture that emerges adds evidence to the
conjecture that the SGD learning trajectory is subject to a sequential learning
mechanism. Specifically, we find that the learning process in TMS, be it
through SGD or Bayesian learning, can be characterized by a journey through
parameter space from regions of high loss and low complexity to regions of low
loss and high complexity.
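To make the setting concrete, below is a minimal sketch of the standard TMS setup assumed here: $n$ sparse features are compressed into $m=2$ hidden dimensions by a matrix $W$ and reconstructed as $\mathrm{ReLU}(W^\top W x + b)$ under squared-error loss, with plain SGD as the optimizer. The feature count, sparsity level, and optimizer settings are illustrative choices, not the paper's exact experimental configuration. The link to the Bayesian side is the standard SLT free energy expansion $F_n \approx n L_n(w^*) + \lambda \log n$, in which the local learning coefficient $\lambda$ sets the $\log n$ penalty and hence which critical point the posterior prefers at a given sample size $n$.

```python
# Minimal sketch of the Toy Model of Superposition (TMS) with two hidden
# dimensions, trained by plain SGD. The reconstruction ReLU(W^T W x + b)
# follows the standard TMS formulation; the feature count, sparsity, and
# optimizer settings are illustrative, not the paper's exact configuration.
import torch

n, m = 6, 2                                       # visible features, hidden dims
W = (0.1 * torch.randn(m, n)).requires_grad_()    # each column: a feature's point in R^2
b = torch.zeros(n, requires_grad=True)
opt = torch.optim.SGD([W, b], lr=0.05)

def sample_batch(batch_size=1024, sparsity=0.9):
    # Sparse inputs: each feature is U[0, 1] with probability 1 - sparsity, else 0.
    x = torch.rand(batch_size, n)
    mask = (torch.rand(batch_size, n) > sparsity).float()
    return x * mask

for step in range(20_000):
    x = sample_batch()
    x_hat = torch.relu(x @ W.T @ W + b)           # row-wise ReLU(W^T W x + b)
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inspect the learned 2-D embeddings of the n features; configurations near
# regular k-gons are the critical points discussed in the paper.
print(W.detach().T)
```

Tracking the loss together with the arrangement of the columns of W during training is enough to observe the staged, low-complexity-to-high-complexity progression the paper describes.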
Related papers
- Low-Dimensional Execution Manifolds in Transformer Learning Dynamics: Evidence from Modular Arithmetic Tasks [0.0]
We investigate the structure of learning dynamics in transformer models through carefully controlled arithmetic tasks. Our results suggest a unifying geometric framework for understanding transformer learning.
arXiv Detail & Related papers (2026-02-11T03:57:46Z)
- Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds [0.5729426778193398]
We study the emergence of multi-step reasoning in deep Transformer language models through a geometric and statistical-physics lens. We formalize the forward pass as a discrete coarse-graining map and relate the appearance of stable "concept basins" to fixed points of this renormalization-like dynamics. The resulting low-entropy regime is characterized by a spectral tail collapse and by the formation of transient, reusable object-like structures in representation space.
arXiv Detail & Related papers (2026-01-16T23:11:02Z)
- Phase transitions reveal hierarchical structure in deep neural networks [0.0]
We show that phase transitions in Deep Neural Networks are governed by saddle points in the loss landscape. We introduce a simple, fast, and easy to implement algorithm that uses the L2 regularizer as a tool to probe the geometry of error landscapes.
arXiv Detail & Related papers (2025-12-05T15:14:09Z)
- Learning In-context n-grams with Transformers: Sub-n-grams Are Near-stationary Points [17.339704162468042]
We investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss.
arXiv Detail & Related papers (2025-08-18T11:24:30Z)
- On Learning Latent Models with Multi-Instance Weak Supervision [57.18649648182171]
We consider a weakly supervised learning scenario where the supervision signal is generated by a transition function $\sigma$ of labels associated with multiple input instances.
Our problem is met in different fields, including latent structural learning and neuro-symbolic integration.
arXiv Detail & Related papers (2023-06-23T22:05:08Z)
- Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
arXiv Detail & Related papers (2023-06-06T09:12:49Z)
- SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
- Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit [36.17720004582283]
This work conducts such an exploration through the lens of learning $k$-sparse parities of $n$ bits.
We find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time.
arXiv Detail & Related papers (2022-07-18T17:55:05Z)
- Transfer learning of phase transitions in percolation and directed percolation [2.0342076109301583]
We apply domain adversarial neural network (DANN) based on transfer learning to studying non-equilibrium and equilibrium phase transition models.
The DANN learning of both models yields reliable results which are comparable to the ones from Monte Carlo simulations.
arXiv Detail & Related papers (2021-12-31T15:24:09Z)
- Model based Multi-agent Reinforcement Learning with Tensor Decompositions [52.575433758866936]
This paper investigates generalisation in state-action space over unexplored state-action pairs by modelling the transition and reward functions as tensors of low CP-rank.
Experiments on synthetic MDPs show that using tensor decompositions in a model-based reinforcement learning algorithm can lead to much faster convergence if the true transition and reward functions are indeed of low rank.
arXiv Detail & Related papers (2021-10-27T15:36:25Z)
- Understanding Long Range Memory Effects in Deep Neural Networks [10.616643031188248]
Stochastic gradient descent (SGD) is of fundamental importance in deep learning.
In this study, we argue that stochastic gradient noise (SGN) is neither Gaussian nor stable. Instead, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM).
arXiv Detail & Related papers (2021-05-05T13:54:26Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent [28.148743710421932]
The mini-batch stochastic gradient descent (SGD) algorithm is widely used in training machine learning models.
We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks.
arXiv Detail & Related papers (2020-04-27T20:06:11Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodically constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)