Learning Associative Memories with Gradient Descent
- URL: http://arxiv.org/abs/2402.18724v1
- Date: Wed, 28 Feb 2024 21:47:30 GMT
- Title: Learning Associative Memories with Gradient Descent
- Authors: Vivien Cabannes, Berfin Simsek, Alberto Bietti
- Abstract summary: This work focuses on the training dynamics of one associative memory module storing outer products of token embeddings.
We show that imbalance in token frequencies and memory interference due to correlated embeddings lead to oscillatory transitory regimes.
- Score: 21.182801606213495
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This work focuses on the training dynamics of one associative memory module
storing outer products of token embeddings. We reduce this problem to the study
of a system of particles, which interact according to properties of the data
distribution and correlations between embeddings. Through theory and
experiments, we provide several insights. In overparameterized regimes, we
obtain logarithmic growth of the "classification margins." Yet, we show that
imbalance in token frequencies and memory interference due to correlated
embeddings lead to oscillatory transitory regimes. The oscillations are more
pronounced with large step sizes, which can create benign loss spikes, although
these learning rates speed up the dynamics and accelerate the asymptotic
convergence. In underparameterized regimes, we illustrate how the cross-entropy
loss can lead to suboptimal memorization schemes. Finally, we assess the
validity of our findings on small Transformer models.
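To make the setup concrete, below is a minimal NumPy sketch (not the authors' code) of a single associative memory matrix W trained by gradient descent on a population cross-entropy loss, where the score of output y on input token x is u_y^T W e_x and each update is a frequency-weighted sum of outer products of the embeddings. The token counts, embedding dimension, step size, and random embeddings are illustrative assumptions, chosen so that the dimension exceeds the number of tokens (the overparameterized regime).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper): N input tokens,
# M output classes, embedding dimension d; d > N gives overparameterization.
N, M, d = 10, 5, 64
target = rng.integers(0, M, size=N)      # ground-truth association x -> y*
p = rng.dirichlet(np.ones(N))            # imbalanced token frequencies

# Fixed random embeddings e_x and unembeddings u_y (nearly orthogonal for large d).
E = rng.standard_normal((N, d)) / np.sqrt(d)
U = rng.standard_normal((M, d)) / np.sqrt(d)

W = np.zeros((d, d))   # the associative memory matrix
lr = 10.0              # a deliberately large step size, to surface oscillations

for step in range(1001):
    logits = E @ W.T @ U.T               # logits[x, y] = u_y^T W e_x, shape (N, M)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    # Population cross-entropy gradient: a p(x)-weighted sum of outer
    # products u_y e_x^T over tokens and classes.
    err = probs.copy()
    err[np.arange(N), target] -= 1.0
    W -= lr * U.T @ (p[:, None] * err).T @ E

    if step % 200 == 0:
        loss = -np.sum(p * np.log(probs[np.arange(N), target]))
        rival = np.where(np.eye(M, dtype=bool)[target], -np.inf, logits).max(axis=1)
        margin = (logits[np.arange(N), target] - rival).min()
        print(f"step {step:4d}  loss {loss:.4f}  min margin {margin:+.3f}")
```

Run with a large step size such as lr = 10.0, the printed loss typically jumps upward before settling while the minimum classification margin keeps increasing, a small-scale analogue of the benign loss spikes and slow margin growth described in the abstract.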
Related papers
- Effects of Feature Correlations on Associative Memory Capacity [1.024113475677323]
We develop an empirical framework to analyze the effects of data structure on capacity dynamics.
Experiments confirm that memory capacity scales exponentially with increasing separation in the input space.
Our findings bridge theoretical work and practical settings for DAM, and might inspire more data-centric methods.
arXiv Detail & Related papers (2025-08-02T15:03:01Z) - Higher-Order Kuramoto Oscillator Network for Dense Associative Memory [0.0]
We show that higher-order couplings achieve superlinear scaling of memory capacity with system size.
These results bridge Kuramoto synchronization with modern Hopfield memories.
arXiv Detail & Related papers (2025-07-29T16:35:52Z) - The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions [51.68215326304272]
We show that even small perturbations reliably cause otherwise identical training trajectories to diverge, an effect that diminishes rapidly over training time.
Our findings provide insights into neural network training stability, with practical implications for fine-tuning, model merging, and diversity of model ensembles.
arXiv Detail & Related papers (2025-06-16T08:35:16Z) - Extending Memorization Dynamics in Pythia Models from Instance-Level Insights [8.476099189609565]
This paper presents a detailed analysis of memorization in the Pythia model family across varying scales and training steps.
Using granular metrics, we examine how model architecture, data characteristics, and perturbations influence memorization patterns.
arXiv Detail & Related papers (2025-06-14T03:02:42Z) - Information Dynamics in Quantum Harmonic Systems: Insights from Toy Models [0.0]
This study explores quantum information dynamics using a toy model of coupled harmonic oscillators.
We examine how variations in coupling strength, detuning, and external factors, such as a magnetic field, influence information flow and computational metrics.
In the context of ion transport, we compare sudden and adiabatic protocols, quantifying their fidelity-complexity trade-off through a nonadiabaticity metric.
arXiv Detail & Related papers (2025-01-24T09:47:13Z) - A Data-Driven Framework for Discovering Fractional Differential Equations in Complex Systems [8.206685537936078]
This study introduces a stepwise data-driven framework for discovering fractional differential equations (FDEs) directly from data.
Our framework applies deep neural networks as surrogate models for denoising and reconstructing sparse and noisy observations.
We validate the framework across various datasets, including synthetic anomalous diffusion data and experimental data on the creep behavior of frozen soils.
arXiv Detail & Related papers (2024-12-05T08:38:30Z) - Controllable Relation Disentanglement for Few-Shot Class-Incremental Learning [82.79371269942146]
We propose to tackle FewShot Class-Incremental Learning (FSCIL) from a new perspective, i.e., relation disentanglement.
The challenge of disentangling spurious correlations lies in the poor controllability of FSCIL.
We propose a new simple-yet-effective method, called ConTrollable Relation-disentangled FewShot Class-Incremental Learning (CTRL-FSCIL).
arXiv Detail & Related papers (2024-03-17T03:16:59Z) - Dynamical signatures of non-Markovianity in a dissipative-driven qubit [0.0]
We investigate signatures of non-Markovianity in the dynamics of a periodically-driven qubit coupled to a bosonic environment.
Non-Markovian features are quantified by comparing on an equal footing the predictions from diverse and complementary approaches to quantum dissipation.
arXiv Detail & Related papers (2024-01-17T15:58:50Z) - Unraveling the Temporal Dynamics of the Unet in Diffusion Models [33.326244121918634]
Diffusion models introduce Gaussian noise into training data and reconstruct the original data iteratively.
Central to this iterative process is a single Unet, adapting across time steps to facilitate generation.
Recent work revealed the presence of composition and denoising phases in this generation process.
arXiv Detail & Related papers (2023-12-17T04:40:33Z) - Dissipative Dynamics of Graph-State Stabilizers with Superconducting
Qubits [0.0]
We study the noisy evolution of multipartite entangled states, focusing on superconducting-qubit devices accessible via the cloud.
We introduce an approach modeling the charge-parity splitting using an extended Markovian environment.
We show that the underlying many-body dynamics generate decays and revivals of stabilizers, which are used extensively in the context of quantum error correction.
arXiv Detail & Related papers (2023-08-03T16:30:35Z) - Loss Dynamics of Temporal Difference Reinforcement Learning [36.772501199987076]
We study the typical-case learning curves for temporal difference learning of a value function with linear function approximators.
We study how learning dynamics and plateaus depend on feature structure, learning rate, discount factor, and reward function.
arXiv Detail & Related papers (2023-07-10T18:17:50Z) - Decimation technique for open quantum systems: a case study with driven-dissipative bosonic chains [62.997667081978825]
Unavoidable coupling of quantum systems to external degrees of freedom leads to dissipative (non-unitary) dynamics.
We introduce a method to deal with these systems based on the calculation of (dissipative) lattice Green's function.
We illustrate the power of this method with several examples of driven-dissipative bosonic chains of increasing complexity.
arXiv Detail & Related papers (2022-02-15T19:00:09Z) - Continuous and time-discrete non-Markovian system-reservoir interactions: Dissipative coherent quantum feedback in Liouville space [62.997667081978825]
We investigate a quantum system simultaneously exposed to two structured reservoirs.
We employ a numerically exact quasi-2D tensor network combining both diagonal and off-diagonal system-reservoir interactions with a twofold memory for continuous and discrete retardation effects.
As a possible example, we study the non-Markovian interplay between discrete photonic feedback and structured acoustic phononic modes, resulting in emerging inter-reservoir correlations and long-lived population trapping within an initially-excited two-level system.
arXiv Detail & Related papers (2020-11-10T12:38:35Z) - Extreme Memorization via Scale of Initialization [72.78162454173803]
We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD.
We find that the extent and manner in which generalization ability is affected depends on the activation and loss function used.
In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function.
arXiv Detail & Related papers (2020-08-31T04:53:11Z) - Memory kernel and divisibility of Gaussian Collisional Models [0.0]
Memory effects in the dynamics of open systems have been the subject of significant interest in the last decades.
We analyze two types of interactions: a beam-splitter implementing a partial SWAP, and a two-mode squeezing interaction that entangles the ancillas and feeds excitations into the system.
By analyzing the memory kernel and divisibility for these two representative scenarios, our results help to shed light on the intricate mechanisms behind memory effects in the quantum domain.
arXiv Detail & Related papers (2020-08-03T10:28:55Z) - Untangling tradeoffs between recurrence and self-attention in neural networks [81.30894993852813]
We present a formal analysis of how self-attention affects gradient propagation in recurrent networks.
We prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies.
We propose a relevancy screening mechanism that allows for a scalable use of sparse self-attention with recurrence.
arXiv Detail & Related papers (2020-06-16T19:24:25Z) - Optimal Learning with Excitatory and Inhibitory synapses [91.3755431537592]
I study the problem of storing associations between analog signals in the presence of correlations.
I characterize the typical learning performance in terms of the power spectrum of random input and output processes.
arXiv Detail & Related papers (2020-05-25T18:25:54Z)