Latent State Models of Training Dynamics
- URL: http://arxiv.org/abs/2308.09543v3
- Date: Fri, 19 Jan 2024 19:45:37 GMT
- Title: Latent State Models of Training Dynamics
- Authors: Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho
- Abstract summary: We train models with different random seeds and compute a variety of metrics throughout training.
We then fit a hidden Markov model (HMM) over the resulting sequences of metrics.
We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence.
- Score: 51.88132043461152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The impact of randomness on model training is poorly understood. How do
differences in data order and initialization actually manifest in the model,
such that some training runs outperform others or converge faster? Furthermore,
how can we interpret the resulting training dynamics and the phase transitions
that characterize different trajectories? To understand the effect of
randomness on the dynamics and outcomes of neural network training, we train
models multiple times with different random seeds and compute a variety of
metrics throughout training, such as the $L_2$ norm, mean, and variance of the
neural network's weights. We then fit a hidden Markov model (HMM) over the
resulting sequences of metrics. The HMM represents training as a stochastic
process of transitions between latent states, providing an intuitive overview
of significant changes during training. Using our method, we produce a
low-dimensional, discrete representation of training dynamics on grokking
tasks, image classification, and masked language modeling. We use the HMM
representation to study phase transitions and identify latent "detour" states
that slow down convergence.
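A minimal sketch of the recipe the abstract describes, assuming the `hmmlearn` library, per-checkpoint weight snapshots, and an arbitrary choice of hidden-state count; the specific metrics, hyperparameters, and data layout below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from hmmlearn import hmm

def weight_metrics(model_params):
    """Collapse a list of weight arrays into a few scalar summaries
    (L2 norm, mean, variance), as suggested in the abstract."""
    flat = np.concatenate([p.ravel() for p in model_params])
    return np.array([np.linalg.norm(flat), flat.mean(), flat.var()])

def fit_training_hmm(runs, n_states=5):
    """runs: for each random seed, a list of parameter snapshots taken at
    fixed checkpoints during training (hypothetical data layout)."""
    sequences = [np.stack([weight_metrics(ckpt) for ckpt in run]) for run in runs]
    X = np.concatenate(sequences)              # all timesteps stacked
    lengths = [len(s) for s in sequences]      # per-run sequence lengths
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="full",
                            n_iter=200, random_state=0)
    model.fit(X, lengths)                      # EM over the metric sequences
    # Decode each run into a discrete path through the latent states.
    return model, [model.predict(s) for s in sequences]
```

The fitted transition matrix (`model.transmat_`) and the decoded state paths give the kind of low-dimensional, discrete view of training the abstract refers to; states that some seeds pass through and others skip, and that correlate with slower convergence, are natural candidates for the "detour" states the abstract mentions.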
Related papers
- Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network).
After training this network on a small base model using demonstrations, it can be seamlessly integrated with other pre-trained models during inference.
We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
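A rough sketch of the stated idea in PyTorch, with hypothetical model objects that return HuggingFace-style outputs carrying a `.logits` field: the separately trained value network contributes a logits-level correction that is simply added to a frozen pre-trained model's logits at inference. The training objective and integration details are not covered here, and this is not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def post_trained_logits(base_model, value_network, input_ids):
    """Combine a frozen pre-trained model with a separately trained value network
    by summing their next-token logits; both are assumed to share one vocabulary."""
    base_logits = base_model(input_ids).logits      # frozen pre-trained model
    delta_logits = value_network(input_ids).logits  # learned post-training change
    return base_logits + delta_logits               # decode from the combined logits
```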
arXiv Detail & Related papers (2024-10-28T13:48:43Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
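A toy version of the kind of solvable setup summarized above: fixed random ReLU features with a linear readout trained by full-batch gradient descent on a finite training set, printing how train and test loss separate as the same data is reused. Dimensions, the teacher function, and the step size are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_train, n_test = 50, 400, 200, 2000      # input dim, features, sample sizes

W = rng.normal(size=(k, d)) / np.sqrt(d)         # fixed random projection
phi = lambda X: np.maximum(X @ W.T, 0)           # random ReLU feature map

teacher = rng.normal(size=d)
def make_data(n):
    X = rng.normal(size=(n, d))
    return X, X @ teacher                        # noiseless linear teacher

(Xtr, ytr), (Xte, yte) = make_data(n_train), make_data(n_test)
Ftr, Fte = phi(Xtr), phi(Xte)

theta, lr = np.zeros(k), 1e-3
for step in range(5001):                         # repeated passes over the same data
    grad = Ftr.T @ (Ftr @ theta - ytr) / n_train # full-batch gradient of MSE
    theta -= lr * grad
    if step % 1000 == 0:
        tr = np.mean((Ftr @ theta - ytr) ** 2)
        te = np.mean((Fte @ theta - yte) ** 2)
        print(f"step {step}: train {tr:.4f}  test {te:.4f}  gap {te - tr:.4f}")
```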
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Enhancing Neural Training via a Correlated Dynamics Model [2.9302545029880394]
Correlation Mode Decomposition (CMD) is an algorithm that clusters the parameter space into groups that display synchronized behavior across epochs.
We introduce an efficient CMD variant, designed to run concurrently with training.
Our experiments indicate that CMD surpasses the state-of-the-art method for compactly modeling training dynamics on image classification.
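One simple way to realize "groups of parameters with synchronized behavior", sketched below under the assumption that scalar parameter values have been logged once per epoch: correlate trajectories and cluster them with K-means. This illustrates the clustering idea only; it is not the CMD algorithm as defined in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def correlation_modes(trajectories, n_modes=4):
    """trajectories: array of shape (n_params, n_epochs), one row per scalar parameter."""
    # Normalize each trajectory so correlation reduces to a dot product.
    centered = trajectories - trajectories.mean(axis=1, keepdims=True)
    normed = centered / (np.linalg.norm(centered, axis=1, keepdims=True) + 1e-12)
    corr = normed @ normed.T                  # pairwise correlation across epochs
    labels = KMeans(n_clusters=n_modes, n_init=10, random_state=0).fit_predict(corr)
    # Each cluster's mean trajectory acts as a compact "mode" describing its members.
    modes = np.stack([trajectories[labels == m].mean(axis=0) for m in range(n_modes)])
    return labels, modes
```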
arXiv Detail & Related papers (2023-12-20T18:22:49Z)
- Identifying Equivalent Training Dynamics [3.793387630509845]
We develop a framework for identifying conjugate and non-conjugate training dynamics.
By leveraging advances in Koopman operator theory, we demonstrate that comparing Koopman eigenvalues can correctly identify a known equivalence between online mirror descent and online gradient descent.
We then utilize our approach to: (a) identify non-conjugate training dynamics between shallow and wide fully connected neural networks; (b) characterize the early phase of training dynamics in convolutional neural networks; (c) uncover non-conjugate training dynamics in Transformers that do and do not undergo grokking.
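A compact sketch of one standard way to approximate Koopman eigenvalues from training trajectories, using dynamic mode decomposition (DMD) on successive snapshots of some observable (e.g., flattened weights); whether two runs look conjugate is then probed by comparing the resulting spectra. This is a generic DMD recipe, not necessarily the paper's exact estimator.

```python
import numpy as np

def koopman_eigenvalues(snapshots, rank=10):
    """snapshots: (n_steps, n_features) observables recorded along one training run."""
    X, Y = snapshots[:-1].T, snapshots[1:].T           # column-wise snapshot pairs
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = min(rank, int((s > 1e-10).sum()))
    U, s, Vt = U[:, :r], s[:r], Vt[:r]
    A_tilde = U.T @ Y @ Vt.T @ np.diag(1.0 / s)        # reduced linear operator
    return np.linalg.eigvals(A_tilde)                  # approximate Koopman spectrum

def spectrum_distance(run_a, run_b, rank=10):
    """Crude comparison: match sorted eigenvalues and measure their average gap."""
    ea = np.sort_complex(koopman_eigenvalues(run_a, rank))
    eb = np.sort_complex(koopman_eigenvalues(run_b, rank))
    n = min(len(ea), len(eb))
    return np.abs(ea[:n] - eb[:n]).mean()
```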
arXiv Detail & Related papers (2023-02-17T22:15:20Z)
- The Underlying Correlated Dynamics in Neural Training [6.385006149689549]
Training of neural networks is a computationally intensive task.
We propose a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality.
This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
arXiv Detail & Related papers (2022-12-18T08:34:11Z)
- Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks [62.48782506095565]
We show that due to the greedy nature of learning in deep neural networks, models tend to rely on just one modality while under-fitting the other modalities.
We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning.
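A heavily simplified sketch of the balancing idea, with hypothetical function and variable names: estimate how fast each modality's loss has been improving recently and upweight the slower, under-fitted modality when composing the total loss. The paper's algorithm defines and balances conditional learning speeds more carefully than this.

```python
import numpy as np

def modality_weights(loss_history, window=5, temperature=1.0):
    """loss_history: dict mapping modality name -> list of recent per-modality losses.
    Returns a weight per modality that upweights slower-improving (greedily
    ignored) modalities in the combined training loss."""
    speeds = {}
    for name, losses in loss_history.items():
        recent = losses[-window:]
        # Improvement over the window; near zero means the modality is stalled.
        speeds[name] = max(recent[0] - recent[-1], 0.0) / window
    names = list(speeds)
    # Softmax over negative speeds: slower modalities receive larger weights.
    logits = np.array([-speeds[n] / temperature for n in names])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return dict(zip(names, w))

# Usage each step: total_loss = sum(weights[m] * per_modality_loss[m] for m in weights)
```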
arXiv Detail & Related papers (2022-02-10T20:11:21Z)
- Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models.
We propose a simple and effective iterative training method called MIx Source and pseudo Target.
Our experiments on three generation benchmarks, including question generation, summarization, and paraphrase generation, show that the proposed framework achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z)
- Regularized Sequential Latent Variable Models with Adversarial Neural Networks [33.74611654607262]
We present different ways of using high-level latent random variables in RNNs to model the variability in sequential data.
We explore ways of using adversarial methods to train a variational RNN model.
arXiv Detail & Related papers (2021-08-10T08:05:14Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, with high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.