Revealing Language Model Trajectories via Kullback-Leibler Divergence
- URL: http://arxiv.org/abs/2505.15353v1
- Date: Wed, 21 May 2025 10:27:54 GMT
- Title: Revealing Language Model Trajectories via Kullback-Leibler Divergence
- Authors: Ryo Kishino, Yusuke Takase, Momose Oyama, Hiroaki Yamagiwa, Hidetoshi Shimodaira
- Abstract summary: We show that trajectories of language models, as measured by KL divergence, exhibit a spiral structure during pretraining and thread-like progressions across layers. In terms of diffusion exponents, model trajectories in the log-likelihood space are more constrained than those in weight space.
- Score: 2.4233709516962785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recently proposed method enables efficient estimation of the KL divergence between language models, including models with different architectures, by assigning coordinates based on log-likelihood vectors. To better understand the behavior of this metric, we systematically evaluate KL divergence across a wide range of conditions using publicly available language models. Our analysis covers comparisons between pretraining checkpoints, fine-tuned and base models, and layers via the logit lens. We find that trajectories of language models, as measured by KL divergence, exhibit a spiral structure during pretraining and thread-like progressions across layers. Furthermore, we show that, in terms of diffusion exponents, model trajectories in the log-likelihood space are more constrained than those in weight space.
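A minimal sketch of the coordinate idea, assuming Hugging Face transformers, two small public models ("gpt2" and "distilgpt2") and a tiny placeholder text set; the Euclidean distance used for the comparison below is only a rough dissimilarity proxy, not the paper's KL estimation procedure.

```python
# Minimal sketch: assign each model a "coordinate" given by its log-likelihoods
# on a shared text set, then compare models via the distance between vectors.
# Model names, the text set, and the distance-based comparison are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Language models assign probabilities to text.",
]  # placeholder for a shared, predefined evaluation set

def log_likelihood_vector(model_name, texts):
    """One coordinate per text: the total log-likelihood under the model."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    coords = []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt").input_ids
            out = model(ids, labels=ids)
            # out.loss is the mean negative log-likelihood per predicted token,
            # so multiply by the number of predicted tokens to get the total.
            coords.append(-out.loss.item() * (ids.shape[1] - 1))
    return torch.tensor(coords)

v_a = log_likelihood_vector("gpt2", texts)        # placeholder model A
v_b = log_likelihood_vector("distilgpt2", texts)  # placeholder model B

# Distance between log-likelihood coordinates as a rough dissimilarity proxy.
print(torch.dist(v_a, v_b).item())
```

Tracking such log-likelihood coordinates across pretraining checkpoints, fine-tuned variants, or logit-lens layers gives roughly the kind of trajectory the abstract describes.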
Related papers
- Self-supervised Latent Space Optimization with Nebula Variational Coding [87.20343320266215]
This paper proposes a variational inference model which leads to a clustered embedding. We introduce additional variables in the latent space, called nebula anchors, that guide the latent variables to form clusters during training. Since each latent feature can be labeled with the closest anchor, we also propose to apply metric learning in a self-supervised way to make the separation between clusters more explicit.
arXiv Detail & Related papers (2025-06-02T08:13:32Z)
- A Convergence Theory for Diffusion Language Models: An Information-Theoretic Perspective [8.15094483029656]
Diffusion models enable parallel token sampling, leading to faster generation and eliminating left-to-right generation constraints. We develop convergence guarantees for diffusion language models from an information-theoretic perspective. These results offer novel theoretical insights into the practical effectiveness of diffusion language models.
arXiv Detail & Related papers (2025-05-27T16:24:20Z)
- Better Estimation of the KL Divergence Between Language Models [58.7977683502207]
Estimating the Kullback-Leibler (KL) divergence between language models has many applications. We introduce a Rao-Blackwellized estimator that is also unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator (a toy contrast between the two estimators is sketched after this list).
arXiv Detail & Related papers (2025-04-14T18:40:02Z)
- Mapping 1,000+ Language Models via the Log-Likelihood Vector [2.5999037208435705]
We use log-likelihood vectors computed on a predefined text set as model features to compare autoregressive language models at scale. Our method is highly scalable, with computational cost growing linearly in both the number of models and text samples. Applying this method to over 1,000 language models, we constructed a "model map," providing a new perspective on large-scale model analysis.
arXiv Detail & Related papers (2025-02-22T10:23:36Z)
- Scalable Language Models with Posterior Inference of Latent Thought Vectors [52.63299874322121]
Latent-Thought Language Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space. LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space. LTMs significantly outperform conventional autoregressive models and discrete diffusion models in validation perplexity and zero-shot language modeling.
arXiv Detail & Related papers (2025-02-03T17:50:34Z)
- MGF: Mixed Gaussian Flow for Diverse Trajectory Prediction [72.70572835589158]
We propose constructing a mixed Gaussian prior for a normalizing flow model for trajectory prediction. Our method achieves state-of-the-art performance in the evaluation of both trajectory alignment and diversity on the popular UCY/ETH and SDD datasets.
arXiv Detail & Related papers (2024-02-19T15:48:55Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Hierarchical Latent Structure for Multi-Modal Vehicle Trajectory Forecasting [0.0]
We introduce a hierarchical latent structure into a VAE-based trajectory forecasting model.
Our model is capable of generating clear multi-modal trajectory distributions and outperforms the state-of-the-art (SOTA) models in terms of prediction accuracy.
arXiv Detail & Related papers (2022-07-11T04:52:28Z)
- Extracting Latent Steering Vectors from Pretrained Language Models [14.77762401765532]
We show that latent vectors can be extracted directly from language model decoders without fine-tuning.
Experiments show that there exist steering vectors, which, when added to the hidden states of the language model, generate a target sentence nearly perfectly.
We find that distances between steering vectors reflect sentence similarity when evaluated on a textual similarity benchmark.
arXiv Detail & Related papers (2022-05-10T19:04:37Z)
- Low-Rank Constraints for Fast Inference in Structured Models [110.38427965904266]
This work demonstrates a simple approach to reduce the computational and memory complexity of a large class of structured models.
Experiments with neural parameterized structured models for language modeling, polyphonic music modeling, unsupervised grammar induction, and video modeling show that our approach matches the accuracy of standard models at large state spaces.
arXiv Detail & Related papers (2022-01-08T00:47:50Z)
- Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
arXiv Detail & Related papers (2020-09-27T02:13:38Z)
- Isometric Gaussian Process Latent Variable Model for Dissimilarity Data [0.0]
We present a probabilistic model where the latent variable respects both the distances and the topology of the modeled data.
The model is inferred by variational inference based on observations of pairwise distances.
arXiv Detail & Related papers (2020-06-21T08:56:18Z)
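As a companion to the "Better Estimation of the KL Divergence Between Language Models" entry above, here is a toy, hedged sketch contrasting the standard Monte Carlo KL estimator with a Rao-Blackwellized-style variant on a two-step categorical model; the random distributions and vocabulary size are illustrative placeholders, not that paper's setup.

```python
# Toy sketch, assuming two-step autoregressive models over a 5-token vocabulary
# with randomly drawn categorical distributions; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
V = 5  # toy vocabulary size

# P and Q: a first-token distribution and a per-prefix second-token distribution.
p1, q1 = rng.dirichlet(np.ones(V)), rng.dirichlet(np.ones(V))
p2, q2 = rng.dirichlet(np.ones(V), size=V), rng.dirichlet(np.ones(V), size=V)

def kl(a, b):
    """Exact KL divergence between two categorical distributions."""
    return float(np.sum(a * (np.log(a) - np.log(b))))

# Exact sequence-level KL: step-1 KL plus the expected step-2 KL under P.
exact = kl(p1, q1) + sum(p1[x] * kl(p2[x], q2[x]) for x in range(V))

mc, rb = [], []
for _ in range(20_000):
    x1 = rng.choice(V, p=p1)
    x2 = rng.choice(V, p=p2[x1])
    # Standard Monte Carlo: log-ratio of the sampled sequence.
    mc.append(np.log(p1[x1] / q1[x1]) + np.log(p2[x1, x2] / q2[x1, x2]))
    # Rao-Blackwellized idea: replace each step's sampled log-ratio with the
    # exact per-step KL given the sampled prefix (same mean, lower variance).
    rb.append(kl(p1, q1) + kl(p2[x1], q2[x1]))

print(f"exact {exact:.4f}  MC {np.mean(mc):.4f} (sd {np.std(mc):.3f})  "
      f"RB {np.mean(rb):.4f} (sd {np.std(rb):.3f})")
```

Both estimators share the same mean (the exact KL), but the Rao-Blackwellized variant replaces sampled log-ratios with exact per-step expectations, which is why its reported standard deviation is smaller.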
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy or quality of the information presented and is not responsible for any consequences of its use.