Mapping 1,000+ Language Models via the Log-Likelihood Vector
- URL: http://arxiv.org/abs/2502.16173v1
- Date: Sat, 22 Feb 2025 10:23:36 GMT
- Title: Mapping 1,000+ Language Models via the Log-Likelihood Vector
- Authors: Momose Oyama, Hiroaki Yamagiwa, Yusuke Takase, Hidetoshi Shimodaira
- Abstract summary: We use log-likelihood vectors computed on a predefined text set as model features to compare autoregressive language models at scale. Our method is highly scalable, with computational cost growing linearly in both the number of models and text samples. Applying this method to over 1,000 language models, we constructed a "model map," providing a new perspective on large-scale model analysis.
- Score: 2.5999037208435705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To compare autoregressive language models at scale, we propose using log-likelihood vectors computed on a predefined text set as model features. This approach has a solid theoretical basis: when treated as model coordinates, their squared Euclidean distance approximates the Kullback-Leibler divergence of text-generation probabilities. Our method is highly scalable, with computational cost growing linearly in both the number of models and text samples, and is easy to implement as the required features are derived from cross-entropy loss. Applying this method to over 1,000 language models, we constructed a "model map," providing a new perspective on large-scale model analysis.
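The feature extraction is easy to sketch from the abstract: each model's feature vector collects its log-likelihoods on a fixed text set, and squared Euclidean distances between these vectors approximate KL divergences. The snippet below is a minimal illustration assuming the Hugging Face `transformers` API; the model names, toy text set, and per-text normalization are placeholders, not the authors' exact configuration.

```python
# Minimal sketch of the log-likelihood-vector idea (illustrative, not the authors' code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

texts = ["The cat sat on the mat.", "Probability is a measure."]  # predefined text set (toy)
model_names = ["gpt2", "distilgpt2"]                               # any autoregressive LMs

def log_likelihood_vector(name, texts):
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()
    feats = []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt").input_ids
            # cross-entropy loss per token; negate and scale by length to get
            # an (approximate) total log-likelihood of the text
            loss = model(ids, labels=ids).loss
            feats.append((-loss * ids.shape[1]).item())
    return torch.tensor(feats)

vectors = torch.stack([log_likelihood_vector(n, texts) for n in model_names])
# Squared Euclidean distance between feature vectors: per the abstract, this
# approximates the KL divergence between the models' text-generation distributions.
sq_dist = torch.cdist(vectors, vectors).pow(2)
print(sq_dist)
```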
Related papers
- Likelihood Variance as Text Importance for Resampling Texts to Map Language Models [2.5999037208435705]
We propose a resampling method that selects important texts with weights proportional to the variance of log-likelihoods across models for each text. Our method significantly reduces the number of required texts while preserving the accuracy of KL divergence estimates.
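A minimal sketch of the variance-weighted resampling idea on synthetic log-likelihoods; the importance weighting and estimator below are my reading of the summary, not the paper's exact procedure.

```python
# Variance-proportional text resampling (toy data; details are assumptions).
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(50, 10_000))          # log-likelihoods: 50 models x 10k texts (toy)

var_per_text = L.var(axis=0)               # variance of log-likelihoods across models
p = var_per_text / var_per_text.sum()      # sampling weights proportional to variance

k = 500                                    # number of resampled texts
idx = rng.choice(L.shape[1], size=k, replace=True, p=p)

w = 1.0 / (k * p[idx])                     # inverse-probability weights (unbiased for the full sum)
diff = L[:, None, idx] - L[None, :, idx]   # pairwise differences on the sampled texts
sq_dist_est = (w * diff**2).sum(axis=-1)   # estimate of the full squared distances
print(sq_dist_est.shape)                   # (50, 50)
```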
arXiv Detail & Related papers (2025-05-21T12:10:40Z)
- Jet Expansions of Residual Computation [25.842534423280185]
We introduce a framework for expanding residual computational graphs using jets.
Our method provides a systematic approach to disentangle contributions of different computational paths to model predictions.
arXiv Detail & Related papers (2024-10-08T13:25:08Z)
- Scalable Inference for Bayesian Multinomial Logistic-Normal Dynamic Linear Models [0.5735035463793009]
This article develops an efficient and accurate approach to posterior state estimation, called $\textit{Fenrir}$.
Our experiments suggest that Fenrir can be three orders of magnitude more efficient than Stan.
Our methods are made available to the community as a user-friendly software library written in C++ with an R interface.
arXiv Detail & Related papers (2024-10-07T23:20:14Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- MAUVE Scores for Generative Models: Theory and Practice [95.86006777961182]
We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images.
We find that MAUVE can quantify the gaps between the distributions of human-written text and those of modern neural language models.
We demonstrate in the vision domain that MAUVE can identify known properties of generated images on par with or better than existing metrics.
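The divergence-frontier construction behind MAUVE can be illustrated on pre-computed cluster histograms. The sketch below is conceptual: the scaling constant, mixture grid, and toy histograms are assumptions, and the official `mauve-text` implementation additionally handles embedding and quantization of the raw texts.

```python
# Conceptual sketch of a MAUVE-style divergence frontier (not the official implementation).
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mauve_like(p, q, c=5.0, grid=np.linspace(0.01, 0.99, 99)):
    xs, ys = [], []
    for lam in grid:
        r = lam * p + (1.0 - lam) * q            # mixture of the two distributions
        xs.append(np.exp(-c * kl(q, r)))         # "precision"-like coordinate
        ys.append(np.exp(-c * kl(p, r)))         # "recall"-like coordinate
    xs, ys = np.array(xs), np.array(ys)
    order = np.argsort(xs)
    xs, ys = xs[order], ys[order]
    # area under the divergence frontier (trapezoidal rule), reported as a single score
    return float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs)))

p = np.array([0.4, 0.3, 0.2, 0.1])               # toy "human text" cluster histogram
q = np.array([0.25, 0.25, 0.25, 0.25])           # toy "model text" cluster histogram
print(mauve_like(p, q))
```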
arXiv Detail & Related papers (2022-12-30T07:37:40Z)
- An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws [24.356906682593532]
We study the compute-optimal trade-off between model and training data set sizes for large neural networks.
Our result suggests a linear relation similar to that supported by the empirical analysis of Chinchilla.
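As a toy illustration of such a linear relation, one can allocate parameters and tokens under a fixed compute budget; the constants below (C ≈ 6ND and 20 tokens per parameter) are common rules of thumb, not values from the paper.

```python
# Toy Chinchilla-style allocation: dataset size D grows linearly with model size N.
def compute_optimal(C, tokens_per_param=20.0):
    # linear relation D = tokens_per_param * N, substituted into C = 6 * N * D
    N = (C / (6.0 * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

for C in (1e21, 1e22, 1e23):
    N, D = compute_optimal(C)
    print(f"C={C:.0e}: N~{N:.2e} params, D~{D:.2e} tokens")
```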
arXiv Detail & Related papers (2022-12-02T18:46:41Z)
- Language Model Cascades [72.18809575261498]
Repeated interactions at test-time with a single model, or the composition of multiple models together, further expands capabilities.
Cases with control flow and dynamic structure require techniques from probabilistic programming.
We formalize several existing techniques from this perspective, including scratchpads / chain of thought, verifiers, STaR, selection-inference, and tool use.
arXiv Detail & Related papers (2022-07-21T07:35:18Z)
- Low-Rank Constraints for Fast Inference in Structured Models [110.38427965904266]
This work demonstrates a simple approach to reduce the computational and memory complexity of a large class of structured models.
Experiments with neural parameterized structured models for language modeling, polyphonic music modeling, unsupervised grammar induction, and video modeling show that our approach matches the accuracy of standard models at large state spaces.
arXiv Detail & Related papers (2022-01-08T00:47:50Z)
- Transformer-based Map Matching Model with Limited Ground-Truth Data using Transfer-Learning Approach [6.510061176722248]
In many trajectory-based applications, it is necessary to map raw GPS trajectories onto road networks in digital maps.
In this paper, we consider the map-matching task from the data perspective, proposing a deep learning-based map-matching model.
We generate synthetic trajectory data to pre-train the Transformer model and then fine-tune the model with a limited number of ground-truth data.
arXiv Detail & Related papers (2021-08-01T11:51:11Z)
- Distilling Interpretable Models into Human-Readable Code [71.11328360614479]
Human-readability is an important and desirable standard for machine-learned model interpretability.
We propose to train interpretable models using conventional methods, and then distill them into concise, human-readable code.
We describe a piecewise-linear curve-fitting algorithm that produces high-quality results efficiently and reliably across a broad range of use cases.
arXiv Detail & Related papers (2021-01-21T01:46:36Z)
- Goal-directed Generation of Discrete Structures with Conditional Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short python expressions which evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)
- Learning Gaussian Graphical Models via Multiplicative Weights [54.252053139374205]
We adapt an algorithm of Klivans and Meka based on the method of multiplicative weight updates.
The algorithm enjoys a sample complexity bound that is qualitatively similar to others in the literature.
It has a low runtime of $O(mp^2)$ in the case of $m$ samples and $p$ nodes, and can trivially be implemented in an online manner.
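A rough sketch of a multiplicative-weights neighborhood estimator in this spirit; the squared-error loss, gradient clipping, learning rate, and scaling are illustrative assumptions rather than the algorithm's exact specification.

```python
# Sparsitron-style multiplicative weights for one node's neighborhood (illustrative sketch).
import numpy as np

def mw_neighborhood(X, j, eta=0.05, lam=1.0):
    """Predict column j of X from the remaining columns via multiplicative weights."""
    m, p = X.shape
    others = np.delete(np.arange(p), j)
    # duplicate features with both signs so the weights can stay non-negative
    F = np.hstack([X[:, others], -X[:, others]])
    w = np.ones(F.shape[1]) / F.shape[1]
    for t in range(m):                              # one online pass: O(m p) per node
        pred = lam * (F[t] @ w)
        grad = 2.0 * (pred - X[t, j]) * lam * F[t]  # gradient of the squared error
        w *= np.exp(-eta * np.clip(grad, -1.0, 1.0))
        w /= w.sum()
    return others, lam * (w[: len(others)] - w[len(others):])

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                       # toy data: 500 samples, 8 nodes
print(mw_neighborhood(X, 0)[1])                     # estimated coefficients for node 0
```

Running the pass over all $p$ nodes gives the $O(mp^2)$ total runtime the summary mentions.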
arXiv Detail & Related papers (2020-02-20T10:50:58Z)
- Predicting Multidimensional Data via Tensor Learning [0.0]
We develop a model that retains the intrinsic multidimensional structure of the dataset.
To estimate the model parameters, an Alternating Least Squares algorithm is developed.
The proposed model is able to outperform benchmark models present in the forecasting literature.
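The alternating-least-squares idea can be illustrated with a rank-1 fit to a toy 3-way tensor; this is a generic ALS sketch, not the paper's model or parameterization.

```python
# Rank-1 Alternating Least Squares on a toy 3-way tensor (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(size=(6, 5, 4))            # toy data tensor

a = rng.normal(size=6); b = rng.normal(size=5); c = rng.normal(size=4)
for _ in range(50):
    # each update solves a least-squares problem with the other two factors fixed
    a = np.einsum('ijk,j,k->i', T, b, c) / ((b @ b) * (c @ c))
    b = np.einsum('ijk,i,k->j', T, a, c) / ((a @ a) * (c @ c))
    c = np.einsum('ijk,i,j->k', T, a, b) / ((a @ a) * (b @ b))

approx = np.einsum('i,j,k->ijk', a, b, c)
print(np.linalg.norm(T - approx) / np.linalg.norm(T))   # relative reconstruction error
```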
arXiv Detail & Related papers (2020-02-11T11:57:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.