L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling
- URL: http://arxiv.org/abs/2503.04725v1
- Date: Thu, 06 Mar 2025 18:59:48 GMT
- Title: L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling
- Authors: Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić
- Abstract summary: We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. We formulate the Long-context Language Modeling condition, which relates a model's capacity for effective long context length modeling to the scaling of its latent state size for storing past information.
- Score: 5.283885355422517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. This scaling law, which we show is distinct from and scales independently of the conventional two-point mutual information, is the key to understanding long-context language modeling. Using this scaling law, we formulate the Long-context Language Modeling (L$^2$M) condition, which relates a model's capacity for effective long context length modeling to the scaling of its latent state size for storing past information. Our results are validated through experiments on both transformers and state space models. This work establishes a theoretical foundation that guides the development of large language models toward longer context lengths.
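As a rough formalization of the abstract's two claims (the notation below is illustrative and not taken verbatim from the paper): split a sequence into adjacent halves $X$ and $Y$ of length $L$ each; the bipartite mutual information between them is claimed to grow as a power law in $L$, and the L$^2$M condition then asks that the latent state a model uses to carry the past forward grow at least as fast.

```latex
% Bipartite MI between adjacent length-L halves X and Y of a text
% (power-law growth with some exponent beta > 0; C is a constant):
I(X; Y) \sim C \, L^{\beta}
% L^2M condition (sketch): the size m(L) of the latent state that stores
% past information must scale at least as fast as the bipartite MI:
m(L) = \Omega\!\left(L^{\beta}\right)
```

On this reading, a fixed-size recurrent state would eventually violate the condition, while architectures whose cache grows with context (e.g. a transformer's KV cache) can satisfy it.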
Related papers
- Language Models Are Implicitly Continuous [5.445513969959226]
We show that Transformer-based language models implicitly learn to represent sentences as continuous-time functions.
This phenomenon occurs in most state-of-the-art Large Language Models (LLMs), including Llama2, Llama3, Phi3, Gemma, Gemma2, and Mistral.
arXiv Detail & Related papers (2025-04-04T21:01:20Z) - Explaining Context Length Scaling and Bounds for Language Models [32.61464977485449]
We propose a theoretical framework for explaining the impact of context length on language modeling. We conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our framework provides practical insights, such as establishing that training dataset size dictates an optimal context length and bounds context length scaling in certain cases.
arXiv Detail & Related papers (2025-02-03T16:16:15Z) - Context versus Prior Knowledge in Language Models [49.17879668110546]
Language models often need to integrate prior knowledge learned during pretraining and new information presented in context.
We propose two mutual information-based metrics to measure a model's dependency on a context and on its prior about an entity.
arXiv Detail & Related papers (2024-04-06T13:46:53Z) - On the Scaling Laws of Geographical Representation in Language Models [0.11510009152620666]
We show that geographical knowledge is observable even for tiny models, and that it scales consistently as we increase the model size.
Notably, we observe that larger language models cannot mitigate the geographical bias that is inherent to the training data.
arXiv Detail & Related papers (2024-02-29T18:04:11Z) - Formal Aspects of Language Modeling [74.16212987886013]
Large language models have become one of the most commonly deployed NLP inventions.
These notes are the accompaniment to the theoretical portion of the ETH Zürich course on large language models.
arXiv Detail & Related papers (2023-11-07T20:21:42Z) - Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks, including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the abilities of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z) - Black-box language model explanation by context length probing [7.526153863886609]
We present context length probing, a novel explanation technique for causal language models.
The technique is model-agnostic and does not rely on access to model internals beyond computing token-level probabilities.
We apply context length probing to large pre-trained language models and offer some initial analyses and insights (a minimal sketch of such probing appears after this list).
arXiv Detail & Related papers (2022-12-30T16:24:10Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
arXiv Detail & Related papers (2020-09-27T02:13:38Z) - Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated (a schematic version of this formulation appears after this list).
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
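For the context length probing entry above, here is a minimal sketch of what such a probe could look like, assuming a Hugging Face causal LM; the model choice, function name, and the specific log-probability curve used here are illustrative assumptions, not the authors' exact method.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def context_length_curve(text: str, model_name: str = "gpt2"):
    """Score the final token of `text` under progressively longer left
    contexts, using only token-level probabilities (no model internals)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids[0]
    target = ids[-1]  # the token whose probability we track
    curve = []
    with torch.no_grad():
        for ctx_len in range(1, len(ids)):
            # keep only the last ctx_len tokens before the target
            window = ids[len(ids) - 1 - ctx_len : len(ids) - 1].unsqueeze(0)
            logits = model(window).logits[0, -1]
            logp = torch.log_softmax(logits, dim=-1)[target].item()
            curve.append((ctx_len, logp))
    return curve  # log P(target | last ctx_len tokens) per context length

if __name__ == "__main__":
    for n, lp in context_length_curve("The capital of France is Paris"):
        print(f"context={n:2d} tokens  log p(target)={lp:.3f}")
```

How the curve flattens (or keeps improving) as ctx_len grows indicates how much each additional token of context contributes to the prediction.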
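The hypothesis-testing framing in the last entry can be written schematically as a binary test between a human-text distribution and a model distribution. The likelihood-ratio form below is the standard Neyman-Pearson way to write such a test and is offered as an illustration, not necessarily the paper's exact statistic.

```latex
% Null: text x was written by a human; alternative: x was model-generated.
H_0 : x \sim P_{\mathrm{human}}, \qquad H_1 : x \sim P_{\mathrm{model}}
% Likelihood-ratio test, thresholded at tau (optimal when both
% distributions are known exactly, which in practice they are not):
\Lambda(x) = \frac{P_{\mathrm{model}}(x)}{P_{\mathrm{human}}(x)}
  \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \tau
```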
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.