L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling
- URL: http://arxiv.org/abs/2503.04725v1
- Date: Thu, 06 Mar 2025 18:59:48 GMT
- Title: L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling
- Authors: Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić
- Abstract summary: We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. We formulate the Long-context Language Modeling (L$^2$M) condition, which relates a model's capacity for effective long context length modeling to the scaling of its latent state size for storing past information.
- Score: 5.283885355422517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. This scaling law, which we show is distinct from and scales independently of the conventional two-point mutual information, is the key to understanding long-context language modeling. Using this scaling law, we formulate the Long-context Language Modeling (L$^2$M) condition, which relates a model's capacity for effective long context length modeling to the scaling of its latent state size for storing past information. Our results are validated through experiments on both transformers and state space models. This work establishes a theoretical foundation that guides the development of large language models toward longer context lengths.
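The capacity comparison that the L$^2$M condition formalizes can be illustrated with a toy back-of-the-envelope sketch. All dimensions and the power-law exponent below are made up for illustration, not taken from the paper: the point is only that a transformer's key-value cache grows linearly in context length while a vanilla state space model's recurrent state does not, so only one of them can keep pace with a latent-capacity requirement that grows with context.

```python
# Toy sketch (hypothetical dimensions and exponent, not from the paper):
# the L^2M condition requires the latent state used to store past
# information to scale at least as fast as the bipartite mutual
# information, assumed here to grow as c * L**beta.

def kv_cache_size(L, n_layers=32, n_heads=32, d_head=128):
    """Transformer decoding: the key-value cache grows linearly in context L."""
    return 2 * n_layers * n_heads * d_head * L  # keys + values per position

def ssm_state_size(L, n_layers=32, d_state=16, d_model=4096):
    """Vanilla state space model: the recurrent state is constant in L."""
    return n_layers * d_state * d_model

def required_capacity(L, beta=0.5, c=10_000.0):
    """Hypothetical power-law growth of bipartite mutual information."""
    return c * L ** beta

for L in (1_000, 10_000, 100_000):
    need = required_capacity(L)
    print(L, kv_cache_size(L) >= need, ssm_state_size(L) >= need)
```

Under these toy numbers the fixed-size recurrent state falls below the required capacity around L of order 10^5 while the linearly growing cache never does; the paper itself validates the condition empirically on both architecture families.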
Related papers
- Personality Prediction from Life Stories using Language Models [12.851871085845499]
In this study, we address the challenge of modeling long narrative interviews, each exceeding 2000 tokens, to predict Five-Factor Model (FFM) personality traits. We propose a two-step approach: first, we extract contextual embeddings using sliding-window fine-tuning of pretrained language models; then, we apply Recurrent Neural Networks (RNNs) with attention mechanisms to integrate long-range dependencies and enhance interpretability.
arXiv Detail & Related papers (2025-06-24T02:39:06Z) - Language Models Are Implicitly Continuous [5.445513969959226]
We show that Transformer-based language models implicitly learn to represent sentences as continuous-time functions.
This phenomenon occurs in most state-of-the-art Large Language Models (LLMs), including Llama2, Llama3, Phi3, Gemma, Gemma2, and Mistral.
arXiv Detail & Related papers (2025-04-04T21:01:20Z) - Explaining Context Length Scaling and Bounds for Language Models [32.61464977485449]
We propose a theoretical framework for explaining the impact of context length on language modeling. We conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our framework provides practical insights, such as establishing that training dataset size dictates an optimal context length and bounds context length scaling for certain cases.
arXiv Detail & Related papers (2025-02-03T16:16:15Z) - Stuffed Mamba: Oversized States Lead to the Inability to Forget [69.36377985746878]
We show that Mamba-based models struggle to effectively forget earlier tokens even with built-in forgetting mechanisms. We show that the minimum training length required for the model to learn forgetting scales linearly with the state size, and the maximum context length for accurate retrieval of a 5-digit passkey scales exponentially with the state size. Our work suggests that future RNN designs must account for the interplay between state size, training length, and forgetting mechanisms to achieve robust performance in long-context tasks.
arXiv Detail & Related papers (2024-10-09T17:54:28Z) - Context versus Prior Knowledge in Language Models [49.17879668110546]
Language models often need to integrate prior knowledge learned during pretraining and new information presented in context.
We propose two mutual information-based metrics to measure a model's dependency on a context and on its prior about an entity.
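A simple proxy for this kind of dependency measure can be sketched with a divergence between answer distributions. The distributions below are toy numbers, and KL divergence stands in for the paper's actual mutual-information-based metrics, which are not reproduced here: a large shift between the prior-only and with-context distributions indicates strong reliance on the context over the prior.

```python
import math

def kl(p, q):
    """KL divergence between two answer distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy answer distributions for a question about some entity.
prior_only = [0.7, 0.2, 0.1]    # model's belief without any context
with_context = [0.1, 0.1, 0.8]  # belief after reading the context

# A large divergence suggests the model is relying on context, not prior.
context_dependency = kl(with_context, prior_only)
```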
arXiv Detail & Related papers (2024-04-06T13:46:53Z) - On the Scaling Laws of Geographical Representation in Language Models [0.11510009152620666]
We show that geographical knowledge is observable even for tiny models, and that it scales consistently as we increase the model size.
Notably, we observe that larger language models cannot mitigate the geographical bias that is inherent to the training data.
arXiv Detail & Related papers (2024-02-29T18:04:11Z) - Formal Aspects of Language Modeling [74.16212987886013]
Large language models have become one of the most commonly deployed NLP inventions.
These notes are the accompaniment to the theoretical portion of the ETH Zürich course on large language models.
arXiv Detail & Related papers (2023-11-07T20:21:42Z) - Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z) - A Survey on Long Text Modeling with Transformers [106.50471784909212]
We provide an overview of recent advances in long text modeling based on Transformer models. We discuss how to process long input to satisfy the length limitation and how to design improved Transformer architectures. We describe four typical applications involving long text modeling and conclude with a discussion of future directions.
arXiv Detail & Related papers (2023-02-28T11:34:30Z) - Black-box language model explanation by context length probing [7.526153863886609]
We present context length probing, a novel explanation technique for causal language models.
The technique is model-agnostic and does not rely on access to model internals beyond computing token-level probabilities.
We apply context length probing to large pre-trained language models and offer some initial analyses and insights.
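The core idea can be sketched in a few lines: score a target token's log-probability under progressively longer contexts and record the marginal gain from each additional context token. The scoring function below is a dummy stand-in for a real causal language model; the actual technique needs nothing beyond token-level log-probabilities, which is what makes it model-agnostic.

```python
# Minimal sketch of context-length-probing-style scoring. `dummy_logprob`
# is a toy stand-in (a real implementation would query a causal LM).

def dummy_logprob(context, token):
    """Toy LM: more context yields higher confidence (less surprisal)."""
    return -1.0 / (1 + len(context))

def context_importance(tokens, position):
    """Marginal log-prob gain of the target from each extra context token."""
    target = tokens[position]
    scores = []
    prev = dummy_logprob((), target)
    for c in range(1, position + 1):
        cur = dummy_logprob(tuple(tokens[position - c:position]), target)
        scores.append(cur - prev)  # gain from extending context by one token
        prev = cur
    return scores
```

With the toy model, every additional context token yields a small positive gain; with a real LM, the profile of gains shows which parts of the context the prediction actually depends on.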
arXiv Detail & Related papers (2022-12-30T16:24:10Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
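The link between forget gates and memory timescales admits a one-line illustration. A unit whose forget gate sits at value f decays stored information as f^t, giving a characteristic timescale T = -1/ln(f); gates near 1 therefore supply the long timescales a power-law dependency structure requires. The gate values below are illustrative, not fitted.

```python
import math

# Toy illustration: forget-gate value f -> characteristic memory timescale.
# Stored activation decays as f**t, so it falls to 1/e after T = -1/ln(f).

def gate_timescale(f):
    """Characteristic memory timescale of a unit with forget-gate value f."""
    return -1.0 / math.log(f)

for f in (0.5, 0.9, 0.99, 0.999):
    print(f, round(gate_timescale(f), 1))
```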
arXiv Detail & Related papers (2020-09-27T02:13:38Z) - Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
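The hypothesis-testing framing can be sketched as a log-likelihood ratio test: H0 is "genuine human text", H1 is "model-generated", and the test statistic sums log-ratios of the two models' probabilities. The unigram tables below are toy stand-ins for real language-model likelihoods, and the threshold is arbitrary rather than calibrated to an error rate.

```python
import math

# Toy unigram "models": word probabilities under human (H0) vs generator (H1).
P_HUMAN = {"the": 0.05, "quick": 0.01, "fox": 0.005}
P_MODEL = {"the": 0.06, "quick": 0.02, "fox": 0.001}
FLOOR = 1e-6  # smoothing for out-of-vocabulary words

def log_likelihood_ratio(words):
    """Sum of log P_model/P_human; positive values favor H1 (generated)."""
    return sum(
        math.log(P_MODEL.get(w, FLOOR) / P_HUMAN.get(w, FLOOR))
        for w in words
    )

def classify(words, threshold=0.0):
    return "generated" if log_likelihood_ratio(words) > threshold else "genuine"
```

In the Neyman-Pearson sense this ratio is the optimal test statistic when both distributions are known, which is exactly why detection limits hinge on how closely the generator's distribution matches the human one.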
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.