Explaining Context Length Scaling and Bounds for Language Models
- URL: http://arxiv.org/abs/2502.01481v2
- Date: Sun, 09 Feb 2025 09:51:56 GMT
- Title: Explaining Context Length Scaling and Bounds for Language Models
- Authors: Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jeng-Neng Hwang, Serge Belongie, Lei Li
- Abstract summary: We propose a theoretical framework explaining the impact of context length on Language Modeling. We conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our framework provides practical insights, such as establishing that training dataset size dictates an optimal context length and bounds context length scaling for certain cases.
- Score: 32.61464977485449
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long Context Language Models have drawn great attention in the past few years. There has been work discussing the impact of long context on Language Model performance: some find that long irrelevant context can harm performance, while others experimentally summarize the loss reduction from relevant long context as Scaling Laws. This calls for a more thorough understanding of how long context impacts Language Modeling. In this work, we (1) propose a clean and effective theoretical framework explaining the impact of context length on Language Modeling, from an Intrinsic Space perspective; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework can provide practical insights, such as establishing that training dataset size dictates an optimal context length and bounds context length scaling for certain cases. We hope our work may inspire new long context Language Models, as well as future work studying Physics for Language Models. Code for our experiments is available at this URL: https://github.com/JingzheShi/NLPCtlScalingAndBounds.
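The claim that training dataset size dictates an optimal context length can be illustrated with a schematic trade-off. This is only an illustrative sketch, not the paper's derivation: the functional form and the constants $A$, $B$, $\alpha$, $\beta$, $\gamma$ are assumptions. The intuition is that longer context lowers the achievable loss, while with finite training data the extra context must be estimated, so the two effects balance at a data-dependent optimum.

```latex
% Schematic only: the functional form and constants are illustrative,
% not quantities derived in the paper.
\[
\mathcal{L}(\ell, D) \;\approx\;
\underbrace{\mathcal{L}_{\infty} + A\,\ell^{-\alpha}}_{\text{longer context lowers achievable loss}}
\;+\;
\underbrace{B\,\ell^{\gamma} D^{-\beta}}_{\text{finite data penalizes longer context}},
\qquad
\frac{\partial \mathcal{L}}{\partial \ell} = 0
\;\Rightarrow\;
\ell^{*}(D) \;\propto\; D^{\beta/(\alpha+\gamma)} .
\]
```

Under this toy form the optimal context length $\ell^{*}$ grows with training set size $D$, which is the qualitative behavior the paper's framework and bounds describe.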
Related papers
- L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling [5.283885355422517]
We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies.
We formulate the Long-context Language Modeling condition, which relates a model's capacity for effective long context length modeling to the scaling of its latent state size for storing past information.
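The two statements above can be written schematically as follows; the notation, the constant $C$, and the exponent $\beta$ are placeholders rather than the paper's exact formulation. The bipartite mutual information between the two halves of a length-$L$ sequence grows as a power law, and effective long-context modeling requires the model's latent state for past information to scale at least as fast.

```latex
% Illustrative notation only, not the paper's exact statement.
\[
I\!\left(X_{1:L/2};\, X_{L/2+1:L}\right) \;\sim\; C\, L^{\beta},
\qquad
\underbrace{\mathrm{size}(z_t)}_{\text{latent state storing the past}} \;\gtrsim\; I\!\left(X_{1:t};\, X_{t+1:T}\right).
\]
```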
arXiv Detail & Related papers (2025-03-06T18:59:48Z)
- Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning [103.65680870130839]
We investigate how to design instruction data for the post-training phase of a long context pre-trained model.
Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones.
Based on these findings, we propose context synthesis, a novel data synthesis framework.
arXiv Detail & Related papers (2025-02-21T17:02:40Z)
- In-Context Learning (and Unlearning) of Length Biases [19.740652268957522]
We show that models learn length biases in the context window for their predictions.
We further empirically analyze the factors that modulate the level of bias exhibited by the model.
This reveals the power of in-context learning in debiasing model prediction behaviors without the need for costly parameter updates.
arXiv Detail & Related papers (2025-02-10T16:43:32Z)
- Towards a theory of how the structure of language is acquired by deep neural networks [6.363756171493383]
We use a tree-like generative model that captures many of the hierarchical structures found in natural languages.
We show that token-token correlations can be used to build a representation of the grammar's hidden variables.
We conjecture that the relationship between training set size and effective range of correlations holds beyond our synthetic datasets.
arXiv Detail & Related papers (2024-05-28T17:01:22Z)
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experimental results show that it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing that large language models fall behind, are comparable to, or exceed the abilities of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- RAVEN: In-Context Learning with Retrieval-Augmented Encoder-Decoder Language Models [57.12888828853409]
RAVEN is a model that combines retrieval-augmented masked language modeling and prefix language modeling.
Fusion-in-Context Learning enables the model to leverage more in-context examples without requiring additional training.
Our work underscores the potential of retrieval-augmented encoder-decoder language models for in-context learning.
arXiv Detail & Related papers (2023-08-15T17:59:18Z)
- Lost in the Middle: How Language Models Use Long Contexts [88.78803442320246]
We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts.
We find that performance can degrade significantly when changing the position of relevant information.
Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
arXiv Detail & Related papers (2023-07-06T17:54:11Z)
- Black-box language model explanation by context length probing [7.526153863886609]
We present context length probing, a novel explanation technique for causal language models.
The technique is model-agnostic and does not rely on access to model internals beyond computing token-level probabilities.
We apply context length probing to large pre-trained language models and offer some initial analyses and insights.
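As a concrete illustration of what relying only on token-level probabilities can look like in practice, here is a minimal sketch of context length probing with a Hugging Face causal LM. The model name, prompt, and scoring loop are illustrative assumptions, not the paper's implementation: the sketch scores a target token under progressively longer contexts and records how its log-probability changes as tokens are added.

```python
# A minimal sketch of context length probing, assuming a Hugging Face causal LM.
# The model name, prompt, and scoring loop are illustrative, not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM exposing token logits works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "The capital of the country that borders both Spain and Germany is Paris"
ids = tokenizer(text, return_tensors="pt").input_ids[0]

@torch.no_grad()
def target_logprob(context_ids: torch.Tensor, target_id: int) -> float:
    """Log-probability the model assigns to target_id given only context_ids."""
    logits = model(context_ids.unsqueeze(0)).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[target_id].item()

# Score the final token under progressively longer suffixes of the context.
target_id = ids[-1].item()
for ctx_len in range(1, len(ids)):
    context = ids[len(ids) - 1 - ctx_len : len(ids) - 1]
    lp = target_logprob(context, target_id)
    print(f"context length {ctx_len:2d}: log p(target | context) = {lp:7.3f}")
```

Large jumps between consecutive context lengths point at the added tokens the model actually uses for this prediction, without touching any model internals.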
arXiv Detail & Related papers (2022-12-30T16:24:10Z)
- How Far are We from Effective Context Modeling? An Exploratory Study on Semantic Parsing in Context [59.13515950353125]
We present a grammar-based decoding semantic parser and adapt typical context modeling methods on top of it.
We evaluate 13 context modeling methods on two large cross-domain datasets, and our best model achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-02-03T11:28:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.