Scaling Context, Not Parameters: Training a Compact 7B Language Model for Efficient Long-Context Processing
- URL: http://arxiv.org/abs/2505.08651v1
- Date: Tue, 13 May 2025 15:13:15 GMT
- Title: Scaling Context, Not Parameters: Training a Compact 7B Language Model for Efficient Long-Context Processing
- Authors: Chen Wu, Yin Song
- Abstract summary: We present MegaBeam-Mistral-7B, a language model that supports a 512K-token context length. Our work addresses practical limitations in long-context training, supporting real-world tasks such as compliance monitoring and verification.
- Score: 5.093526177294803
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MegaBeam-Mistral-7B, a language model that supports 512K-token context length. Our work addresses practical limitations in long-context training, supporting real-world tasks such as compliance monitoring and verification. Evaluated on three long-context benchmarks, our 7B-parameter model demonstrates superior in-context learning performance on HELMET and robust retrieval and tracing capability on RULER. It is currently the only open model to achieve competitive long-range reasoning on BABILong at 512K context length without RAG or targeted fine-tuning. Released as fully open source under the Apache 2.0 license, the model has been downloaded over 100,000 times on Hugging Face. Model available at: https://huggingface.co/aws-prototyping/MegaBeam-Mistral-7B-512k
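Since the checkpoint linked above is openly released, the following is a minimal sketch of loading it with the Hugging Face transformers library; the dtype, device placement, and generation settings are illustrative assumptions rather than the authors' recommended configuration.

```python
# Minimal sketch: load MegaBeam-Mistral-7B-512k from Hugging Face and run a short prompt.
# The dtype, device_map, and generation settings below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aws-prototyping/MegaBeam-Mistral-7B-512k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps a 7B model within a single modern GPU
    device_map="auto",           # let accelerate place the weights automatically
)

# A real long-context workload would place a very long document here;
# a short prompt keeps the sketch self-contained.
prompt = "Summarize the compliance obligations described in the following policy:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Note that a plain generate call like this does not by itself make 512K-token prompts practical; serving at the full context length typically requires an inference stack with memory-efficient attention and KV-cache management (e.g. vLLM), since long prompts are dominated by KV-cache memory.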
Related papers
- Too Long, Didn't Model: Decomposing LLM Long-Context Understanding With Novels [3.537369004801589]
We release the Too Long, Didn't Model benchmark. It tests a model's ability to report plot summary, storyworld configuration, and elapsed narrative time. We find that none of seven tested frontier LLMs retain stable understanding beyond 64k tokens.
arXiv Detail & Related papers (2025-05-20T21:21:09Z) - LongCodeBench: Evaluating Coding LLMs at 1M Context Windows [32.93947506522558]
We identify code comprehension and repair as a natural testbed and challenge task for long-context models. We introduce LongCodeBench, a benchmark to test LLM coding abilities in long-context scenarios. We find that long-context remains a weakness for all models, with performance drops such as from 29% to 3% for Claude 3.5 Sonnet.
arXiv Detail & Related papers (2025-05-12T05:38:03Z) - From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models [54.44375226381814]
Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling. We introduce an efficient training recipe for building ultra-long-context LLMs from an aligned instruct model, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach achieves state-of-the-art performance across a diverse set of long-context benchmarks.
arXiv Detail & Related papers (2025-04-08T16:58:58Z) - Qwen2.5-1M Technical Report [72.09755998661568]
We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup.
arXiv Detail & Related papers (2025-01-26T03:47:25Z) - How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We find that code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short-context data. Our final model, ProLong-8B, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z) - ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities [53.97515452727115]
ChatQA 2 is a Llama 3.0-based model with a 128K context window. We present a training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens. We find that the performance of strong long-context LLMs using RAG improves when retrieving a larger number of chunks.
arXiv Detail & Related papers (2024-07-19T17:35:47Z) - Scaling Granite Code Models to 128K Context [37.33217431348284]
This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens.
Our solution for scaling the context length of the Granite 3B/8B code models from 2K/4K to 128K consists of lightweight continual pretraining.
We release all our long-context Granite code models under an Apache 2.0 license for both research and commercial use.
arXiv Detail & Related papers (2024-07-18T17:46:02Z) - Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training.
By decomposing the attention for long sequences into chunk-based modules, DCA effectively captures the relative positional information of tokens.
arXiv Detail & Related papers (2024-02-27T12:39:23Z) - LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models [67.58275666573496]
LongLoRA is an efficient fine-tuning approach that extends the context sizes of pre-trained large language models.
We demonstrate strong empirical results on various tasks with Llama2 models from 7B and 13B to 70B; see the LoRA sketch after this list.
arXiv Detail & Related papers (2023-09-21T17:59:11Z)
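For the LongLoRA entry above, here is a minimal sketch of attaching standard LoRA adapters to a Llama-2-style model with the Hugging Face peft library. It illustrates only the low-rank adaptation step, not the paper's shifted sparse attention or its additional training of embedding and normalization layers; the base model id, rank, and target modules are assumptions for illustration.

```python
# Minimal sketch: standard LoRA adapters via peft. This is NOT the full LongLoRA
# recipe (no shifted sparse attention, no trainable embeddings or norms).
# Hyperparameters and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; any causal LM id works for the sketch
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=8,                      # low-rank dimension of the adapter matrices
    lora_alpha=16,            # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Extending the context window itself would additionally require steps such as RoPE position scaling and training on long sequences, which this sketch does not cover.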