Scaling Granite Code Models to 128K Context
- URL: http://arxiv.org/abs/2407.13739v1
- Date: Thu, 18 Jul 2024 17:46:02 GMT
- Title: Scaling Granite Code Models to 128K Context
- Authors: Matt Stallone, Vaibhav Saxena, Leonid Karlinsky, Bridget McGinn, Tim Bula, Mayank Mishra, Adriana Meza Soria, Gaoyuan Zhang, Aditya Prasad, Yikang Shen, Saptha Surendran, Shanmukha Guttula, Hima Patel, Parameswaran Selvam, Xuan-Hong Dang, Yan Koyfman, Atin Sood, Rogerio Feris, Nirmit Desai, David D. Cox, Ruchir Puri, Rameswar Panda
- Abstract summary: This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens.
Our solution for scaling the context length of Granite 3B/8B code models from 2K/4K to 128K consists of lightweight continual pretraining.
We release all our long-context Granite code models under an Apache 2.0 license for both research and commercial use.
- Score: 37.33217431348284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling context length of Granite 3B/8B code models from 2K/4K to 128K consists of a light-weight continual pretraining by gradually increasing its RoPE base frequency with repository-level file packing and length-upsampled long-context data. Additionally, we also release instruction-tuned models with long-context support which are derived by further finetuning the long context base models on a mix of permissively licensed short and long-context instruction-response pairs. While comparing to the original short-context Granite code models, our long-context models achieve significant improvements on long-context tasks without any noticeable performance degradation on regular code completion benchmarks (e.g., HumanEval). We release all our long-context Granite code models under an Apache 2.0 license for both research and commercial use.
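The abstract mentions gradually increasing the RoPE base frequency. As a rough illustration of why a larger base helps with longer contexts, the sketch below computes the standard RoPE per-pair rotation frequencies and the wavelength of the slowest-rotating pair for a few base values. The head dimension (128) and the base values are illustrative assumptions, not figures taken from this page, and the helper names are hypothetical.

```python
import math


def rope_inv_freq(head_dim: int, base: float) -> list[float]:
    """Standard RoPE frequencies: theta_i = base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Wavelength in tokens of the slowest-rotating pair (2*pi / smallest frequency)."""
    return 2.0 * math.pi / min(rope_inv_freq(head_dim, base))


# Assumed values for illustration only; not the schedule used by the authors.
HEAD_DIM = 128
TARGET_CONTEXT = 128 * 1024  # 128K tokens
ILLUSTRATIVE_BASES = [10_000.0, 100_000.0, 1_000_000.0, 10_000_000.0]

for base in ILLUSTRATIVE_BASES:
    wavelength = longest_wavelength(HEAD_DIM, base)
    covers = wavelength >= TARGET_CONTEXT
    print(f"base={base:>12,.0f}  longest wavelength={wavelength:>14,.0f} tokens  "
          f"covers 128K: {covers}")
```

With a small base, even the slowest-rotating pair completes a full rotation well before 128K tokens, so distant positions alias; raising the base stretches the wavelengths so that positions across the whole window remain distinguishable.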
Related papers
- NExtLong: Toward Effective Long-Context Training without Long Documents [28.002824369635768]
We propose NExtLong, a novel framework for long-context data through Negative document Extension.
NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora.
Extensive experiments demonstrate that NExtLong achieves significant performance improvements compared to existing long-context synthesis approaches.
arXiv Detail & Related papers (2025-01-22T10:01:54Z)
- How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information.
ProLong-8B, which is from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z)
- ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities [53.97515452727115]
ChatQA 2 is a Llama 3.0-based model with a 128K context window.
We present a training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens.
We find that the performance of strong long-context LLMs using RAG improves when retrieving a larger number of chunks.
arXiv Detail & Related papers (2024-07-19T17:35:47Z)
- LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z)
- LongEmbed: Extending Embedding Models for Long Context Retrieval [87.60404151086715]
This paper explores context window extension of embedding models, pushing the limit to 32k without requiring additional training.
First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark.
Experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models several-fold.
arXiv Detail & Related papers (2024-04-18T11:29:23Z)
- Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training.
By decomposing the attention for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens.
arXiv Detail & Related papers (2024-02-27T12:39:23Z)
- Long-Context Language Modeling with Parallel Context Encoding [37.64884969997378]
We introduce Context Expansion with Parallel Encoding (CEPE), a framework that can be applied to any existing decoder-only LLM to extend its context window.
CEPE employs a small encoder to process long inputs chunk by chunk, enabling the frozen decoder to utilize additional contexts via cross-attention.
CEPE yields strong performance on language modeling and in-context learning.
arXiv Detail & Related papers (2024-02-26T14:47:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.