LongEmbed: Extending Embedding Models for Long Context Retrieval
- URL: http://arxiv.org/abs/2404.12096v3
- Date: Thu, 07 Nov 2024 09:29:32 GMT
- Title: LongEmbed: Extending Embedding Models for Long Context Retrieval
- Authors: Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li
- Abstract summary: This paper explores context window extension of embedding models, pushing the limit to 32k without requiring additional training.
First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark.
Experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models severalfold.
- Score: 87.60404151086715
- Abstract: Embedding models play a pivotal role in modern NLP applications such as IR and RAG. While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models remain confined to a narrow context window of no more than 8k tokens, which restricts their use in scenarios requiring long inputs such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore the huge room for improvement in these models. Building on this, comprehensive experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models severalfold, regardless of whether their original context window is 512 or beyond 4k tokens. Furthermore, for models employing absolute position encoding (APE), we show that further fine-tuning can harvest notable performance gains while strictly preserving original behavior for short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.
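To make the training-free strategies named in the abstract concrete, here is a minimal NumPy sketch of how position interpolation and NTK-aware base scaling modify RoPE's rotation angles. It assumes the standard RoPE conventions (base 10000, paired dimensions) and is not the paper's released code; the function names and the 512-to-4096 example are illustrative assumptions.

```python
# Minimal sketch (not the paper's released code) of two training-free RoPE
# context-extension strategies: position interpolation (PI) and NTK-aware
# base scaling. Assumes the standard RoPE setup with base 10000 and one
# rotation angle per (position, dimension-pair).
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angle for every (position, dimension-pair) combination."""
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)  # shape: (dim/2,)
    return np.outer(positions, inv_freq)                   # shape: (len, dim/2)

def pi_angles(positions, dim, train_len, target_len, base=10000.0):
    """Position interpolation: squeeze new positions into the trained range."""
    scale = train_len / target_len                          # < 1 when extending
    return rope_angles(positions * scale, dim, base)

def ntk_angles(positions, dim, train_len, target_len, base=10000.0):
    """NTK-aware scaling: keep positions, enlarge the rotary base instead
    (one common formulation of the trick)."""
    factor = target_len / train_len
    return rope_angles(positions, dim, base * factor ** (dim / (dim - 2)))

# Hypothetical example: extend a model trained at 512 tokens to 4096 tokens.
pos = np.arange(4096, dtype=np.float64)
print(pi_angles(pos, 64, 512, 4096).shape)   # (4096, 32)
print(ntk_angles(pos, 64, 512, 4096).shape)  # (4096, 32)
```

Both variants leave the model weights untouched, which is what makes them plug-and-play for existing embedding models.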
Related papers
- How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information.
ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z)
- ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities [53.97515452727115]
ChatQA 2 is a Llama 3.0-based model with a 128K context window.
We present a training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens.
Our results demonstrate that the Llama3-ChatQA-2-70B model outperforms most existing state-of-the-art models.
arXiv Detail & Related papers (2024-07-19T17:35:47Z)
- Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-long context windows.
We propose Loong, a novel long-context benchmark that aligns with realistic scenarios through extended multi-document question answering (QA).
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z)
- Long-Context Language Modeling with Parallel Context Encoding [37.64884969997378]
We introduce CEPE, a framework that can be applied to any existing decoder-only LLM to extend its context window.
CEPE employs a small encoder to process long inputs chunk by chunk, enabling the frozen decoder to utilize additional contexts via cross-attention.
CEPE yields strong performance on language modeling and in-context learning.
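As a rough illustration of the chunk-then-cross-attend idea summarized above, the following PyTorch sketch encodes a long input in fixed-size chunks with a small encoder and lets decoder states attend over the concatenated chunk representations. Module structure, layer counts, and the chunk length are illustrative assumptions, not the CEPE authors' implementation.

```python
# Illustrative sketch of chunked encoding plus cross-attention; sizes and
# module choices are assumptions, not CEPE's actual architecture.
import torch
import torch.nn as nn

class ChunkedContextInjector(nn.Module):
    def __init__(self, d_model=512, n_heads=8, chunk_len=256):
        super().__init__()
        self.chunk_len = chunk_len
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # small encoder
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, decoder_states, long_context):
        # Encode the long input chunk by chunk with the small encoder.
        chunks = long_context.split(self.chunk_len, dim=1)
        encoded = torch.cat([self.encoder(c) for c in chunks], dim=1)
        # The (frozen) decoder's hidden states attend over all encoded chunk tokens.
        out, _ = self.cross_attn(decoder_states, encoded, encoded)
        return decoder_states + out

# Toy usage: 32 decoder positions attending over a 1024-token context.
dec = torch.randn(1, 32, 512)
ctx = torch.randn(1, 1024, 512)
print(ChunkedContextInjector()(dec, ctx).shape)  # torch.Size([1, 32, 512])
```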
arXiv Detail & Related papers (2024-02-26T14:47:35Z)
- LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens [7.833740464264734]
Current extended context windows are limited to around 128k tokens.
LongRoPE extends the context window of pre-trained LLMs to an impressive 2048k tokens.
arXiv Detail & Related papers (2024-02-21T12:30:33Z)
- LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns.
arXiv Detail & Related papers (2024-01-31T15:33:37Z)
- CLEX: Continuous Length Extrapolation for Large Language Models [68.43814043853347]
We propose Continuous Length EXtrapolation (CLEX) for Large Language Models (LLMs).
CLEX extends the context window to over 4x or almost 8x the training length, with no deterioration in performance.
Our model, trained on a 4k sequence length, exhibits competitive performance against state-of-the-art open-source models trained on context lengths of up to 32k.
arXiv Detail & Related papers (2023-10-25T08:13:02Z)
- YaRN: Efficient Context Window Extension of Large Language Models [1.024113475677323]
Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models.
We present YaRN, a compute-efficient method to extend the context window of such models.
We show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow.
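For readers unfamiliar with RoPE itself, here is a minimal NumPy sketch of the standard rotary formulation that the extension methods above manipulate: each pair of query/key dimensions is rotated by a position-dependent angle. This shows plain RoPE only, not YaRN's frequency-dependent scaling.

```python
# Minimal sketch of standard RoPE (not YaRN): rotate consecutive
# (even, odd) dimension pairs of a query/key vector by angles that
# grow with the token position.
import numpy as np

def apply_rope(x, position, base=10000.0):
    dim = x.shape[-1]
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    theta = position * inv_freq                   # one angle per dimension pair
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin     # rotate each 2D pair
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

q = np.random.randn(64)
# The rotation preserves the vector's norm.
print(np.isclose(np.linalg.norm(apply_rope(q, position=10)), np.linalg.norm(q)))  # True
```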
arXiv Detail & Related papers (2023-08-31T18:18:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.