ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
- URL: http://arxiv.org/abs/2407.14482v1
- Date: Fri, 19 Jul 2024 17:35:47 GMT
- Title: ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
- Authors: Peng Xu, Wei Ping, Xianchao Wu, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro
- Abstract summary: ChatQA 2 is a Llama3-based model designed to bridge the gap between open-access LLMs and leading proprietary models.
We present a training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens.
We find that the state-of-the-art long-context retriever can alleviate the top-k context fragmentation issue in RAG.
- Score: 51.587657076291
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we introduce ChatQA 2, a Llama3-based model designed to bridge the gap between open-access LLMs and leading proprietary models (e.g., GPT-4-Turbo) in long-context understanding and retrieval-augmented generation (RAG) capabilities. These two capabilities are essential for LLMs to process large volumes of information that cannot fit into a single prompt and are complementary to each other, depending on the downstream tasks and computational budgets. We present a detailed continued training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens, along with a three-stage instruction tuning process to enhance the model's instruction-following, RAG performance, and long-context understanding capabilities. Our results demonstrate that the Llama3-ChatQA-2-70B model achieves accuracy comparable to GPT-4-Turbo-2024-0409 on many long-context understanding tasks and surpasses it on the RAG benchmark. Interestingly, we find that the state-of-the-art long-context retriever can alleviate the top-k context fragmentation issue in RAG, further improving RAG-based results for long-context understanding tasks. We also provide extensive comparisons between RAG and long-context solutions using state-of-the-art long-context LLMs.
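For readers unfamiliar with the term, the "top-k context fragmentation issue" mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's pipeline: the word-overlap scorer is only a stand-in for a learned retriever, and the function names (`chunk_words`, `overlap_score`, `retrieve_top_k`) are hypothetical. The sketch shows the mechanics only: a document is split into chunks, the k best-scoring chunks are kept, and the result is a set of possibly non-adjacent pieces. With small chunks those pieces are short fragments; with larger chunks each retrieved unit keeps more contiguous text.

```python
import re
from collections import Counter


def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def overlap_score(query, chunk):
    """Toy relevance score: count of shared word occurrences.

    Assumption: this stands in for a real embedding-based retriever.
    """
    q = Counter(re.findall(r"\w+", query.lower()))
    c = Counter(re.findall(r"\w+", chunk.lower()))
    return sum((q & c).values())


def retrieve_top_k(query, text, chunk_size, k):
    """Return the k highest-scoring chunks, restored to document order."""
    chunks = chunk_words(text, chunk_size)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: overlap_score(query, chunks[i]),
        reverse=True,
    )
    return [chunks[i] for i in sorted(ranked[:k])]
```

Calling `retrieve_top_k(query, document, chunk_size, k)` with a small `chunk_size` and then a larger one, while holding `chunk_size * k` roughly constant, gives a rough feel for the trade-off between many short fragments and fewer long passages.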
Related papers
- Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach [26.02167477129771]
Retrieval Augmented Generation (RAG) has been a powerful tool for Large Language Models (LLMs) to efficiently process overly lengthy contexts.
We compare RAG and long-context (LC) LLMs, aiming to leverage the strengths of both.
We propose Self-Route, a simple yet effective method that routes queries to RAG or LC based on model self-reflection.
arXiv Detail & Related papers (2024-07-23T20:51:52Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs).
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z)
- Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models [14.906150451947443]
We propose a benchmark for evaluating long-context Large Language Models (LLMs) named Counting-Stars.
We conduct experiments to evaluate long-context LLMs (i.e., GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4, and Moonshot-v1).
Results show that Gemini 1.5 Pro achieves the best overall results, while GPT-4 Turbo is the most stable across various tasks.
arXiv Detail & Related papers (2024-03-18T14:01:45Z)
- Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation [3.948068081583197]
This paper proposes a methodology that handles the out-of-domain scenario in Textbook Question Answering (TQA).
Through supervised fine-tuning of the LLM Llama-2 and the incorporation of RAG, our architecture outperforms the baseline, achieving a 4.12% accuracy improvement on the validation set and 9.84% on the test set for non-diagram multiple-choice questions.
arXiv Detail & Related papers (2024-02-05T11:58:56Z)
- LooGLE: Can Long-Context Language Models Understand Long Contexts? [50.408957515411096]
LooGLE is a benchmark for large language models' long context understanding.
It features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains.
The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings.
arXiv Detail & Related papers (2023-11-08T01:45:37Z)
- Retrieval meets Long Context Large Language Models [59.431200671427064]
Extending the context window of large language models (LLMs) has recently become popular.
Which is better for downstream tasks: retrieval augmentation or a long context window?
Can both methods be combined to get the best of both worlds?
Our best model, retrieval-augmented Llama2-70B with 32K context window, outperforms GPT-3.5-turbo-16k and Davinci003 in terms of average score on nine long context tasks.
arXiv Detail & Related papers (2023-10-04T17:59:41Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.