Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
- URL: http://arxiv.org/abs/2406.13121v1
- Date: Wed, 19 Jun 2024 00:28:58 GMT
- Title: Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
- Authors: Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, Kelvin Guu
- Abstract summary: Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
- Score: 54.667202878390526
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system. To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks. However, LCLMs still face challenges in areas like compositional reasoning that are required in SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research as context lengths grow. Overall, LOFT provides a rigorous testing ground for LCLMs, showcasing their potential to supplant existing paradigms and tackle novel tasks as model capabilities scale.
Related papers
- Retrieval or Global Context Understanding? On Many-Shot In-Context Learning for Long-Context Evaluation [10.500629810624769]
We study long-context language model evaluation through many-shot in-context learning (ICL).
We identify the skills each ICL task requires, and examine models' long-context capabilities on them.
We introduce a new many-shot ICL benchmark, MANYICLBENCH, designed to characterize LCLMs' retrieval and global context understanding capabilities separately.
arXiv Detail & Related papers (2024-11-11T17:00:59Z)
- Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data [6.195658947075431]
We introduce HoloBench, a framework that brings database reasoning operations into text-based contexts.
We show that the amount of information in the context has a bigger influence on LCLM performance than the context length.
We find that tasks requiring the aggregation of multiple pieces of information show a noticeable drop in accuracy as context length increases.
arXiv Detail & Related papers (2024-10-15T19:04:13Z)
- Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into the Internet of Things (IoT) applications has drawn much research attention recently.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
In this study, we propose an LLM-based Generative IoT (GIoT) system deployed in a local network setting.
arXiv Detail & Related papers (2024-06-14T19:24:00Z)
- A Reality check of the benefits of LLM in business [1.9181612035055007]
Large language models (LLMs) have achieved remarkable performance in language understanding and generation tasks.
This paper thoroughly examines the usefulness and readiness of LLMs for business processes.
arXiv Detail & Related papers (2024-06-09T02:36:00Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs [62.496139001509114]
We introduce a novel tool invocation pipeline designed to control massive real-world APIs.
This pipeline mirrors the human task-solving process, addressing complicated real-life user queries.
Empirical evaluations of our Sum2Act pipeline on the ToolBench benchmark show significant performance improvements.
arXiv Detail & Related papers (2024-02-28T08:42:23Z)
- LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models [12.708117108874083]
Large Language Models (LLMs) generate code snippets given natural language intents in zero-shot, i.e., without the need for specific fine-tuning.
Previous research explored In-Context Learning (ICL) as a strategy to guide the LLM generative process with task-specific prompt examples.
In this paper, we deliver a comprehensive study of PEFT techniques for LLMs under the automated code generation scenario.
arXiv Detail & Related papers (2023-08-21T04:31:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.