Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
- URL: http://arxiv.org/abs/2407.01370v1
- Date: Mon, 1 Jul 2024 15:23:42 GMT
- Title: Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
- Authors: Philippe Laban, Alexander R. Fabbri, Caiming Xiong, Chien-Sheng Wu
- Abstract summary: We design a procedure to synthesize Haystacks of documents, ensuring that specific *insights* repeat across documents.
The "Summary of a Haystack" (SummHay) task then requires a system to process the Haystack and generate, given a query, a summary that identifies the relevant insights and precisely cites the source documents.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLMs and RAG systems are now capable of handling millions of input tokens or more. However, evaluating the output quality of such systems on long-context tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity. In this work, we argue that summarization can play a central role in such evaluation. We design a procedure to synthesize Haystacks of documents, ensuring that specific *insights* repeat across documents. The "Summary of a Haystack" (SummHay) task then requires a system to process the Haystack and generate, given a query, a summary that identifies the relevant insights and precisely cites the source documents. Since we have precise knowledge of what insights should appear in a haystack summary and what documents should be cited, we implement a highly reproducible automatic evaluation that can score summaries on two aspects - Coverage and Citation. We generate Haystacks in two domains (conversation, news), and perform a large-scale evaluation of 10 LLMs and corresponding 50 RAG systems. Our findings indicate that SummHay is an open challenge for current systems, as even systems provided with an Oracle signal of document relevance lag our estimate of human performance (56%) by 10+ points on a Joint Score. Without a retriever, long-context LLMs like GPT-4o and Claude 3 Opus score below 20% on SummHay. We show SummHay can also be used to study enterprise RAG systems and position bias in long-context models. We hope future systems can equal and surpass human performance on SummHay.
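To make the evaluation concrete, the following is a minimal sketch of how Coverage, Citation, and a Joint Score could be computed from reference insights (each tied to the set of documents that mention it) and a system summary whose bullets cite documents. The `covers` judge, the F1-based citation score, and combining the two as a product are illustrative assumptions; the paper performs the insight matching with an LLM-based evaluator.

```python
# A minimal, hypothetical sketch of SummHay-style scoring. The judgment of
# whether a summary bullet covers a reference insight is made by an LLM in
# the paper; here it is abstracted as the `covers` callable.
from typing import Callable

def summhay_scores(
    reference: list[dict],               # [{"insight": str, "docs": set[str]}, ...]
    summary: list[dict],                 # [{"bullet": str, "cited": set[str]}, ...]
    covers: Callable[[str, str], bool],  # does this bullet cover this insight?
) -> dict:
    covered, citation_f1s = 0, []
    for ref in reference:
        match = next((b for b in summary if covers(b["bullet"], ref["insight"])), None)
        if match is None:
            continue  # insight missed entirely: hurts Coverage
        covered += 1
        # Citation quality for this insight: F1 between cited and gold documents.
        tp = len(match["cited"] & ref["docs"])
        prec = tp / len(match["cited"]) if match["cited"] else 0.0
        rec = tp / len(ref["docs"]) if ref["docs"] else 0.0
        citation_f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    coverage = covered / len(reference) if reference else 0.0
    citation = sum(citation_f1s) / len(citation_f1s) if citation_f1s else 0.0
    return {"coverage": coverage, "citation": citation,
            "joint": coverage * citation}  # one plausible way to combine the two
```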
Related papers
- DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems
We introduce DocBench, a benchmark to evaluate large language model (LLM)-based document reading systems.
Our benchmark involves the recruitment of human annotators and the generation of synthetic questions.
It includes 229 real documents and 1,102 questions, spanning five domains and four major question types.
arXiv Detail & Related papers (2024-07-15T13:17:42Z)
- Attribute or Abstain: Large Language Models as Long Document Assistants
LLMs can help humans working with long documents, but are known to hallucinate.
Existing approaches to attribution have only been evaluated in RAG settings, where the initial retrieval confounds LLM performance.
This is crucially different from the long document setting, where retrieval is not needed, but could help.
We present LAB, a benchmark of 6 diverse long document tasks with attribution, and experiments with different approaches to attribution on 5 LLMs of different sizes.
arXiv Detail & Related papers (2024-07-10T16:16:02Z)
- Harnessing Multi-Role Capabilities of Large Language Models for Open-Domain Question Answering
Open-domain question answering (ODQA) has emerged as a pivotal research focus in information systems.
We propose a framework that formulates the ODQA process into three basic steps: query expansion, document selection, and answer generation (a pipeline sketch follows this entry).
We introduce a novel prompt optimization algorithm to refine role-playing prompts and steer LLMs to produce higher-quality evidence and answers.
arXiv Detail & Related papers (2024-03-08T11:09:13Z)
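As a rough illustration of that three-step decomposition, here is a minimal pipeline skeleton. The function names, prompts, and the `generate`/`search` wrappers are hypothetical placeholders, not the paper's implementation, which additionally optimizes role-playing prompts for each step.

```python
# Hypothetical skeleton of a three-step ODQA pipeline:
# query expansion -> document selection -> answer generation.
from typing import Callable

def odqa_pipeline(
    question: str,
    generate: Callable[[str], str],      # placeholder for any LLM call
    search: Callable[[str], list[str]],  # placeholder for any retriever
    k: int = 5,
) -> str:
    # Step 1: query expansion -- ask the model for diverse reformulations.
    expansions = generate(
        f"Rewrite this question as 3 diverse search queries, one per line:\n{question}"
    ).splitlines()
    # Step 2: document selection -- retrieve per query, then deduplicate.
    candidates: list[str] = []
    for q in [question, *expansions]:
        candidates.extend(search(q))
    docs = list(dict.fromkeys(candidates))[:k]  # keep order, drop duplicates
    # Step 3: answer generation -- condition the model on the selected evidence.
    context = "\n\n".join(docs)
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```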
- DAPR: A Benchmark on Document-Aware Passage Retrieval
We propose and name this task *Document-Aware Passage Retrieval* (DAPR).
In analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the majority of errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition
The MultiCoNER II shared task aims to tackle multilingual named entity recognition (NER) in fine-grained and noisy scenarios.
Previous top systems in MultiCoNER I incorporated either knowledge bases or gazetteers.
We propose a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER.
arXiv Detail & Related papers (2023-05-05T16:59:26Z)
- Visconde: Multi-document QA with GPT-3 and Neural Reranking
This paper proposes a question-answering system that can answer questions whose supporting evidence is spread over multiple documents.
The system, called Visconde, uses a three-step pipeline to perform the task: decompose, retrieve, and aggregate (sketched after this entry).
arXiv Detail & Related papers (2022-12-19T17:39:07Z)
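Below is a minimal sketch of the decompose-retrieve-aggregate pattern, again with hypothetical `generate` and `search` placeholders; the actual Visconde system uses GPT-3 for decomposition and aggregation and applies a neural reranker to the retrieved passages.

```python
# Hypothetical sketch of a decompose -> retrieve -> aggregate pipeline for
# multi-document QA; Visconde additionally reranks the retrieved passages.
from typing import Callable

def multidoc_qa(
    question: str,
    generate: Callable[[str], str],      # placeholder for an LLM call
    search: Callable[[str], list[str]],  # placeholder for a passage retriever
    per_subquestion: int = 3,
) -> str:
    # Decompose: split a complex question into simpler sub-questions.
    sub_questions = generate(
        f"Decompose into simple sub-questions, one per line:\n{question}"
    ).splitlines()
    # Retrieve: gather supporting passages for each sub-question.
    evidence = [p for sq in sub_questions for p in search(sq)[:per_subquestion]]
    # Aggregate: answer the original question from the pooled evidence.
    context = "\n\n".join(evidence)
    return generate(f"Evidence:\n{context}\n\nQuestion: {question}\nAnswer:")
```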
- Text Summarization with Latent Queries
We introduce LaQSum, the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms.
Under a deep generative framework, our system jointly optimizes a latent query model and a conditional language model, allowing users to plug-and-play queries of any type at test time (a schematic objective follows this entry).
Our system robustly outperforms strong comparison systems across summarization benchmarks with different query types, document settings, and target domains.
arXiv Detail & Related papers (2021-05-31T21:14:58Z)
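One way to read "jointly optimizes a latent query model and a conditional language model" is as a latent-variable objective. The ELBO-style bound below is a hedged sketch under that assumption, not the paper's exact formulation: d is the document, z the latent query, and y the summary.

```latex
% Illustrative ELBO-style objective: an inference network q_phi(z|d) infers a
% latent query z from document d, and a conditional language model
% p_theta(y|d,z) generates summary y; the KL term regularizes q toward a prior.
\mathcal{L}(\theta, \phi) =
  \mathbb{E}_{q_\phi(z \mid d)}\!\left[ \log p_\theta(y \mid d, z) \right]
  - \mathrm{KL}\!\left( q_\phi(z \mid d) \,\|\, p(z) \right)
```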
- Hurdles to Progress in Long-form Question Answering
We show that the task formulation raises fundamental challenges regarding evaluation and dataset creation.
We first design a new system that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance.
arXiv Detail & Related papers (2021-03-10T20:32:30Z)