Attribute or Abstain: Large Language Models as Long Document Assistants
- URL: http://arxiv.org/abs/2407.07799v1
- Date: Wed, 10 Jul 2024 16:16:02 GMT
- Title: Attribute or Abstain: Large Language Models as Long Document Assistants
- Authors: Jan Buchmann, Xiao Liu, Iryna Gurevych
- Abstract summary: We present LAB, a benchmark of 6 diverse long document tasks with attribution, and experiment with different approaches to attribution on 4 LLMs of different sizes.
We find that citation, i.e., response generation and evidence extraction in one step, mostly performs best.
We also find that evidence quality can predict response quality on datasets with simple responses, but not so for complex responses.
- Score: 58.32043134560244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLMs can help humans working with long documents, but are known to hallucinate. Attribution can increase trust in LLM responses: The LLM provides evidence that supports its response, which enhances verifiability. Existing approaches to attribution have only been evaluated in RAG settings, where the initial retrieval confounds LLM performance. This is crucially different from the long document setting, where retrieval is not needed, but could help. Thus, a long document specific evaluation of attribution is missing. To fill this gap, we present LAB, a benchmark of 6 diverse long document tasks with attribution, and experiment with different approaches to attribution on 4 LLMs of different sizes, both prompted and fine-tuned. We find that citation, i.e., response generation and evidence extraction in one step, mostly performs best. We investigate whether the "Lost in the Middle" phenomenon exists for attribution, but do not find it. We also find that evidence quality can predict response quality on datasets with simple responses, but not so for complex responses, as models struggle with providing evidence for complex claims. We release code and data for further investigation.
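To make the "citation" approach concrete: below is a minimal sketch of one-step attribution, where the model is prompted over numbered document segments and emits inline evidence markers that are resolved back to the source text. The prompt wording, the [n] marker syntax, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
import re

# Hedged sketch of citation-style attribution: the LLM answers and cites
# evidence segments in a single generation step. Marker syntax and prompt
# format are assumptions for illustration only.

def build_prompt(question: str, segments: list[str]) -> str:
    """Number the document segments so the model can cite them inline."""
    numbered = "\n".join(f"[{i}] {seg}" for i, seg in enumerate(segments, 1))
    return (
        f"Document segments:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer the question, citing supporting segments inline like [3]."
    )

def extract_citations(response: str, segments: list[str]) -> list[str]:
    """Resolve inline markers such as [3] back to evidence segments."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", response)}
    return [segments[i - 1] for i in sorted(cited) if 1 <= i <= len(segments)]
```

A post-hoc variant would instead match the finished response against segments in a second pass; the paper's finding is that the one-step citation approach mostly performs better.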
Related papers
- RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content [13.187520657952263]
Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet.
Evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions.
We introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks.
arXiv Detail & Related papers (2024-06-17T17:52:54Z)
- R4: Reinforced Retriever-Reorder-Responder for Retrieval-Augmented Large Language Models [32.598670876662375]
Retrieval-augmented large language models (LLMs) leverage relevant content retrieved by information retrieval systems to generate correct responses.
Existing retriever-responder methods typically append relevant documents to the prompt of LLMs to perform text generation tasks.
We propose a new pipeline named "Reinforced Retriever-Reorder-Responder" to learn document orderings for retrieval-augmented LLMs.
arXiv Detail & Related papers (2024-05-04T12:59:10Z)
- SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMs [85.54906813106683]
We propose SuRe, a simple yet effective framework to enhance open-domain question answering (ODQA) with large language models (LLMs).
SuRe (Summarizing Retrievals) helps LLMs predict more accurate answers for a given question, which are well supported by the summarized retrievals.
Experimental results on diverse ODQA benchmarks demonstrate the superiority of SuRe, with improvements of up to 4.6% in exact match (EM) and 4.0% in F1 score over standard prompting approaches.
arXiv Detail & Related papers (2024-04-17T01:15:54Z)
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
- Harnessing Multi-Role Capabilities of Large Language Models for Open-Domain Question Answering [40.2758450304531]
Open-domain question answering (ODQA) has emerged as a pivotal research topic in information systems.
We propose a framework that formulates the ODQA process into three basic steps: query expansion, document selection, and answer generation.
We introduce a novel prompt optimization algorithm to refine role-playing prompts and steer LLMs to produce higher-quality evidence and answers.
arXiv Detail & Related papers (2024-03-08T11:09:13Z)
- UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models [73.73303148524398]
Large language models (LLMs) may generate text that lacks consistency with human knowledge, leading to factual inaccuracies or hallucination.
We propose UFO, an LLM-based unified and flexible evaluation framework to verify facts against plug-and-play fact sources.
arXiv Detail & Related papers (2024-02-22T16:45:32Z)
- LLatrieval: LLM-Verified Retrieval for Verifiable Generation [67.93134176912477]
Verifiable generation aims to let the large language model (LLM) generate text with supporting documents.
We propose LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question.
Experiments show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-11-14T01:38:02Z)
- ODSum: New Benchmarks for Open Domain Multi-Document Summarization [30.875191848268347]
Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries.
We propose a rule-based method to process query-based document summarization datasets into ODMDS datasets.
arXiv Detail & Related papers (2023-09-16T11:27:34Z)
- Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models [68.37431984231338]
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision.
We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting.
arXiv Detail & Related papers (2022-12-15T18:45:29Z)
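A thread common to several of these papers, and to LAB's own analysis of whether evidence quality predicts response quality, is scoring whether cited evidence actually supports a claim. As a hedged illustration only, the toy check below uses token overlap as a stand-in for the trained entailment (NLI) scorers such evaluations typically rely on; the function names and the 0.5 threshold are assumptions.

```python
# Toy support check; real attribution evaluations would typically swap
# token_overlap for an entailment (NLI) model score.

def token_overlap(claim: str, evidence: str) -> float:
    """Fraction of the claim's tokens that also appear in the evidence."""
    claim_tokens = set(claim.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & set(evidence.lower().split())) / len(claim_tokens)

def is_supported(claim: str, evidence_spans: list[str], threshold: float = 0.5) -> bool:
    """Treat a claim as attributed if any cited span covers enough of it."""
    return any(token_overlap(claim, span) >= threshold for span in evidence_spans)
```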
This list is automatically generated from the titles and abstracts of the papers on this site.