Automatic Evaluation of Attribution by Large Language Models
- URL: http://arxiv.org/abs/2305.06311v2
- Date: Sat, 7 Oct 2023 22:46:33 GMT
- Title: Automatic Evaluation of Attribution by Large Language Models
- Authors: Xiang Yue, Boshi Wang, Ziru Chen, Kai Zhang, Yu Su and Huan Sun
- Abstract summary: We investigate the automatic evaluation of attribution given by large language models (LLMs)
We begin by defining different types of attribution errors, and then explore two approaches for automatic evaluation.
We manually curate a set of test examples covering 12 domains from a generative search engine, New Bing.
- Score: 24.443271739599194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recent focus of large language model (LLM) development, as exemplified by
generative search engines, is to incorporate external references to generate
and support its claims. However, evaluating the attribution, i.e., verifying
whether the generated statement is fully supported by the cited reference,
remains an open problem. Although human evaluation is common practice, it is
costly and time-consuming. In this paper, we investigate the automatic
evaluation of attribution given by LLMs. We begin by defining different types
of attribution errors, and then explore two approaches for automatic
evaluation: prompting LLMs and fine-tuning smaller LMs. The fine-tuning data is
repurposed from related tasks such as question answering, fact-checking,
natural language inference, and summarization. We manually curate a set of test
examples covering 12 domains from a generative search engine, New Bing. Our
results on this curated test set and simulated examples from existing
benchmarks highlight both promising signals and challenges. We hope our problem
formulation, testbeds, and findings will help lay the foundation for future
studies on this important problem.
Related papers
- Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation [8.975024781390077]
We present MIRAGE --Model Internals-based RAG Explanations -- a plug-and-play approach using model internals for faithful answer attribution in question answering applications.
We evaluate our proposed approach on a multilingual QA dataset, finding high agreement with human answer attribution.
arXiv Detail & Related papers (2024-06-19T16:10:26Z) - Automatic benchmarking of large multimodal models via iterative experiment programming [71.78089106671581]
We present APEx, the first framework for automatic benchmarking of LMMs.
Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand.
The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions.
arXiv Detail & Related papers (2024-06-18T06:43:46Z) - Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score.
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations [34.99831757956635]
We formulate the task of attributed query-focused summarization (AQFS) and present WebCiteS, a Chinese dataset featuring 7k human-annotated summaries with citations.
We tackle these issues by developing detailed metrics and enabling the automatic evaluator to decompose the sentences into sub-claims for fine-grained verification.
arXiv Detail & Related papers (2024-03-04T07:06:41Z) - Investigating Data Contamination in Modern Benchmarks for Large Language Models [27.479260572913724]
Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs.
We study data contamination by proposing two methods tailored for both open-source and proprietary LLMs.
We find that certain commercial LLMs could surprisingly guess the missing option in various test sets.
arXiv Detail & Related papers (2023-11-16T11:03:04Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - The Devil is in the Errors: Leveraging Large Language Models for
Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.