Evaluating Open-Domain Question Answering in the Era of Large Language Models
- URL: http://arxiv.org/abs/2305.06984v3
- Date: Thu, 6 Jul 2023 18:52:08 GMT
- Title: Evaluating Open-Domain Question Answering in the Era of Large Language Models
- Authors: Ehsan Kamalloo, Nouha Dziri, Charles L. A. Clarke, Davood Rafiei
- Abstract summary: Lexical matching remains the de facto evaluation method for open-domain question answering (QA).
The recent success of large language models (LLMs) for QA aggravates lexical matching failures, since candidate answers become longer.
Without accurate evaluation, the true progress in open-domain QA remains unknown.
- Score: 9.144650595481377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lexical matching remains the de facto evaluation method for open-domain
question answering (QA). Unfortunately, lexical matching fails completely when
a plausible candidate answer does not appear in the list of gold answers, which
is increasingly the case as we shift from extractive to generative models. The
recent success of large language models (LLMs) for QA aggravates lexical
matching failures since candidate answers become longer, thereby making
matching with the gold answers even more challenging. Without accurate
evaluation, the true progress in open-domain QA remains unknown. In this paper,
we conduct a thorough analysis of various open-domain QA models, including
LLMs, by manually evaluating their answers on a subset of NQ-open, a popular
benchmark. Our assessments reveal that while the true performance of all models
is significantly underestimated, the performance of the InstructGPT (zero-shot)
LLM increases by nearly +60%, making it on par with existing top models, and
the InstructGPT (few-shot) model actually achieves a new state-of-the-art on
NQ-open. We also find that more than 50% of lexical matching failures are
attributed to semantically equivalent answers. We further demonstrate that
regex matching ranks QA models consistently with human judgments, although it
still suffers from unnecessary strictness. Finally, we demonstrate that automated
evaluation models are a reasonable surrogate for lexical matching in some
circumstances, but not for long-form answers generated by LLMs. The automated
models struggle to detect hallucinations in LLM answers and are thus unable
to evaluate LLMs. At this time, there appears to be no substitute for human
evaluation.
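To make the failure mode concrete, below is a minimal sketch of SQuAD-style exact-match scoring alongside a looser containment check that loosely stands in for the regex matching discussed above. This is not the authors' evaluation code; the helper names (normalize, exact_match, contains_match) and the example question are illustrative assumptions.
```python
import re
import string

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate: str, gold_answers: list[str]) -> bool:
    """Lexical (exact-match) evaluation: the normalized candidate must equal a gold answer."""
    return normalize(candidate) in {normalize(g) for g in gold_answers}


def contains_match(candidate: str, gold_answers: list[str]) -> bool:
    """Looser check: does any normalized gold answer appear inside the candidate?
    A crude stand-in for regex-style matching over long-form answers."""
    cand = normalize(candidate)
    return any(normalize(g) in cand for g in gold_answers)


gold = ["John Glenn"]
llm_answer = "The first American to orbit the Earth was John Glenn, in 1962."

print(exact_match(llm_answer, gold))     # False: the long answer is not an exact string match
print(contains_match(llm_answer, gold))  # True: the gold answer is contained in the response
```
As the example shows, a fluent long-form answer that clearly contains the gold answer still scores zero under exact match, which is precisely the kind of underestimation the paper quantifies.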
Related papers
- LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to its diverse potential answers and the lack of an objective criterion.
Large Language Models (LLMs) have been adopted for NFQA evaluation due to their compelling performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach that utilizes LLMs to rank candidate answers within a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
- WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs).
In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia.
We benchmark a diverse range of both closed-source and open-source LLMs under different QA scenarios, including RAG with a single passage and RAG with two contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Investigating Data Contamination in Modern Benchmarks for Large Language Models [27.479260572913724]
Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs.
We study data contamination by proposing two methods tailored for both open-source and proprietary LLMs.
We find that certain commercial LLMs could surprisingly guess the missing option in various test sets.
arXiv Detail & Related papers (2023-11-16T11:03:04Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Automatic Evaluation of Attribution by Large Language Models [24.443271739599194]
We investigate the automatic evaluation of attribution given by large language models (LLMs).
We begin by defining different types of attribution errors, and then explore two approaches for automatic evaluation.
We manually curate a set of test examples covering 12 domains from a generative search engine, New Bing.
arXiv Detail & Related papers (2023-05-10T16:58:33Z)