A Critical Evaluation of Evaluations for Long-form Question Answering
- URL: http://arxiv.org/abs/2305.18201v1
- Date: Mon, 29 May 2023 16:54:24 GMT
- Title: A Critical Evaluation of Evaluations for Long-form Question Answering
- Authors: Fangyuan Xu, Yixiao Song, Mohit Iyyer, Eunsol Choi
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-form question answering (LFQA) enables answering a wide range of
questions, but its flexibility poses enormous challenges for evaluation. We
perform the first targeted study of the evaluation of long-form answers,
covering both human and automatic evaluation practices. We hire domain experts
in seven areas to provide preference judgments over pairs of answers, along
with free-form justifications for their choices. We present a careful analysis
of experts' evaluation, which focuses on new aspects such as the
comprehensiveness of the answer. Next, we examine automatic text generation
metrics, finding that no existing metrics are predictive of human preference
judgments. However, some metrics correlate with fine-grained aspects of answers
(e.g., coherence). We encourage future work to move away from a single "overall
score" of the answer and adopt a multi-faceted evaluation, targeting aspects
such as factuality and completeness. We publicly release all of our annotations
and code to spur future work into LFQA evaluation.
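The core analysis described above — checking whether an automatic metric agrees with human pairwise preference judgments — can be sketched as follows. This is a toy illustration, not the paper's released code: the metric here is a simple unigram-overlap F1 standing in for any automatic text generation metric, and the answer pairs are invented examples.

```python
# Toy sketch: does an automatic metric agree with human pairwise preferences?
# unigram_f1 is a stand-in for any automatic metric (ROUGE, BERTScore, ...).

def unigram_f1(candidate: str, reference: str) -> float:
    """Simple token-overlap F1 between a candidate answer and a reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = len(set(cand) & set(ref))
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def preference_agreement(pairs, reference):
    """Fraction of answer pairs where the metric prefers the answer
    that the human annotator preferred.

    `pairs` is a list of (answer_a, answer_b, human_choice),
    with human_choice in {"a", "b"}.
    """
    agree = 0
    for ans_a, ans_b, choice in pairs:
        metric_choice = (
            "a" if unigram_f1(ans_a, reference) >= unigram_f1(ans_b, reference)
            else "b"
        )
        agree += metric_choice == choice
    return agree / len(pairs)

# Invented example data for illustration only.
pairs = [
    ("the mitochondria produce energy for the cell", "cells are small", "a"),
    ("water boils at 100 degrees celsius at sea level",
     "boiling depends on pressure and altitude", "b"),
]
reference = "the mitochondria generate most of the cell's energy"
print(preference_agreement(pairs, reference))  # 0.5
```

An agreement near 0.5 on binary preferences means the metric is no better than chance at predicting human judgments, which is the kind of finding the paper reports for existing metrics.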
Related papers
- Fine-grained Hallucination Detection and Mitigation in Long-form Question Answering [79.63372684264921]
Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension.
This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers.
arXiv Detail & Related papers (2024-07-16T17:23:16Z)
- Accurate and Nuanced Open-QA Evaluation Through Textual Entailment
We propose to study the entailment relations of answers to identify more informative and more general system answers.
The entailment-based evaluation we propose allows the assignment of bonus or partial marks by quantifying the inference gap between answers.
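The partial-mark idea can be sketched in a few lines. This is a hedged illustration of the concept only: `entailment_score` below is a toy token-containment proxy, whereas the paper would use a trained NLI model, and the grading scheme is an assumption rather than the paper's exact formula.

```python
# Sketch of entailment-based marking: full marks when the system answer
# entails the gold answer, partial marks proportional to coverage otherwise.
# entailment_score is a toy proxy, NOT a real NLI model.

def entailment_score(premise: str, hypothesis: str) -> float:
    """Toy proxy: fraction of hypothesis tokens present in the premise."""
    prem = set(premise.lower().split())
    hyp = hypothesis.lower().split()
    if not hyp:
        return 0.0
    return sum(tok in prem for tok in hyp) / len(hyp)

def mark_answer(system_answer: str, gold_answer: str,
                full_mark: float = 1.0, threshold: float = 0.9) -> float:
    """Assign full or partial marks based on the inference gap."""
    score = entailment_score(system_answer, gold_answer)
    if score >= threshold:
        return full_mark
    return full_mark * score  # partial credit for partially covered answers

print(mark_answer("paris is the capital of france", "paris"))   # 1.0
print(mark_answer("the capital is in france", "paris france"))  # 0.5
```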
arXiv Detail & Related papers (2024-05-26T21:33:27Z)
- Aspect-oriented Consumer Health Answer Summarization
Community Question-Answering (CQA) forums have revolutionized how people seek information, especially those related to their healthcare needs.
There can be several answers in response to a single query, which makes it hard to grasp the key information related to the specific health concern.
Our research focuses on aspect-based summarization of health answers to address this limitation.
arXiv Detail & Related papers (2024-05-10T07:52:43Z)
- SQuArE: Automatic Question Answering Evaluation using Multiple Positive and Negative References
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation)
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- ExpertQA: Expert-Curated Questions and Attributed Answers
We collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions.
We conduct human evaluation of responses from a few representative systems along various axes of attribution and factuality.
The output of our analysis is ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.
arXiv Detail & Related papers (2023-09-14T16:54:34Z)
- Continually Improving Extractive QA via Human Feedback
We study continually improving an extractive question answering (QA) system via human user feedback.
We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time.
arXiv Detail & Related papers (2023-05-21T14:35:32Z)
- AnswerSumm: A Manually-Curated Dataset and Pipeline for Answer Summarization
Community Question Answering (CQA) forums such as Stack Overflow and Yahoo! Answers are a rich resource of answers to a wide range of community-based questions.
One goal of answer summarization is to produce a summary that reflects the range of answer perspectives.
This work introduces a novel dataset of 4,631 CQA threads for answer summarization, curated by professional linguists.
arXiv Detail & Related papers (2021-11-11T21:48:02Z)
- Exploring Question-Specific Rewards for Generating Deep Questions
We design three rewards targeting the fluency, relevance, and answerability of generated questions.
We find that optimizing question-specific rewards generally leads to better performance in automatic evaluation metrics.
arXiv Detail & Related papers (2020-11-02T16:37:30Z)
- ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning
This paper introduces a new question answering dataset for training and evaluating common sense reasoning capabilities of artificial intelligence systems.
The training set is gathered from an existing set of questions played on the long-running international game show Family Feud.
We also propose a generative evaluation task where a model has to output a ranked list of answers, ideally covering prototypical answers for a question.
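One simplified way to score such a ranked list is against clusters of prototypical answers weighted by popularity. The sketch below is an assumption-laden stand-in, not the official ProtoQA metric: the clusters, alias sets, and top-k cutoff are all invented for illustration.

```python
# Toy sketch of ranked-list evaluation against prototypical answer clusters.
# Each cluster carries a popularity count; a cluster is credited at most once.

def ranked_list_score(ranked_answers, clusters, k=3):
    """Score the top-k ranked answers by the count mass of matched clusters.

    `clusters` maps a canonical answer to (set_of_aliases, count).
    Returns matched count mass divided by total count mass.
    """
    total = sum(count for _, count in clusters.values())
    matched, used = 0, set()
    for answer in ranked_answers[:k]:
        for name, (aliases, count) in clusters.items():
            if name in used:
                continue
            if answer.lower() in aliases:
                matched += count
                used.add(name)
                break
    return matched / total

# Invented clusters for a question like "name a common household pet".
clusters = {
    "dog": ({"dog", "puppy"}, 40),
    "cat": ({"cat", "kitten"}, 35),
    "fish": ({"fish", "goldfish"}, 10),
}
print(ranked_list_score(["dog", "hamster", "cat"], clusters))
```

A higher score means the ranked list covers more of the popular prototypical answers, which matches the coverage goal described above.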
arXiv Detail & Related papers (2020-05-02T09:40:05Z)
- Review-guided Helpful Answer Identification in E-commerce
Product-specific community question answering platforms can greatly help address the concerns of potential customers.
The user-provided answers on such platforms often vary widely in quality.
Helpfulness votes from the community can indicate the overall quality of the answer, but they are often missing.
arXiv Detail & Related papers (2020-03-13T11:34:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.