Evaluation of Question Answering Systems: Complexity of judging a
natural language
- URL: http://arxiv.org/abs/2209.12617v1
- Date: Sat, 10 Sep 2022 12:29:04 GMT
- Title: Evaluation of Question Answering Systems: Complexity of judging a
natural language
- Authors: Amer Farea, Zhen Yang, Kien Duong, Nadeesha Perera, and Frank
Emmert-Streib
- Abstract summary: Question answering (QA) systems are among the most important and rapidly developing research topics in natural language processing (NLP).
This survey attempts to provide a systematic overview of the general framework of QA, QA paradigms, benchmark datasets, and assessment techniques for a quantitative evaluation of QA systems.
- Score: 3.4771957347698583
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Question answering (QA) systems are among the most important and rapidly
developing research topics in natural language processing (NLP). One reason for
this is that a QA system allows humans to interact more naturally with a
machine, e.g., via a virtual assistant or a search engine. Over the last decades,
many QA systems have been proposed to address the requirements of different
question-answering tasks. Furthermore, many error scores have been introduced,
e.g., based on n-gram matching, word embeddings, or contextual embeddings to
measure the performance of a QA system. This survey attempts to provide a
systematic overview of the general framework of QA, QA paradigms, benchmark
datasets, and assessment techniques for a quantitative evaluation of QA
systems. The latter is particularly important because not only the
construction of a QA system but also its evaluation is complex. We hypothesize
that one reason for this is that the quantitative formalization of human
judgment is an open problem.
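To make the abstract's distinction between lexical and embedding-based error scores concrete, here is a minimal sketch in Python. It is not a score defined in the surveyed paper; the function names, whitespace tokenization, and toy embedding table are illustrative assumptions (contextual-embedding metrics would replace the static lookup table with vectors produced by a pre-trained language model).

```python
from collections import Counter
import math


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer
    (a SQuAD-style lexical score, i.e. a simple form of n-gram matching)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_common = sum(common.values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def embedding_similarity(prediction: str, reference: str, emb: dict) -> float:
    """Cosine similarity between averaged word vectors; `emb` maps a token
    to a list of floats (e.g. rows of a pre-trained embedding table)."""
    def mean_vector(tokens):
        vectors = [emb[t] for t in tokens if t in emb]
        if not vectors:
            return None
        dim = len(vectors[0])
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    p = mean_vector(prediction.lower().split())
    r = mean_vector(reference.lower().split())
    if p is None or r is None:
        return 0.0
    dot = sum(a * b for a, b in zip(p, r))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in r))
    return dot / norm if norm > 0 else 0.0


if __name__ == "__main__":
    # Lexical overlap: shared tokens "eiffel" and "tower" give F1 = 0.8.
    print(token_f1("the Eiffel Tower", "Eiffel Tower"))
    # Toy 2-dimensional embedding table, purely for illustration.
    toy_emb = {"paris": [1.0, 0.0], "france": [0.9, 0.1], "berlin": [0.0, 1.0]}
    print(embedding_similarity("Paris", "France", toy_emb))
```

Both scores compare a system answer against a reference answer, but they disagree systematically: the lexical score penalizes paraphrases, while the embedding score can reward semantically related yet incorrect answers, which is part of why quantitatively formalizing human judgment remains open.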
Related papers
- A Joint-Reasoning based Disease Q&A System [6.117758142183177]
Medical question answer (QA) assistants respond to lay users' health-related queries by synthesizing information from multiple sources.
They can serve as vital tools to alleviate issues of misinformation, information overload, and complexity of medical language.
arXiv Detail & Related papers (2024-01-06T09:55:22Z)
- QUDEVAL: The Evaluation of Questions Under Discussion Discourse Parsing [87.20804165014387]
Questions Under Discussion (QUD) is a versatile linguistic framework in which discourse progresses as continuously asking questions and answering them.
This work introduces the first framework for the automatic evaluation of QUD parsing.
We present QUDeval, a dataset of fine-grained evaluation of 2,190 QUD questions generated from both fine-tuned systems and LLMs.
arXiv Detail & Related papers (2023-10-23T03:03:58Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation)
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- ProQA: Structural Prompt-based Pre-training for Unified Question Answering [84.59636806421204]
ProQA is a unified QA paradigm that solves various tasks through a single model.
It concurrently models the knowledge generalization for all QA tasks while keeping the knowledge customization for every specific QA task.
ProQA consistently boosts performance in full-data fine-tuning, few-shot learning, and zero-shot testing scenarios.
arXiv Detail & Related papers (2022-05-09T04:59:26Z)
- Improving the Question Answering Quality using Answer Candidate Filtering based on Natural-Language Features [117.44028458220427]
We address the problem of how the Question Answering (QA) quality of a given system can be improved.
Our main contribution is an approach capable of identifying wrong answers provided by a QA system.
In particular, our approach has shown its potential by removing, in many cases, the majority of incorrect answers.
arXiv Detail & Related papers (2021-12-10T11:09:44Z)
- NoiseQA: Challenge Set Evaluation for User-Centric Question Answering [68.67783808426292]
We show that components in the pipeline that precede an answering engine can introduce varied and considerable sources of error.
We conclude that there is substantial room for progress before QA systems can be effectively deployed.
arXiv Detail & Related papers (2021-02-16T18:35:29Z)
- Retrieving and Reading: A Comprehensive Survey on Open-domain Question Answering [62.88322725956294]
We review the latest research trends in OpenQA, with particular attention to systems that incorporate neural MRC techniques.
We introduce the modern OpenQA architecture named "Retriever-Reader" and analyze the various systems that follow this architecture.
We then discuss key challenges to developing OpenQA systems and offer an analysis of benchmarks that are commonly used.
arXiv Detail & Related papers (2021-01-04T04:47:46Z)
- QA2Explanation: Generating and Evaluating Explanations for Question Answering Systems over Knowledge Graph [4.651476054353298]
We develop an automatic approach for generating explanations during various stages of a pipeline-based QA system.
Our approach is supervised and automatic, and it considers three classes (i.e., success, no answer, and wrong answer) for annotating the output of the involved QA components.
arXiv Detail & Related papers (2020-10-16T11:32:12Z)