CBench: Towards Better Evaluation of Question Answering Over Knowledge
Graphs
- URL: http://arxiv.org/abs/2105.00811v1
- Date: Mon, 5 Apr 2021 15:41:14 GMT
- Title: CBench: Towards Better Evaluation of Question Answering Over Knowledge
Graphs
- Authors: Abdelghny Orogat, Isabelle Liu, Ahmed El-Roby
- Abstract summary: We introduce CBench, an informative benchmarking suite for analyzing benchmarks and evaluating question answering systems.
CBench can be used to analyze existing benchmarks with respect to several fine-grained linguistic, syntactic, and structural properties of the questions and queries.
- Score: 3.631024220680066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there has been an increase in the number of knowledge graphs that
can only be queried by experts. However, describing questions using structured
queries is not straightforward for non-expert users, who need sufficient
knowledge about both the vocabulary and the structure of the queried knowledge
graph, as well as the syntax of the structured query language used to express
their information needs. The most popular approach introduced to overcome
these challenges is to query such knowledge graphs using natural language.
Although several question answering benchmarks can be used to evaluate question
answering systems over a number of popular knowledge graphs, choosing a
benchmark that accurately assesses the quality of a question answering system
is a challenging task.
In this paper, we introduce CBench, an extensible and more informative
benchmarking suite for analyzing benchmarks and evaluating question answering
systems. CBench can be used to analyze existing benchmarks with respect to
several fine-grained linguistic, syntactic, and structural properties of the
questions and queries in the benchmark. We show that existing benchmarks vary
significantly with respect to these properties, which makes relying on a small
subset of them unreliable for evaluating QA systems. Until further research
improves the quality and comprehensiveness of benchmarks, CBench can be used to
facilitate this evaluation using a set of popular benchmarks that can be
augmented with other user-provided benchmarks. CBench not only evaluates a
question answering system based on popular single-number metrics but also gives
a detailed analysis of the linguistic, syntactic, and structural properties of
answered and unanswered questions, helping the developers of question answering
systems better understand where their system excels and where it struggles.
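The abstract describes CBench's property-based analysis only at a high level. As a rough
illustration of what fine-grained profiling of a benchmark's questions and SPARQL queries
could look like, here is a minimal sketch; the function names, the assumed JSON benchmark
format, and the string heuristics are all assumptions made for this example and do not
reflect CBench's actual interface.

    # Hypothetical sketch (NOT CBench's actual API): compute a few fine-grained
    # question/query properties for a benchmark whose entries pair a
    # natural-language question with a SPARQL query.
    import json
    from collections import Counter

    WH_WORDS = ("what", "which", "who", "where", "when", "why", "how")
    SPARQL_KEYWORDS = ("FILTER", "ORDER BY", "GROUP BY", "UNION", "OPTIONAL", "LIMIT")

    def question_properties(question: str) -> dict:
        """Simple linguistic properties of a natural-language question."""
        tokens = question.strip().split()
        first = tokens[0].lower().rstrip("?,") if tokens else ""
        return {
            "length_in_tokens": len(tokens),
            "wh_word": first if first in WH_WORDS else "none",
        }

    def query_properties(sparql: str) -> dict:
        """Rough syntactic/structural properties of a SPARQL query.
        Triple patterns are approximated by counting '.'-separated statements;
        a real analyzer would parse the query instead of matching strings."""
        upper = sparql.upper()
        body = upper.split("{", 1)[-1]
        # Query form = first of these keywords that appears in the query text.
        forms = [(upper.find(f), f)
                 for f in ("SELECT", "ASK", "DESCRIBE", "CONSTRUCT") if f in upper]
        return {
            "query_form": min(forms)[1] if forms else "UNKNOWN",
            "approx_triple_patterns": body.count(" . ") + 1 if "{" in upper else 0,
            "keywords": [k for k in SPARQL_KEYWORDS if k in upper],
        }

    def profile_benchmark(path: str) -> Counter:
        """Aggregate property counts over a benchmark file assumed to be a JSON
        list of {"question": ..., "query": ...} records."""
        stats = Counter()
        with open(path, encoding="utf-8") as f:
            for entry in json.load(f):
                q = question_properties(entry["question"])
                s = query_properties(entry["query"])
                stats[f"wh:{q['wh_word']}"] += 1
                stats[f"form:{s['query_form']}"] += 1
                stats[f"triple_patterns:{s['approx_triple_patterns']}"] += 1
        return stats

A real analyzer such as CBench would parse each SPARQL query properly (for example, to count
triple patterns and identify query shapes) rather than relying on string matching as above.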
Related papers
- Do You Know What You Are Talking About? Characterizing Query-Knowledge Relevance For Reliable Retrieval Augmented Generation [19.543102037001134]
Language models (LMs) are known to suffer from hallucinations and misinformation.
Retrieval augmented generation (RAG) that retrieves verifiable information from an external knowledge corpus provides a tangible solution to these problems.
RAG generation quality is highly dependent on the relevance between a user's query and the retrieved documents.
arXiv Detail & Related papers (2024-10-10T19:14:55Z)
- STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STaRK, a large-scale semi-structured retrieval benchmark on textual and relational knowledge bases.
Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine.
We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
- Evaluation of Question Generation Needs More References [7.876222232341623]
We propose to paraphrase the reference question for a more robust QG evaluation.
Using large language models such as GPT-3, we created semantically and syntactically diverse questions.
arXiv Detail & Related papers (2023-05-26T04:40:56Z)
- SkillQG: Learning to Generate Question for Reading Comprehension Assessment [54.48031346496593]
We present a question generation framework with controllable comprehension types for assessing and improving machine reading comprehension models.
We first frame the comprehension type of questions based on a hierarchical skill-based schema, then formulate SkillQG as a skill-conditioned question generator.
Empirical results demonstrate that SkillQG outperforms baselines in terms of quality, relevance, and skill-controllability.
arXiv Detail & Related papers (2023-05-08T14:40:48Z)
- Multiple-Choice Question Generation: Towards an Automated Assessment Framework [0.0]
Transformer-based pretrained language models have demonstrated the ability to produce appropriate questions from a context paragraph.
We focus on a fully automated multiple-choice question generation (MCQG) system where both the question and possible answers must be generated from the context paragraph.
arXiv Detail & Related papers (2022-09-23T19:51:46Z)
- A Benchmark for Generalizable and Interpretable Temporal Question Answering over Knowledge Bases [67.33560134350427]
TempQA-WD is a benchmark dataset for temporal reasoning.
It is based on Wikidata, which is the most frequently curated, openly available knowledge base.
arXiv Detail & Related papers (2022-01-15T08:49:09Z)
- BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation [16.81712151903078]
Natural language processing (NLP) systems are increasingly trained to generate open-ended text.
Different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others.
Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics) to make research into new metrics itself easier to evaluate.
arXiv Detail & Related papers (2021-10-18T10:03:19Z)
- Open-Retrieval Conversational Machine Reading [80.13988353794586]
In conversational machine reading, systems need to interpret natural language rules, answer high-level questions, and ask follow-up clarification questions.
Existing works assume the rule text is provided for each user question, which neglects the essential retrieval step in real scenarios.
In this work, we propose and investigate an open-retrieval setting of conversational machine reading.
arXiv Detail & Related papers (2021-02-17T08:55:01Z)
- Exploring and Analyzing Machine Commonsense Benchmarks [0.13999481573773073]
We argue that the lack of a common vocabulary for aligning these approaches' metadata limits researchers in their efforts to understand systems' deficiencies.
We describe our initial MCS Benchmark Ontology, a common vocabulary that formalizes benchmark metadata.
arXiv Detail & Related papers (2020-12-21T19:01:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.