CKBP v2: An Expert-Annotated Evaluation Set for Commonsense Knowledge Base Population
- URL: http://arxiv.org/abs/2304.10392v1
- Date: Thu, 20 Apr 2023 15:27:29 GMT
- Title: CKBP v2: An Expert-Annotated Evaluation Set for Commonsense Knowledge Base Population
- Authors: Tianqing Fang, Quyet V. Do, Sehyun Choi, Weiqi Wang, Yangqiu Song
- Abstract summary: We introduce CKBP v2, a new high-quality CSKB Population benchmark.
We conduct experiments comparing state-of-the-art methods for CSKB Population on the new evaluation set.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Populating Commonsense Knowledge Bases (CSKB) is an important yet hard task
in NLP, as it tackles knowledge from external sources with unseen events and
entities. Fang et al. (2021a) proposed a CSKB Population benchmark with an
evaluation set, CKBP v1. However, CKBP v1 relies on crowdsourced annotations that
contain a substantial fraction of incorrect answers, and its evaluation set
is not well aligned with the external knowledge source because it was randomly
sampled. In this paper, we introduce CKBP v2, a new high-quality CSKB
Population benchmark that addresses these two problems by using
expert annotation instead of crowdsourcing and by adding diversified
adversarial samples to make the evaluation set more representative. We conduct
extensive experiments comparing state-of-the-art methods for CSKB Population on
the new evaluation set to support future research comparisons. Empirical results show
that the population task remains challenging, even for large language models
(LLMs) such as ChatGPT. Code and data are available at
https://github.com/HKUST-KnowComp/CSKB-Population.
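As context for the task the abstract describes, CSKB Population can be framed as binary plausibility classification over (head, relation, tail) triples drawn from an external source. The sketch below is a hypothetical, minimal illustration of that framing only: the triple format follows ATOMIC-style relations such as `xWant`, but the `score_triple` function is a trivial stand-in, not the paper's model or the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str      # free-text event, e.g. "PersonX loses PersonX's job"
    relation: str  # commonsense relation, e.g. "xWant"
    tail: str      # free-text inference, e.g. "to find a new job"


def score_triple(triple: Triple, known: set[tuple[str, str, str]]) -> float:
    """Stand-in plausibility scorer: 1.0 if the triple already appears in the
    seed CSKB, else 0.0. A real populator replaces this lookup with a learned
    model that can score unseen events and entities."""
    return 1.0 if (triple.head, triple.relation, triple.tail) in known else 0.0


def populate(candidates: list[Triple],
             known: set[tuple[str, str, str]],
             threshold: float = 0.5) -> list[Triple]:
    """Keep only candidate triples whose plausibility clears the threshold."""
    return [t for t in candidates if score_triple(t, known) >= threshold]
```

A learned scorer would make `populate` admit plausible triples absent from the seed CSKB, which is exactly where an evaluation set with expert labels and adversarial negatives matters.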
Related papers
- Extractive text summarisation of Privacy Policy documents using machine learning approaches [0.0]
This work demonstrates two Privacy Policy (PP) summarisation models based on two different clustering algorithms.
K-means is used for the first model after an extensive evaluation of ten commonly used clustering algorithms.
The summariser model based on the PDC clustering algorithm summarises PP documents by grouping individual sentences by the distance from each sentence to the pre-defined cluster centres.
arXiv Detail & Related papers (2024-04-09T04:54:08Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- L-C2ST: Local Diagnostics for Posterior Approximations in Simulation-Based Inference [63.22081662149488]
L-C2ST allows for a local evaluation of the posterior estimator at any given observation.
It is theoretically grounded and easy to interpret.
On standard SBI benchmarks, L-C2ST provides comparable results to C2ST and outperforms alternative local approaches.
arXiv Detail & Related papers (2023-06-06T10:53:26Z)
- Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
arXiv Detail & Related papers (2023-05-29T07:41:03Z)
- Do You Hear The People Sing? Key Point Analysis via Iterative Clustering and Abstractive Summarisation [12.548947151123555]
Argument summarisation is a promising but currently under-explored field.
One of the main challenges in Key Point Analysis is finding high-quality key point candidates.
Evaluating key points is crucial in ensuring that the automatically generated summaries are useful.
arXiv Detail & Related papers (2023-05-25T12:43:29Z)
- CAR: Conceptualization-Augmented Reasoner for Zero-Shot Commonsense Question Answering [56.592385613002584]
We propose Conceptualization-Augmented Reasoner (CAR) to tackle the task of zero-shot commonsense question answering.
CAR abstracts a commonsense knowledge triple to many higher-level instances, which increases the coverage of CommonSense Knowledge Bases.
CAR more robustly generalizes to answering questions about zero-shot commonsense scenarios than existing methods.
arXiv Detail & Related papers (2023-05-24T08:21:31Z)
- Reinforcement Learning with Heterogeneous Data: Estimation and Inference [84.72174994749305]
We introduce the K-Heterogeneous Markov Decision Process (K-Hetero MDP) to address sequential decision problems with population heterogeneity.
We propose the Auto-Clustered Policy Evaluation (ACPE) for estimating the value of a given policy, and the Auto-Clustered Policy Iteration (ACPI) for estimating the optimal policy in a given policy class.
We present simulations to support our theoretical findings, and we conduct an empirical study on the standard MIMIC-III dataset.
arXiv Detail & Related papers (2022-01-31T20:58:47Z)
- Benchmarking Commonsense Knowledge Base Population with an Effective Evaluation Dataset [37.02104430195374]
Reasoning over commonsense knowledge bases (CSKB) whose elements are in the form of free-text is an important yet hard task in NLP.
We benchmark the CSKB population task with a new large-scale dataset.
We also propose a novel inductive commonsense reasoning model that reasons over graphs.
arXiv Detail & Related papers (2021-09-16T02:50:01Z)
- Ranking vs. Classifying: Measuring Knowledge Base Completion Quality [10.06803520598035]
We argue that consideration of binary predictions is essential to reflect the actual KBC quality.
We simulate the realistic scenario of real-world entities missing from a KB.
We evaluate a number of state-of-the-art KB embeddings models on our new benchmark.
arXiv Detail & Related papers (2021-02-02T17:53:48Z)
- Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases [63.43418760818188]
We release a new large-scale, high-quality dataset with 64,331 questions, GrailQA.
We propose a novel BERT-based KBQA model.
The combination of our dataset and model enables us to thoroughly examine and demonstrate, for the first time, the key role of pre-trained contextual embeddings like BERT in the generalization of KBQA.
arXiv Detail & Related papers (2020-11-16T06:36:26Z)