FactKB: Generalizable Factuality Evaluation using Language Models Enhanced with Factual Knowledge
- URL: http://arxiv.org/abs/2305.08281v2
- Date: Wed, 18 Oct 2023 23:36:43 GMT
- Title: FactKB: Generalizable Factuality Evaluation using Language Models Enhanced with Factual Knowledge
- Authors: Shangbin Feng, Vidhisha Balachandran, Yuyang Bai, Yulia Tsvetkov
- Abstract summary: We propose FactKB, a simple new approach to factuality evaluation that is generalizable across domains.
We introduce three types of complementary factuality pretraining objectives based on direct entity facts, facts grounded in auxiliary knowledge about entities, and facts constructed compositionally through knowledge base walks.
The resulting factuality evaluation model achieves state-of-the-art performance on two in-domain news summarization benchmarks and on three out-of-domain scientific literature datasets.
- Score: 37.2179237007464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating the factual consistency of automatically generated summaries is
essential for the progress and adoption of reliable summarization systems.
Despite recent advances, existing factuality evaluation models are not robust,
being especially prone to entity and relation errors in new domains. We propose
FactKB, a simple new approach to factuality evaluation that is generalizable
across domains, in particular with respect to entities and relations. FactKB is
based on language models pretrained using facts extracted from external
knowledge bases. We introduce three types of complementary factuality
pretraining objectives based on direct entity facts, facts grounded in
auxiliary knowledge about entities, and facts constructed compositionally
through knowledge base walks. The resulting factuality evaluation model
achieves state-of-the-art performance on two in-domain news summarization
benchmarks as well as on three out-of-domain scientific literature datasets.
Further analysis shows that FactKB is better able to detect erroneous entities
and relations in summaries and that it remains robust and generalizable across domains.
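The three pretraining objectives can be pictured with a small, self-contained sketch. The snippet below is an illustrative assumption of how such pretraining text might be constructed from a toy knowledge base: direct entity facts verbalized from triples, facts grounded in auxiliary entity descriptions, and compositional facts obtained from short knowledge-base walks. The toy KB, templates, and walk length are placeholders, not the authors' actual data pipeline.

```python
# Minimal sketch (illustrative only): building the three kinds of factuality
# pretraining text described in the abstract from a toy knowledge base.
# The KB contents, templates, and walk length are assumptions.
import random

# Toy KB: head entity -> list of (relation, tail entity) triples.
KB = {
    "Marie Curie": [("field", "physics"), ("award", "Nobel Prize in Physics")],
    "Nobel Prize in Physics": [("awarded_by", "Royal Swedish Academy of Sciences")],
}
# Auxiliary knowledge: short textual descriptions of entities.
DESCRIPTIONS = {
    "Marie Curie": "Marie Curie was a physicist and chemist.",
    "Nobel Prize in Physics": "The Nobel Prize in Physics is an annual award.",
}

def entity_facts(entity):
    """Objective 1: verbalize direct (entity, relation, object) facts."""
    return [f"{entity} {rel} {obj}." for rel, obj in KB.get(entity, [])]

def grounded_facts(entity):
    """Objective 2: pair each direct fact with auxiliary entity knowledge."""
    desc = DESCRIPTIONS.get(entity, "")
    return [f"{desc} {fact}" for fact in entity_facts(entity)]

def walk_facts(entity, hops=2):
    """Objective 3: compose facts along a short random KB walk."""
    parts, current = [], entity
    for _ in range(hops):
        triples = KB.get(current)
        if not triples:
            break
        rel, obj = random.choice(triples)
        parts.append(f"{current} {rel} {obj}")
        current = obj
    return ". ".join(parts) + "." if parts else ""

if __name__ == "__main__":
    print(entity_facts("Marie Curie"))
    print(grounded_facts("Marie Curie"))
    print(walk_facts("Marie Curie"))
```

In the paper, a language model pretrained on fact-derived text of this kind is then used as a factuality classifier over summary and source pairs; the exact pretraining and fine-tuning setup is described in the full paper.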
Related papers
- ZeFaV: Boosting Large Language Models for Zero-shot Fact Verification [2.6874004806796523]
ZeFaV is a zero-shot fact-verification framework for improving the performance of large language models on fact verification.
We conducted empirical experiments to evaluate our approach on two multi-hop fact-checking datasets, HoVer and FEVEROUS.
arXiv Detail & Related papers (2024-11-18T02:35:15Z)
- Entity-level Factual Adaptiveness of Fine-tuning based Abstractive Summarization Models [31.84120883461332]
We analyze the robustness of fine-tuning-based summarization models to knowledge conflicts.
We introduce a controllable counterfactual data augmentation method.
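The summary does not spell out the augmentation procedure, so the sketch below shows only a generic entity-swap flavour of counterfactual augmentation: replace a named entity in a reference summary with another entity of the same type to obtain an inconsistent counterpart. The data format and entity pool are hypothetical, not the paper's exact method.

```python
# Illustrative sketch (not the paper's exact method): create counterfactual
# summaries by swapping a named entity for another entity of the same type,
# yielding (consistent, inconsistent) training pairs.
import random

def entity_swap(summary, entities_in_summary, entity_pool):
    """Replace one entity in the summary with a random distractor of the same type."""
    target_type, target = random.choice(entities_in_summary)  # e.g. ("PERSON", "Alice")
    candidates = [e for e in entity_pool.get(target_type, []) if e != target]
    if not candidates:
        return summary  # nothing to swap
    return summary.replace(target, random.choice(candidates))

# Hypothetical usage with pre-extracted entities:
summary = "Alice announced the merger in Paris."
entities = [("PERSON", "Alice"), ("GPE", "Paris")]
pool = {"PERSON": ["Alice", "Bob"], "GPE": ["Paris", "Berlin"]}
print(entity_swap(summary, entities, pool))
```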
arXiv Detail & Related papers (2024-02-23T07:53:39Z)
- Deep Outdated Fact Detection in Knowledge Graphs [13.711099395945988]
This paper presents DEAN, a novel deep learning-based framework designed to identify outdated facts within Knowledge Graphs (KGs).
DEAN distinguishes itself by capturing implicit structural information among facts through comprehensive modeling of both entities and relations.
Experimental results demonstrate the effectiveness and superiority of DEAN over state-of-the-art baseline methods.
arXiv Detail & Related papers (2024-02-06T05:58:15Z)
- Generating Benchmarks for Factuality Evaluation of Language Models [61.69950787311278]
We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality.
FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements.
We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation.
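One way to realize this kind of benchmark item is sketched below, assuming a standard causal language model from Hugging Face Transformers ("gpt2" is only a stand-in) and a hand-written true/false statement pair: the model passes the item if it assigns a lower average token loss, i.e. higher likelihood, to the factual statement than to a similar but incorrect variant. FACTOR's actual corpus transformation and scoring procedure are described in the paper.

```python
# Sketch (assumptions: transformers installed, "gpt2" as a stand-in LM).
# A FACTOR-style check: does the LM prefer the factual statement over a
# minimally edited non-factual variant, measured by average token loss?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_loss(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

true_stmt = "The Eiffel Tower is located in Paris."
false_stmt = "The Eiffel Tower is located in Rome."
# The LM "passes" this item if the factual statement gets the lower loss.
print(avg_loss(true_stmt) < avg_loss(false_stmt))
```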
arXiv Detail & Related papers (2023-07-13T17:14:38Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
Under the resulting robustness metric, a model is judged robust only if its performance is consistently accurate across all examples within each clique.
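Read literally, that criterion can be written as a tiny scoring function: a model earns credit for a clique only if it is correct on every knowledge-invariant variant in it. The example data layout below is an assumption for illustration.

```python
# Illustrative sketch of a clique-level robustness score: a model scores a
# clique only if its predictions are correct for every knowledge-invariant
# variant in that clique. Data format and scoring rule are assumptions.
def clique_robustness(cliques, is_correct):
    """cliques: list of lists of examples; is_correct: example -> bool."""
    robust = sum(1 for clique in cliques if all(is_correct(ex) for ex in clique))
    return robust / len(cliques) if cliques else 0.0

# Hypothetical usage with precomputed correctness flags:
cliques = [[{"ok": True}, {"ok": True}], [{"ok": True}, {"ok": False}]]
print(clique_robustness(cliques, lambda ex: ex["ok"]))  # 0.5
```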
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- Contextualization and Generalization in Entity and Relation Extraction [0.0]
We study the behaviour of state-of-the-art models regarding generalization to facts unseen during training.
Traditional benchmarks exhibit substantial lexical overlap between the mentions and relations used for training and those used for evaluation.
We propose empirical studies to separate performance based on mention and relation overlap with the training set.
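A minimal sketch of that kind of overlap-based split, under an assumed example schema: bucket test examples by whether the mention (or relation) string also occurs in the training set, then report metrics separately per bucket.

```python
# Sketch: partition test examples by whether their mention/relation strings
# overlap with the training set, so performance can be reported separately
# for seen vs. genuinely unseen facts. The example schema is assumed.
def split_by_overlap(train_examples, test_examples, key):
    """key: function mapping an example to the string checked for overlap."""
    seen = {key(ex) for ex in train_examples}
    overlapping = [ex for ex in test_examples if key(ex) in seen]
    novel = [ex for ex in test_examples if key(ex) not in seen]
    return overlapping, novel

train = [{"mention": "Barack Obama", "relation": "born_in"}]
test = [{"mention": "Barack Obama", "relation": "born_in"},
        {"mention": "Ada Lovelace", "relation": "born_in"}]
by_mention = split_by_overlap(train, test, key=lambda ex: ex["mention"])
print(len(by_mention[0]), len(by_mention[1]))  # 1 1
```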
arXiv Detail & Related papers (2022-06-15T14:16:42Z)
- Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
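For context, Best-Worst Scaling typically converts "pick the best and the worst item in each tuple" judgments into per-system scores as the fraction of times judged best minus the fraction of times judged worst; the sketch below assumes that standard counting rule and a simple annotation record format.

```python
# Sketch of the standard Best-Worst Scaling score: for each system,
# score = (#times judged best - #times judged worst) / #times shown.
# The annotation record format is an assumption for illustration.
from collections import Counter

def bws_scores(annotations):
    """annotations: list of dicts with 'shown', 'best', 'worst' system names."""
    best, worst, shown = Counter(), Counter(), Counter()
    for ann in annotations:
        shown.update(ann["shown"])
        best[ann["best"]] += 1
        worst[ann["worst"]] += 1
    return {s: (best[s] - worst[s]) / shown[s] for s in shown}

anns = [{"shown": ["A", "B", "C"], "best": "A", "worst": "C"},
        {"shown": ["A", "B", "C"], "best": "B", "worst": "C"}]
print(bws_scores(anns))  # {'A': 0.5, 'B': 0.5, 'C': -1.0}
```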
arXiv Detail & Related papers (2021-09-19T19:05:00Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Enhancing Factual Consistency of Abstractive Summarization [57.67609672082137]
We propose a fact-aware summarization model, FASum, that extracts and integrates factual relations into the summary generation process.
We then design a factual corrector model, FC, to automatically correct factual errors in summaries generated by existing systems.
arXiv Detail & Related papers (2020-03-19T07:36:10Z)