TRUE: Re-evaluating Factual Consistency Evaluation
- URL: http://arxiv.org/abs/2204.04991v1
- Date: Mon, 11 Apr 2022 10:14:35 GMT
- Title: TRUE: Re-evaluating Factual Consistency Evaluation
- Authors: Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron
Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim,
Yossi Matias
- Abstract summary: We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
- Score: 29.888885917330327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grounded text generation systems often generate text that contains factual
inconsistencies, hindering their real-world applicability. Automatic factual
consistency evaluation may help alleviate this limitation by accelerating
evaluation cycles, filtering inconsistent outputs and augmenting training data.
While attracting increasing attention, such evaluation metrics are usually
developed and evaluated in silos for a single task or dataset, slowing their
adoption. Moreover, previous meta-evaluation protocols focused on system-level
correlations with human annotations, which leave the example-level accuracy of
such metrics unclear. In this work, we introduce TRUE: a comprehensive study of
factual consistency metrics on a standardized collection of existing texts from
diverse tasks, manually annotated for factual consistency. Our standardization
enables an example-level meta-evaluation protocol that is more actionable and
interpretable than previously reported correlations, yielding clearer quality
measures. Across diverse state-of-the-art metrics and 11 datasets we find that
large-scale NLI and question generation-and-answering-based approaches achieve
strong and complementary results. We recommend those methods as a starting
point for model and metric developers, and hope TRUE will foster progress
towards even better methods.
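The example-level meta-evaluation protocol described above treats each annotated example as a binary classification instance (consistent vs. inconsistent) and scores a metric by how well it separates the two classes, typically via ROC AUC. A minimal sketch of that computation, using hypothetical toy data rather than anything from the paper:

```python
# Sketch (not the authors' code): example-level meta-evaluation in the style
# of TRUE. Each example has a binary human label (1 = factually consistent)
# and a continuous metric score; ROC AUC gives a threshold-free quality
# measure for the metric.

def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen consistent example is scored above a randomly chosen
    inconsistent one (ties count as half a win)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need both consistent and inconsistent examples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical annotations: 1 = consistent, 0 = inconsistent.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.7, 0.6, 0.4, 0.3]  # higher = metric judges more consistent
print(roc_auc(labels, scores))  # → 1.0 (metric perfectly separates the classes)
```

An AUC of 0.5 corresponds to a metric no better than chance; in practice one would compute this per dataset and compare metrics on the same standardized collection.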
Related papers
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models [4.953092503184905]
This work proposes DCR, an automated framework for evaluating and improving the consistency of Large Language Models (LLMs) generated texts.
We introduce an automatic metric converter (AMC) that translates the output from DCE into an interpretable numeric score.
Our approach also reduces output inconsistencies by nearly 90%, showing promise for effective hallucination mitigation.
arXiv Detail & Related papers (2024-01-04T08:34:16Z)
- ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z)
- ED-FAITH: Evaluating Dialogue Summarization on Faithfulness [35.73012379398233]
We first present a systematic study of faithfulness metrics for dialogue summarization.
We observe that most metrics correlate poorly with human judgements despite performing well on news datasets.
We propose T0-Score -- a new metric for faithfulness evaluation.
arXiv Detail & Related papers (2022-11-15T19:33:50Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- Spurious Correlations in Reference-Free Evaluation of Text Generation [35.80256755393739]
We show that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length.
We demonstrate that these errors can be mitigated by explicitly designing evaluation metrics to avoid spurious features in reference-free evaluation.
arXiv Detail & Related papers (2022-04-21T05:32:38Z)
- Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
arXiv Detail & Related papers (2021-09-19T19:05:00Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.