DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence
- URL: http://arxiv.org/abs/2201.11176v2
- Date: Fri, 28 Jan 2022 13:12:02 GMT
- Title: DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence
- Authors: Wei Zhao, Michael Strube, Steffen Eger
- Abstract summary: We introduce DiscoScore, a discourse metric, which uses BERT to model discourse coherence from different perspectives.
Our experiments encompass 16 non-discourse and discourse metrics, including DiscoScore and popular coherence models.
- Score: 30.10146423935216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been a growing interest in designing text generation
systems from a discourse coherence perspective, e.g., modeling the
interdependence between sentences. Still, recent BERT-based evaluation metrics
cannot recognize coherence and fail to penalize incoherent elements in system
outputs. In this work, we introduce DiscoScore, a parametrized discourse
metric, which uses BERT to model discourse coherence from different
perspectives, driven by Centering theory. Our experiments encompass 16
non-discourse and discourse metrics, including DiscoScore and popular coherence
models, evaluated on summarization and document-level machine translation (MT).
We find that (i) the majority of BERT-based metrics correlate much worse with
human-rated coherence than early discourse metrics, invented a decade ago; (ii)
the recent state-of-the-art BARTScore is weak when operated at system level --
which is particularly problematic as systems are typically compared in this
manner. DiscoScore, in contrast, achieves strong system-level correlation with
human ratings, not only in coherence but also in factual consistency and other
aspects, and surpasses BARTScore by over 10 correlation points on average.
Further, aiming to understand DiscoScore, we provide justifications for the
importance of discourse coherence for evaluation metrics, and explain the
superiority of one variant over another. Our code is available at
https://github.com/AIPHES/DiscoScore.
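To make the idea concrete, here is a minimal, hypothetical sketch of scoring inter-sentence coherence with BERT. It is not the actual DiscoScore implementation (the Centering-driven DS-Focus and DS-Sent variants live in the linked repository); it simply scores a text by the average similarity of adjacent BERT sentence representations, assuming the `transformers` and `torch` packages.

```python
# Minimal, illustrative sketch only -- NOT the actual DiscoScore implementation
# (see https://github.com/AIPHES/DiscoScore for the real DS-Focus/DS-Sent metrics).
# It approximates inter-sentence coherence as the average cosine similarity
# between BERT representations of adjacent sentences.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()


def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the last hidden layer of BERT as a sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)


def toy_coherence(sentences: list[str]) -> float:
    """Average adjacent-sentence similarity: a crude proxy for discourse coherence."""
    vectors = [embed(s) for s in sentences]
    sims = [
        torch.cosine_similarity(a, b, dim=0).item()
        for a, b in zip(vectors, vectors[1:])
    ]
    return sum(sims) / len(sims)


print(toy_coherence([
    "The committee approved the budget.",
    "It will fund three new schools.",
    "The schools are expected to open in 2025.",
]))
```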
Related papers
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
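A rough sketch of the sentence-level comparison SBERTScore describes, assuming the `sentence-transformers` package and the `all-MiniLM-L6-v2` checkpoint (the model choice and the max-then-mean aggregation are assumptions, not the paper's exact formulation): each summary sentence is scored by its best-matching source sentence.

```python
# Illustrative sketch of sentence-level factuality scoring in the spirit of
# SBERTScore; the aggregation (max over source sentences, then mean) is an
# assumption, not the paper's exact formulation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def sbert_style_score(summary_sents, source_sents):
    summ_emb = model.encode(summary_sents, convert_to_tensor=True)
    src_emb = model.encode(source_sents, convert_to_tensor=True)
    sims = util.cos_sim(summ_emb, src_emb)      # (n_summary, n_source)
    best_per_summary = sims.max(dim=1).values   # best supporting source sentence
    return best_per_summary.mean().item()


print(sbert_style_score(
    ["The company reported record profits in 2023."],
    ["In 2023 the firm posted its highest profit ever.",
     "It also opened two new offices."],
))
```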
arXiv Detail & Related papers (2024-09-23T15:02:38Z)
- CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems [43.5428962271088]
We propose a novel metric, called CausalScore, which assesses the relevance of responses by measuring the causal strength between dialogue histories and responses.
Our experimental results demonstrate that CausalScore significantly surpasses existing state-of-the-art metrics by aligning better with human judgements.
arXiv Detail & Related papers (2024-06-25T06:08:16Z)
- C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation [68.59356746305255]
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
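A hypothetical sketch of a PMI-style relevance score with GPT-2, comparing the likelihood of a response with and without the dialogue context; this is a simplification, not the paper's exact conditional PMI formulation, and the model choice is an assumption.

```python
# PMI-style relevance sketch: log p(response | context) - log p(response),
# estimated with GPT-2. Illustrative only; not the exact C-PMI definition.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()


def log_likelihood(text: str, prefix: str = "") -> float:
    """Sum of log p(token | preceding tokens) over the tokens of `text` only."""
    prefix_ids = tokenizer(prefix)["input_ids"] if prefix else [tokenizer.bos_token_id]
    text_ids = tokenizer(text)["input_ids"]
    input_ids = torch.tensor([prefix_ids + text_ids])
    with torch.no_grad():
        logits = model(input_ids).logits.log_softmax(dim=-1)
    total = 0.0
    for pos in range(len(prefix_ids), len(prefix_ids) + len(text_ids)):
        token_id = input_ids[0, pos]
        total += logits[0, pos - 1, token_id].item()  # token predicted from position pos-1
    return total


context = "A: Did you watch the match last night?\nB:"
response = " Yes, it went to penalties and I could barely watch."
pmi = log_likelihood(response, prefix=context) - log_likelihood(response)
print(pmi)  # higher values suggest the response actually depends on the context
```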
arXiv Detail & Related papers (2023-06-27T06:58:03Z)
- Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
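A toy version of this stress test, assuming the `bert-score` package: inject a synthetic error (here a simple negation) into an otherwise correct candidate and check whether the metric's score drops by a commensurate amount.

```python
# Toy stress test in the spirit of the blind-spot analysis: apply a synthetic
# error to a correct candidate and check whether the metric score drops.
# The negation edit and the use of BERTScore are illustrative assumptions.
from bert_score import score

reference = ["The vaccine was approved after the trial succeeded."]
correct = ["The vaccine was approved after the trial succeeded."]
negated = ["The vaccine was not approved after the trial succeeded."]

_, _, f_correct = score(correct, reference, lang="en")
_, _, f_negated = score(negated, reference, lang="en")

drop = (f_correct - f_negated).item()
print(f"score drop after negation: {drop:.4f}")
# A robust metric should penalize the negated candidate clearly; a near-zero
# drop reveals an insensitivity (a "blind spot") to this error type.
```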
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
We introduce a new metric called SMART to mitigate the limitations of token-level matching.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that, with a model-based matching function, our proposed metric outperforms all competing metrics in system-level correlation.
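A rough sketch of sentence-level soft matching in this spirit, reusing a Sentence-BERT encoder as the matching function (an assumption; SMART's matching functions and aggregation differ): greedy best matches in both directions are combined into an F-measure.

```python
# Simplified sentence-level soft matching: each candidate sentence is matched to
# its most similar reference sentence and vice versa, then combined into an
# F-measure. The encoder and aggregation are assumptions, not SMART's exact setup.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")


def sentence_level_f1(candidate_sents, reference_sents):
    cand = encoder.encode(candidate_sents, convert_to_tensor=True)
    ref = encoder.encode(reference_sents, convert_to_tensor=True)
    sims = util.cos_sim(cand, ref)                    # (n_cand, n_ref)
    precision = sims.max(dim=1).values.mean().item()  # candidate -> reference
    recall = sims.max(dim=0).values.mean().item()     # reference -> candidate
    return 2 * precision * recall / (precision + recall)


print(sentence_level_f1(
    ["The storm closed the airport.", "Flights resumed the next morning."],
    ["An airport shutdown was caused by the storm.", "Service restarted a day later."],
))
```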
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- QRelScore: Better Evaluating Generated Questions with Deeper Understanding of Context-aware Relevance [54.48031346496593]
We propose QRelScore, a context-aware relevance evaluation metric for generated questions.
Based on off-the-shelf language models such as BERT and GPT2, QRelScore employs both word-level hierarchical matching and sentence-level prompt-based generation.
Compared with existing metrics, our experiments demonstrate that QRelScore is able to achieve a higher correlation with human judgments while being much more robust to adversarial samples.
arXiv Detail & Related papers (2022-04-29T07:39:53Z)
- EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching [90.98122161162644]
Current metrics for video captioning are mostly based on the text-level comparison between reference and candidate captions.
We propose EMScore (Embedding Matching-based score), a novel reference-free metric for video captioning.
We exploit a well pre-trained vision-language model to extract visual and linguistic embeddings for computing EMScore.
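A coarse-grained sketch of the idea, assuming the `transformers` CLIP checkpoint `openai/clip-vit-base-patch32` and a single sampled video frame at a hypothetical path `frame_000.jpg`: the caption is scored by the similarity of its text embedding to the frame's visual embedding. EMScore itself additionally performs fine-grained matching over many frames.

```python
# Coarse-grained, reference-free matching in the spirit of EMScore: embed a
# caption and a sampled video frame with a pre-trained vision-language model
# and score their cosine similarity. The model and the single-frame setup are
# illustrative assumptions; frame_000.jpg is a hypothetical sampled frame.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

caption = "a man is slicing vegetables in a kitchen"
frame = Image.open("frame_000.jpg")  # hypothetical frame sampled from the video

inputs = processor(text=[caption], images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

text_emb = F.normalize(out.text_embeds, dim=-1)
image_emb = F.normalize(out.image_embeds, dim=-1)
print((text_emb @ image_emb.T).item())  # higher = caption better grounded in the frame
```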
arXiv Detail & Related papers (2021-11-17T06:02:43Z)
- Global Explainability of BERT-Based Evaluation Metrics by Disentangling along Linguistic Factors [14.238125731862658]
We disentangle metric scores along linguistic factors, including semantics, syntax, morphology, and lexical overlap.
We show that the different metrics capture all aspects to some degree, but that they are all substantially sensitive to lexical overlap, just like BLEU and ROUGE.
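One simple way to realize such a disentanglement, sketched here with scikit-learn (the regressor choice and the toy numbers are assumptions, not the paper's method): regress metric scores on per-example scores for the individual linguistic factors and read the learned weights as sensitivities.

```python
# Illustrative disentanglement: regress a metric's scores on per-example factor
# scores (semantics, syntax, morphology, lexical overlap) and interpret the
# coefficients as the metric's sensitivity to each factor. The regressor and
# the toy data are assumptions for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

factors = ["semantics", "syntax", "morphology", "lexical_overlap"]

# Rows: evaluated hypothesis/reference pairs; columns: factor scores in [0, 1].
X = np.array([
    [0.9, 0.8, 0.7, 0.9],
    [0.4, 0.6, 0.8, 0.2],
    [0.7, 0.5, 0.6, 0.8],
    [0.2, 0.3, 0.5, 0.1],
    [0.8, 0.9, 0.9, 0.6],
])
metric_scores = np.array([0.88, 0.30, 0.75, 0.15, 0.70])

reg = LinearRegression().fit(X, metric_scores)
for name, weight in zip(factors, reg.coef_):
    print(f"{name:>15}: {weight:+.3f}")
# A large weight on lexical_overlap would echo the paper's finding that
# BERT-based metrics remain substantially sensitive to surface overlap.
```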
arXiv Detail & Related papers (2021-10-08T22:40:33Z)
- BARTScore: Evaluating Generated Text as Text Generation [89.50052670307434]
We conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models.
We operationalize this idea using BART, an encoder-decoder based pre-trained model.
We propose a metric BARTScore with a number of variants that can be flexibly applied to evaluation of text from different perspectives.
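The core computation can be sketched as the average token-level log-likelihood of the hypothesis given the source under BART, assuming the `transformers` checkpoint `facebook/bart-large-cnn` (one of several possible choices); the full BARTScore also supports other directions and prompt variants.

```python
# Minimal sketch of the BARTScore idea: score a hypothesis by the average
# log-likelihood BART assigns to it when conditioned on the source text.
# The checkpoint and the source->hypothesis direction are just one variant.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()


def bart_style_score(source: str, hypothesis: str) -> float:
    src = tokenizer(source, return_tensors="pt", truncation=True)
    tgt = tokenizer(hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(input_ids=src["input_ids"],
                    attention_mask=src["attention_mask"],
                    labels=tgt["input_ids"])
    # out.loss is the mean token-level negative log-likelihood of the hypothesis.
    return -out.loss.item()


source = "The council voted to close the library due to budget cuts."
print(bart_style_score(source, "The library will close because of budget cuts."))
```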
arXiv Detail & Related papers (2021-06-22T03:20:53Z)