BARTScore: Evaluating Generated Text as Text Generation
- URL: http://arxiv.org/abs/2106.11520v1
- Date: Tue, 22 Jun 2021 03:20:53 GMT
- Title: BARTScore: Evaluating Generated Text as Text Generation
- Authors: Weizhe Yuan and Graham Neubig and Pengfei Liu
- Abstract summary: We conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models.
We operationalize this idea using BART, an encoder-decoder based pre-trained model.
We propose a metric BARTScore with a number of variants that can be flexibly applied to evaluation of text from different perspectives.
- Score: 89.50052670307434
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: A wide variety of NLP applications, such as machine translation,
summarization, and dialog, involve text generation. One major challenge for
these applications is how to evaluate whether such generated texts are actually
fluent, accurate, or effective. In this work, we conceptualize the evaluation
of generated text as a text generation problem, modeled using pre-trained
sequence-to-sequence models. The general idea is that models trained to convert
the generated text to/from a reference output or the source text will achieve
higher scores when the generated text is better. We operationalize this idea
using BART, an encoder-decoder based pre-trained model, and propose a metric
BARTScore with a number of variants that can be flexibly applied in an
unsupervised fashion to evaluation of text from different perspectives (e.g.
informativeness, fluency, or factuality). BARTScore is conceptually simple and
empirically effective. It can outperform existing top-scoring metrics in 16 of
22 test settings, covering evaluation of 16 datasets (e.g., machine
translation, text summarization) and 7 different perspectives (e.g.,
informativeness, factuality). Code to calculate BARTScore is available at
https://github.com/neulab/BARTScore, and we have released an interactive
leaderboard for meta-evaluation at
http://explainaboard.nlpedia.ai/leaderboard/task-meval/ on the ExplainaBoard
platform, which allows us to interactively understand the strengths,
weaknesses, and complementarity of each metric.
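To make the formulation concrete: BARTScore scores a hypothesis y against a conditioning text x as a weighted sum of token log-probabilities, sum_t w_t log p(y_t | y_<t, x). Below is a minimal sketch of this computation with uniform weights (i.e., average token log-likelihood), assuming the Hugging Face transformers library; the function name and the facebook/bart-large-cnn checkpoint are illustrative choices, not the reference implementation from the repository linked above.

```python
# Minimal sketch of the BARTScore idea: score a hypothesis by the average
# log-probability BART assigns to it when conditioned on another text.
# Illustrative only -- not the reference implementation from
# https://github.com/neulab/BARTScore.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

CHECKPOINT = "facebook/bart-large-cnn"  # checkpoint choice is an assumption
tokenizer = BartTokenizer.from_pretrained(CHECKPOINT)
model = BartForConditionalGeneration.from_pretrained(CHECKPOINT)
model.eval()

def bart_score(source: str, target: str) -> float:
    """Mean log p(target_token | target_prefix, source) under BART."""
    src = tokenizer(source, return_tensors="pt", truncation=True)
    tgt = tokenizer(target, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(input_ids=src.input_ids,
                    attention_mask=src.attention_mask,
                    labels=tgt.input_ids)
    # The built-in loss is the mean token-level negative log-likelihood of
    # the labels, so its negation is the average log-probability we want.
    return -out.loss.item()

# Source-to-hypothesis direction, i.e., a faithfulness-style score.
print(bart_score("The cat sat on the mat.", "A cat was sitting on a mat."))
```

Feeding different pairs to the same computation yields the paper's variants: scoring the hypothesis given the source targets faithfulness, while scoring between hypothesis and reference in each direction gives precision- and recall-style measures that can be combined into an F-score-style measure.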
Related papers
- CEval: A Benchmark for Evaluating Counterfactual Text Generation [2.899704155417792]
We propose CEval, a benchmark for comparing counterfactual text generation methods.
Our experiments found no perfect method for generating counterfactual text.
By making CEval available as an open-source Python library, we encourage the community to contribute more methods.
arXiv Detail & Related papers (2024-04-26T15:23:47Z)
- Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z)
- Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)
- DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer [94.35116535588332]
Transformer-based methods, which predict polygon points or Bézier curve control points to localize text, are popular in scene text detection.
However, the point label form used implies a human reading order, which hurts the robustness of the Transformer model.
We propose DPText-DETR, which directly uses point coordinates as queries and dynamically updates them between decoder layers.
arXiv Detail & Related papers (2022-07-10T15:45:16Z)
- DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence [30.10146423935216]
We introduce DiscoScore, a discourse metric, which uses BERT to model discourse coherence from different perspectives.
Our experiments encompass 16 non-discourse and discourse metrics, including DiscoScore and popular coherence models.
arXiv Detail & Related papers (2022-01-26T20:28:26Z)
- Automatic Text Evaluation through the Lens of Wasserstein Barycenters [24.71226781348407]
A new metric, BaryScore, is introduced to evaluate text generation based on deep contextualized embeddings.
Our results show that BaryScore outperforms other BERT-based metrics and exhibits more consistent behaviour, in particular for text summarization (a generic sketch of the underlying optimal-transport computation appears after this list).
arXiv Detail & Related papers (2021-08-27T19:08:52Z)
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
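As a supplement to the Wasserstein Barycenters entry above: metrics in that line of work build on an optimal-transport cost between the contextualized embedding clouds of two texts. The sketch below shows that generic computation (a pairwise transport cost rather than a barycenter), assuming BERT via the transformers library and the POT optimal-transport package; it is not the BaryScore implementation.

```python
# Generic sketch of optimal-transport text comparison, in the spirit of
# (but not identical to) BaryScore: each text becomes a uniform distribution
# over its BERT token embeddings, and texts are compared by transport cost.
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

def embed(text: str) -> np.ndarray:
    """Contextualized token embeddings from BERT's last hidden layer."""
    with torch.no_grad():
        out = bert(**tok(text, return_tensors="pt"))
    return out.last_hidden_state[0].numpy()

def ot_cost(a: str, b: str) -> float:
    """Optimal-transport cost between two texts' embedding clouds."""
    xa, xb = embed(a), embed(b)
    wa = np.full(len(xa), 1.0 / len(xa))  # uniform weight per token
    wb = np.full(len(xb), 1.0 / len(xb))
    M = ot.dist(xa, xb, metric="euclidean")  # pairwise cost matrix
    return ot.emd2(wa, wb, M)  # lower cost = more similar texts

print(ot_cost("The cat sat on the mat.", "A cat was sitting on a mat."))
```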