BARTScore: Evaluating Generated Text as Text Generation
- URL: http://arxiv.org/abs/2106.11520v1
- Date: Tue, 22 Jun 2021 03:20:53 GMT
- Title: BARTScore: Evaluating Generated Text as Text Generation
- Authors: Weizhe Yuan and Graham Neubig and Pengfei Liu
- Abstract summary: We conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models.
We operationalize this idea using BART, an encoder-decoder based pre-trained model.
We propose a metric BARTScore with a number of variants that can be flexibly applied to evaluation of text from different perspectives.
- Score: 89.50052670307434
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: A wide variety of NLP applications, such as machine translation,
summarization, and dialog, involve text generation. One major challenge for
these applications is how to evaluate whether such generated texts are actually
fluent, accurate, or effective. In this work, we conceptualize the evaluation
of generated text as a text generation problem, modeled using pre-trained
sequence-to-sequence models. The general idea is that models trained to convert
the generated text to/from a reference output or the source text will achieve
higher scores when the generated text is better. We operationalize this idea
using BART, an encoder-decoder based pre-trained model, and propose a metric
BARTScore with a number of variants that can be flexibly applied in an
unsupervised fashion to evaluation of text from different perspectives (e.g.
informativeness, fluency, or factuality). BARTScore is conceptually simple and
empirically effective. It can outperform existing top-scoring metrics in 16 of
22 test settings, covering evaluation of 16 datasets (e.g., machine
translation, text summarization) and 7 different perspectives (e.g.,
informativeness, factuality). Code to calculate BARTScore is available at
https://github.com/neulab/BARTScore, and we have released an interactive
leaderboard for meta-evaluation at
http://explainaboard.nlpedia.ai/leaderboard/task-meval/ on the ExplainaBoard
platform, which allows us to interactively understand the strengths,
weaknesses, and complementarity of each metric.
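To make the formulation concrete: BARTScore scores a hypothesis y against a conditioning text x as a weighted sum of token log-probabilities, sum_t w_t log p(y_t | y_<t, x). Below is a minimal sketch of this computation with uniform weights (i.e., average token log-likelihood), assuming the Hugging Face transformers library; the function name and the facebook/bart-large-cnn checkpoint are illustrative choices, not the reference implementation from the repository linked above.

```python
# Minimal sketch of the BARTScore idea: score a hypothesis by the average
# log-probability BART assigns to it when conditioned on another text.
# Illustrative only -- not the reference implementation from
# https://github.com/neulab/BARTScore.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

CHECKPOINT = "facebook/bart-large-cnn"  # checkpoint choice is an assumption
tokenizer = BartTokenizer.from_pretrained(CHECKPOINT)
model = BartForConditionalGeneration.from_pretrained(CHECKPOINT)
model.eval()

def bart_score(source: str, target: str) -> float:
    """Mean log p(target_token | target_prefix, source) under BART."""
    src = tokenizer(source, return_tensors="pt", truncation=True)
    tgt = tokenizer(target, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(input_ids=src.input_ids,
                    attention_mask=src.attention_mask,
                    labels=tgt.input_ids)
    # The built-in loss is the mean token-level negative log-likelihood of
    # the labels, so its negation is the average log-probability we want.
    return -out.loss.item()

# Source-to-hypothesis direction, i.e., a faithfulness-style score.
print(bart_score("The cat sat on the mat.", "A cat was sitting on a mat."))
```

Feeding different pairs to the same computation yields the paper's variants: scoring the hypothesis given the source targets faithfulness, while scoring between hypothesis and reference in each direction gives precision- and recall-style measures that can be combined into an F-score-style measure.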
Related papers
- CEval: A Benchmark for Evaluating Counterfactual Text Generation [2.899704155417792]
We propose CEval, a benchmark for comparing counterfactual text generation methods.
Our experiments found no perfect method for generating counterfactual text.
By making CEval available as an open-source Python library, we encourage the community to contribute more methods.
arXiv Detail & Related papers (2024-04-26T15:23:47Z)
- Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z)
- Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)
- DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer [94.35116535588332]
Transformer-based methods, which predict polygon points or Bézier curve control points to localize text, are popular in scene text detection.
However, the point label form used implies a human reading order, which hurts the robustness of the Transformer model.
We propose DPText-DETR, which directly uses point coordinates as queries and dynamically updates them between decoder layers.
arXiv Detail & Related papers (2022-07-10T15:45:16Z)
- DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence [30.10146423935216]
We introduce DiscoScore, a discourse metric, which uses BERT to model discourse coherence from different perspectives.
Our experiments encompass 16 non-discourse and discourse metrics, including DiscoScore and popular coherence models.
arXiv Detail & Related papers (2022-01-26T20:28:26Z)
- Automatic Text Evaluation through the Lens of Wasserstein Barycenters [24.71226781348407]
A new metric, BaryScore, is introduced to evaluate text generation based on deep contextualized embeddings.
Our results show that BaryScore outperforms other BERT-based metrics and exhibits more consistent behaviour, in particular for text summarization (a generic sketch of the underlying optimal-transport computation appears after this list).
arXiv Detail & Related papers (2021-08-27T19:08:52Z)
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
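As a supplement to the Wasserstein Barycenters entry above: metrics in that line of work build on an optimal-transport cost between the contextualized embedding clouds of two texts. The sketch below shows that generic computation (a pairwise transport cost rather than a barycenter), assuming BERT via the transformers library and the POT optimal-transport package; it is not the BaryScore implementation.

```python
# Generic sketch of optimal-transport text comparison, in the spirit of
# (but not identical to) BaryScore: each text becomes a uniform distribution
# over its BERT token embeddings, and texts are compared by transport cost.
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

def embed(text: str) -> np.ndarray:
    """Contextualized token embeddings from BERT's last hidden layer."""
    with torch.no_grad():
        out = bert(**tok(text, return_tensors="pt"))
    return out.last_hidden_state[0].numpy()

def ot_cost(a: str, b: str) -> float:
    """Optimal-transport cost between two texts' embedding clouds."""
    xa, xb = embed(a), embed(b)
    wa = np.full(len(xa), 1.0 / len(xa))  # uniform weight per token
    wb = np.full(len(xb), 1.0 / len(xb))
    M = ot.dist(xa, xb, metric="euclidean")  # pairwise cost matrix
    return ot.emd2(wa, wb, M)  # lower cost = more similar texts

print(ot_cost("The cat sat on the mat.", "A cat was sitting on a mat."))
```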