CTRLEval: An Unsupervised Reference-Free Metric for Evaluating
Controlled Text Generation
- URL: http://arxiv.org/abs/2204.00862v1
- Date: Sat, 2 Apr 2022 13:42:49 GMT
- Title: CTRLEval: An Unsupervised Reference-Free Metric for Evaluating
Controlled Text Generation
- Authors: Pei Ke, Hao Zhou, Yankai Lin, Peng Li, Jie Zhou, Xiaoyan Zhu, Minlie
Huang
- Abstract summary: We propose an unsupervised reference-free metric called CTRLEval to evaluate controlled text generation models.
CTRLEval assembles the generation probabilities from a pre-trained language model without any model training.
Experimental results show that our metric has higher correlations with human judgments than other baselines.
- Score: 85.03709740727867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing reference-free metrics have obvious limitations for evaluating
controlled text generation models. Unsupervised metrics can only provide a
task-agnostic evaluation result which correlates weakly with human judgments,
whereas supervised ones may overfit task-specific data with poor generalization
ability to other datasets. In this paper, we propose an unsupervised
reference-free metric called CTRLEval, which evaluates controlled text
generation from different aspects by formulating each aspect into multiple text
infilling tasks. On top of these tasks, the metric assembles the generation
probabilities from a pre-trained language model without any model training.
Experimental results show that our metric has higher correlations with human
judgments than other baselines, while obtaining better generalization of
evaluating generated texts from different models and with different qualities.
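To make the idea above concrete, here is a minimal sketch of scoring a generated text by casting one evaluation aspect as text infilling tasks and averaging a pretrained encoder-decoder LM's log-probabilities over the infilled spans. The model choice (t5-base), the simple sentence-masking pattern, and the helper names are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch: score a generated text by casting an evaluation aspect as
# text infilling and averaging the pretrained LM's log-probabilities of the
# infilled spans. Model choice (t5-base) and the masking pattern are assumptions.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

def infill_logprob(masked_text: str, target_span: str) -> float:
    """Length-normalized log-probability of filling the masked slot with target_span."""
    enc = tokenizer(masked_text, return_tensors="pt")
    # T5-style infilling target: sentinel token followed by the span to fill in.
    dec = tokenizer("<extra_id_0> " + target_span, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc, labels=dec.input_ids).logits.log_softmax(dim=-1)
    token_logprobs = logits[0].gather(1, dec.input_ids[0].unsqueeze(-1)).squeeze(-1)
    return token_logprobs.mean().item()

def mask_each_sentence(sentences):
    """One simple infilling pattern: mask each sentence in turn (a simplification)."""
    for i, sent in enumerate(sentences):
        masked = " ".join("<extra_id_0>" if j == i else s for j, s in enumerate(sentences))
        yield masked, sent

def aspect_score(sentences) -> float:
    """Assemble (here: average) the infilling log-probabilities over all patterns."""
    scores = [infill_logprob(m, s) for m, s in mask_each_sentence(sentences)]
    return sum(scores) / len(scores)
```

A higher average log-probability means the pretrained LM finds each span plausible given its surrounding context, which is the kind of training-free, reference-free signal the abstract describes.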
Related papers
- Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques [5.011488335517782]
Rerunning a metric-based evaluation should be more straightforward, and its results should be closer to the original, than in a human-based evaluation.
This report shows, however, that such reruns do not always reproduce the original results, and can reveal errors in the reporting of the original work.
arXiv Detail & Related papers (2024-05-13T16:02:57Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
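A minimal sketch of that decompose-then-recompose flow, assuming a hypothetical yes_probability helper that returns P("yes") from an instruction-tuned PLM; the prompt wording and the averaging step are assumptions, not the paper's exact templates:

```python
# Hedged sketch of the DecompEval idea: turn an NLG evaluation question into
# per-sentence instruction-style subquestions, collect answers from a PLM, and
# recompose them into an overall score. Prompt wording and the `yes_probability`
# helper are illustrative assumptions, not the paper's exact setup.
from typing import Callable, List

def decomp_eval(context: str,
                generated_sentences: List[str],
                aspect: str,
                yes_probability: Callable[[str], float]) -> float:
    """`yes_probability(prompt)` is assumed to return P("yes") from an
    instruction-tuned PLM (e.g., from its distribution over "yes"/"no")."""
    subquestion_scores = []
    for i, sent in enumerate(generated_sentences):
        prompt = (
            f"Context: {context}\n"
            f"Sentence {i + 1}: {sent}\n"
            f"Question: Is this sentence {aspect} with respect to the context? "
            f"Answer yes or no."
        )
        subquestion_scores.append(yes_probability(prompt))
    # Recompose the subquestion answers as evidence for the overall judgment.
    return sum(subquestion_scores) / max(len(subquestion_scores), 1)
```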
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between sentence pairs on held-out datasets from the 7 NLP tasks align well with human evaluation.
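As a loose illustration of a mismatch-based scheme, the sketch below aggregates typed mismatch errors between a pair of texts into a single penalty; the error-type names and severity weights are hypothetical placeholders, not the paper's set of 13 error types.

```python
# Hedged sketch: aggregate typed mismatch errors between a pair of texts into a
# single penalty. The error-type names and severity weights are illustrative
# placeholders; the paper defines its own set of 13 mismatch error types.
from typing import Dict

# Hypothetical severity weights for a few illustrative mismatch types.
SEVERITY = {"entity_mismatch": 2.0, "negation_mismatch": 3.0, "number_mismatch": 1.5}

def mismatch_penalty(error_counts: Dict[str, int]) -> float:
    """Weighted count of detected mismatch errors (lower is better)."""
    return sum(SEVERITY.get(t, 1.0) * n for t, n in error_counts.items())

# Example: two detected mismatches between a generated text and its source.
print(mismatch_penalty({"entity_mismatch": 1, "negation_mismatch": 1}))  # 5.0
```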
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing both a score for the generated text and a human-readable diagnostic report.
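A minimal sketch of this score-plus-report interface, assuming a hypothetical evaluator_generate helper wrapping the fine-tuned model; the prompt wording and the "Score: <n>" output format are assumptions for illustration, not the paper's actual template.

```python
# Hedged sketch of an InstructScore-style interface: prompt an instruction-tuned
# evaluator LM with the source and the candidate text, then parse its free-form
# output into a numeric score plus a diagnostic report. The prompt, the
# `evaluator_generate` helper, and the output format are assumptions.
import re
from typing import Callable, Tuple

def instruct_style_evaluate(source: str, candidate: str,
                            evaluator_generate: Callable[[str], str]) -> Tuple[float, str]:
    prompt = (
        "You are a text generation evaluator.\n"
        f"Source: {source}\n"
        f"Candidate: {candidate}\n"
        "List the errors in the candidate and end with a line 'Score: <0-100>'."
    )
    report = evaluator_generate(prompt)  # assumed wrapper around the fine-tuned model
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", report)
    score = float(match.group(1)) if match else float("nan")
    return score, report
```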
arXiv Detail & Related papers (2023-05-23T17:27:22Z)
- On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgments without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
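A minimal sketch of this comparison protocol, using SciPy's standard correlation functions on made-up ratings (the numbers and metric names are placeholders):

```python
# Hedged sketch: compare metrics by how well their scores correlate with
# human ratings. The data below is fabricated purely for illustration.
from scipy.stats import kendalltau, pearsonr

human_ratings = [4.5, 2.0, 3.5, 1.0, 5.0]  # hypothetical per-segment human scores
metric_scores = {
    "sescore_like": [0.91, 0.35, 0.70, 0.15, 0.97],
    "baseline":     [0.60, 0.55, 0.58, 0.40, 0.75],
}

for name, scores in metric_scores.items():
    tau, _ = kendalltau(scores, human_ratings)
    r, _ = pearsonr(scores, human_ratings)
    print(f"{name}: Kendall tau={tau:.3f}, Pearson r={r:.3f}")
```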
arXiv Detail & Related papers (2022-10-10T22:30:26Z)