DecompEval: Evaluating Generated Texts as Unsupervised Decomposed
Question Answering
- URL: http://arxiv.org/abs/2307.06869v1
- Date: Thu, 13 Jul 2023 16:16:51 GMT
- Title: DecompEval: Evaluating Generated Texts as Unsupervised Decomposed
Question Answering
- Authors: Pei Ke, Fei Huang, Fei Mi, Yasheng Wang, Qun Liu, Xiaoyan Zhu, Minlie
Huang
- Abstract summary: Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
- Score: 95.89707479748161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing evaluation metrics for natural language generation (NLG) tasks face
challenges in generalization ability and interpretability. Specifically,
most well-performing metrics must be trained on evaluation datasets of
specific NLG tasks and evaluation dimensions, which may cause over-fitting
to task-specific datasets. Furthermore, existing metrics only provide an
evaluation score for each dimension without revealing the evidence to interpret
how this score is obtained. To deal with these challenges, we propose a simple
yet effective metric called DecompEval. This metric formulates NLG evaluation
as an instruction-style question answering task and utilizes instruction-tuned
pre-trained language models (PLMs) without training on evaluation datasets,
aiming to enhance generalization ability. To make the evaluation process
more interpretable, we decompose our devised instruction-style question about
the quality of generated texts into subquestions that measure the quality of
each sentence. The subquestions, with their answers generated by PLMs, are
then recomposed as evidence to obtain the evaluation result. Experimental
results show that DecompEval achieves state-of-the-art performance among
untrained metrics for evaluating text summarization and dialogue generation,
and also exhibits strong dimension-level and task-level generalization ability
and interpretability.
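As a rough illustration of the pipeline the abstract describes, the sketch below decomposes a quality question into per-sentence subquestions, answers them with an instruction-tuned PLM, and recomposes the answers as evidence for the final judgement. The choice of FLAN-T5, the prompt wording, the naive sentence splitting, and the yes/no probability scoring are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of a DecompEval-style decompose-then-recompose evaluation.
# Model, prompts, and scoring rule are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def yes_probability(prompt: str) -> float:
    """P('yes') vs. P('no') for the first generated token of the answer."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
    yes_id = tokenizer("yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("no", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits[[yes_id, no_id]], dim=-1)[0].item()

def decomp_eval(context: str, generated: str, dimension: str = "coherent") -> float:
    """Score `generated` along `dimension` via decomposed question answering."""
    sentences = [s.strip() for s in generated.split(".") if s.strip()]
    # Step 1: answer one subquestion per sentence of the generated text.
    evidence = []
    for i, sent in enumerate(sentences, 1):
        subq = (f"Context: {context}\nResponse: {generated}\n"
                f"Question: Is sentence {i} (\"{sent}\") {dimension} with the context? "
                f"Answer yes or no.")
        evidence.append(f"sentence {i}: {'yes' if yes_probability(subq) > 0.5 else 'no'}")
    # Step 2: recompose the subquestion answers as evidence for the overall question.
    final_q = (f"Context: {context}\nResponse: {generated}\n"
               f"Evidence: {'; '.join(evidence)}\n"
               f"Question: Is the response {dimension} with the context? Answer yes or no.")
    return yes_probability(final_q)

print(decomp_eval("A: How was your trip to Kyoto?", "It was great. The food was amazing."))
```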
Related papers
- Systematic Task Exploration with LLMs: A Study in Citation Text Generation [63.50597360948099]
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks.
We propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement.
We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric.
arXiv Detail & Related papers (2024-07-04T16:41:08Z) - Evaluation of Instruction-Following Ability for Large Language Models on Story-Ending Generation [2.4889060833127665]
In this paper, we focus on evaluating the instruction-following ability of Large Language Models (LLMs) in the context of story-ending generation.
We propose an automatic evaluation pipeline that utilizes a machine reading comprehension (MRC) model to determine whether the generated story-ending reflects the instruction.
arXiv Detail & Related papers (2024-06-24T06:53:36Z) - Evaluation Metrics of Language Generation Models for Synthetic Traffic
Generation Tasks [22.629816738693254]
We show that common NLG metrics, like BLEU, are not suitable for evaluating Synthetic Traffic Generation (STG).
We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts.
arXiv Detail & Related papers (2023-11-21T11:26:26Z) - Automatic Evaluation of Generative Models with Instruction Tuning [14.369719297698694]
A recent paradigm fine-tunes pre-trained language models to emulate human judgements for a particular task and evaluation criterion.
Inspired by the generalization ability of instruction-tuned models, we propose a learned metric based on instruction tuning.
arXiv Detail & Related papers (2023-10-30T23:00:52Z) - INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained
Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, like BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - CTRLEval: An Unsupervised Reference-Free Metric for Evaluating
Controlled Text Generation [85.03709740727867]
We propose an unsupervised reference-free metric called CTRLEval to evaluate controlled text generation models.
CTRLEval assembles the generation probabilities from a pre-trained language model without any model training (a generic sketch of this untrained LM-scoring idea appears after the Related papers list below).
Experimental results show that our metric has higher correlations with human judgments than other baselines.
arXiv Detail & Related papers (2022-04-02T13:42:49Z) - GRUEN for Evaluating Linguistic Quality of Generated Text [17.234442722611803]
We propose GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text.
GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output.
arXiv Detail & Related papers (2020-10-06T05:59:25Z) - A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z)
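As noted in the CTRLEval entry above, several untrained metrics score generated text with probabilities from a pre-trained language model and no metric-specific training. The sketch below illustrates only that shared idea via a length-normalized log-likelihood under GPT-2; it is not CTRLEval's actual text-infilling formulation, and the model choice is an assumption.

```python
# Generic untrained-metric sketch: score text by its average token
# log-probability under a pre-trained LM (higher is better). This is an
# illustration of the shared idea only, not CTRLEval's actual formulation.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def lm_score(text: str) -> float:
    """Length-normalized log-likelihood of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # `loss` is the mean token-level cross-entropy when labels == inputs.
        loss = model(ids, labels=ids).loss
    return -loss.item()

# A fluent sentence should score higher than a shuffled one.
print(lm_score("The movie was well paced and the ending felt earned."))
print(lm_score("Movie the paced well and ending the earned felt."))
```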