INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback
- URL: http://arxiv.org/abs/2305.14282v3
- Date: Thu, 26 Oct 2023 18:21:30 GMT
- Title: INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback
- Authors: Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag,
William Yang Wang, Lei Li
- Abstract summary: We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human-readable diagnostic report.
- Score: 80.57617091714448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatically evaluating the quality of language generation is critical.
Although recent learned metrics show high correlation with human judgement,
these metrics cannot explain their verdicts or associate the scores with
defects in generated text. To address this limitation, we present
InstructScore, an explainable evaluation metric for text generation. By
harnessing both explicit human instruction and the implicit knowledge of GPT-4,
we fine-tune a text evaluation metric based on LLaMA, producing both a score
for generated text and a human-readable diagnostic report. We evaluate
InstructScore on a variety of generation tasks, including translation,
captioning, data-to-text and commonsense generation. Experiments show that our
7B model surpasses all other unsupervised metrics, including those based on
175B GPT-3 and GPT-4. Surprisingly, our InstructScore, even without direct
supervision from human-rated data, achieves performance levels on par with
state-of-the-art metrics like COMET22, which were fine-tuned on human ratings.
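To make the evaluation workflow concrete, below is a minimal sketch of how a LLaMA-based evaluator of this kind might be queried at inference time. The checkpoint path, prompt template, and report format are illustrative assumptions rather than the released InstructScore artifacts; the real prompt and parsing logic may differ.

```python
# Minimal sketch (not the official release): querying a fine-tuned LLaMA-style
# evaluator that returns a diagnostic report of errors in a candidate text.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/instructscore-style-checkpoint"  # hypothetical checkpoint path

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")

def evaluate(source: str, candidate: str) -> str:
    """Ask the evaluator to list errors in `candidate` given `source`."""
    prompt = (
        "You are evaluating a generated text.\n"
        f"Source: {source}\n"
        f"Candidate: {candidate}\n"
        "List each error with its location, type, severity, and a short explanation:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Drop the prompt tokens and keep only the generated diagnostic report.
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

report = evaluate("Guten Morgen, wie geht es dir?", "Good morning, how old are you?")
print(report)
# A numeric score can then be derived by penalizing each reported major/minor
# error; the exact weighting is a design choice of the metric.
```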
Related papers
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks [44.801746603656504]
We present TIGERScore, a metric that follows Instruction Guidance to perform Explainable and Reference-free evaluation.
Our metric is based on LLaMA-2, trained on our meticulously curated instruction-tuning dataset MetricInstruct.
arXiv Detail & Related papers (2023-10-01T18:01:51Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose the devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores (a toy version of this stress test is sketched after this list).
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings (see the correlation sketch after this list).
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation [85.03709740727867]
We propose an unsupervised reference-free metric called CTRLEval to evaluate controlled text generation models.
CTRLEval assembles the generation probabilities from a pre-trained language model without any model training (a simplified probability-based scoring sketch appears after this list).
Experimental results show that our metric has higher correlations with human judgments than other baselines.
arXiv Detail & Related papers (2022-04-02T13:42:49Z)
- GRUEN for Evaluating Linguistic Quality of Generated Text [17.234442722611803]
We propose GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text.
GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output.
arXiv Detail & Related papers (2020-10-06T05:59:25Z)
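The robustness check described in "On the Blind Spots of Model-Based Evaluation Metrics for Text Generation" can be approximated with a small perturbation harness: corrupt an otherwise good output and verify that the metric's score drops accordingly. The sketch below is a toy version under stated assumptions; the single word-deletion perturbation, the `min_drop` threshold, and the generic `score` callable are placeholders, not the paper's actual error taxonomy.

```python
# Toy robustness harness: inject a synthetic error and check that the metric's
# score drops by a sensible margin. `score` stands in for any metric under test.
import random
from typing import Callable, List, Tuple

def drop_random_word(text: str, rng: random.Random) -> str:
    """Simulate an omission error by deleting one word."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)

def blind_spot_rate(score: Callable[[str, str], float],
                    pairs: List[Tuple[str, str]],
                    min_drop: float = 0.05,
                    seed: int = 0) -> float:
    """Fraction of (source, good_output) pairs where the injected error does NOT
    lower the score by at least `min_drop` (higher = more blind spots)."""
    rng = random.Random(seed)
    misses = 0
    for source, good_output in pairs:
        corrupted = drop_random_word(good_output, rng)
        if score(source, good_output) - score(source, corrupted) < min_drop:
            misses += 1
    return misses / len(pairs)

# Usage with any metric exposing score(source, candidate) -> float:
# rate = blind_spot_rate(my_metric.score, dev_pairs)
```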
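Several entries above (SESCORE, and InstructScore itself) are judged by how well their segment-level scores correlate with human ratings. A minimal meta-evaluation of that kind is sketched below; the rating lists are placeholder data, and the choice among Kendall, Spearman, and Pearson correlation follows common practice rather than any single paper's exact protocol.

```python
# Meta-evaluation sketch: correlate metric scores with human ratings.
from scipy.stats import kendalltau, pearsonr, spearmanr

# Placeholder data: one human rating and one metric score per segment.
human  = [0.9, 0.4, 0.7, 0.1, 0.8]
metric = [0.85, 0.50, 0.65, 0.20, 0.90]

tau, _ = kendalltau(human, metric)   # rank agreement, robust to score scale
rho, _ = spearmanr(human, metric)    # rank correlation
r, _   = pearsonr(human, metric)     # linear correlation
print(f"Kendall tau={tau:.3f}  Spearman rho={rho:.3f}  Pearson r={r:.3f}")
```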
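CTRLEval's central idea, assembling generation probabilities from a frozen pre-trained language model with no metric training, can be illustrated with a much-simplified length-normalized log-likelihood score. This is not the paper's actual formulation (CTRLEval builds pattern-based text-infilling protocols for several evaluation aspects); GPT-2 is used here only as a convenient stand-in scorer.

```python
# Simplified reference-free scoring: average log-likelihood under a frozen LM.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def lm_score(text: str) -> float:
    """Length-normalized log-likelihood of `text` (higher = more fluent)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels equal to the inputs, the model returns the mean token cross-entropy.
    loss = model(ids, labels=ids).loss
    return -loss.item()

print(lm_score("The cat sat on the mat."))
print(lm_score("Mat the on sat cat the."))  # word salad should score lower
```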
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.