Learning to Compare for Better Training and Evaluation of Open Domain
Natural Language Generation Models
- URL: http://arxiv.org/abs/2002.05058v1
- Date: Wed, 12 Feb 2020 15:52:21 GMT
- Title: Learning to Compare for Better Training and Evaluation of Open Domain
Natural Language Generation Models
- Authors: Wangchunshu Zhou and Ke Xu
- Abstract summary: We propose to evaluate natural language generation models by learning to compare pairs of generated sentences with a fine-tuned BERT model.
While it can be trained in a fully self-supervised fashion, the model can be further fine-tuned with a small amount of human preference annotation.
- Score: 23.62054164511058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated evaluation of open domain natural language generation (NLG) models
remains a challenge and widely used metrics such as BLEU and Perplexity can be
misleading in some cases. In our paper, we propose to evaluate natural language
generation models by learning to compare a pair of generated sentences by
fine-tuning BERT, which has been shown to have good natural language
understanding ability. We also propose to evaluate the model-level quality of
NLG models by aggregating sample-level comparison results with a skill rating
system. While it can be trained in a fully self-supervised fashion, our model
can be further fine-tuned with a small amount of human preference annotation to better
imitate human judgment. In addition to evaluating trained models, we propose to
apply our model as a performance indicator during training for better
hyperparameter tuning and early-stopping. We evaluate our approach on both
story generation and chit-chat dialogue response generation. Experimental
results show that our model correlates better with human preference compared
with previous automated evaluation approaches. Training with the proposed
metric yields better performance in human evaluation, which further
demonstrates the effectiveness of the proposed model.
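To make the pipeline described in the abstract concrete, the sketch below shows, under stated assumptions, how a BERT-based pairwise comparator and a skill rating loop could fit together. This is not the authors' released code: the bert-base-uncased checkpoint, the 3-way label convention, the Elo update (one common skill rating scheme), and the example sentences are all illustrative choices.

```python
# Minimal sketch, assuming the Hugging Face transformers API: (1) a BERT-based
# comparator that judges which of two generated sentences is better, and
# (2) an Elo-style skill rating that aggregates sample-level comparisons into
# a model-level score. Model name, label convention, and K-factor are
# illustrative assumptions, not details taken from the paper.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Assumed label convention: 0 = first sentence better, 1 = tie, 2 = second better.
# In practice this head would first be trained (self-supervised, then optionally
# on human preference annotations) before its predictions are meaningful.
comparator = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)
comparator.eval()


def compare(sent_a: str, sent_b: str) -> int:
    """Predict a comparison label for a pair of generated sentences."""
    inputs = tokenizer(sent_a, sent_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = comparator(**inputs).logits
    return int(logits.argmax(dim=-1))


def elo_update(r_a: float, r_b: float, outcome: float, k: float = 16.0):
    """One Elo update; outcome is 1.0 if A's sample won, 0.5 for a tie, 0.0 if it lost."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    return r_a + k * (outcome - expected_a), r_b - k * (outcome - expected_a)


# Model-level evaluation: treat each sample pair as a "match" between two NLG
# models and read off the final ratings. The sample texts are made up.
samples_a = ["Once upon a time, a fox befriended a lonely crow."]
samples_b = ["Once upon a time a fox a crow the the end."]
ratings = {"model_A": 1000.0, "model_B": 1000.0}
for sent_a, sent_b in zip(samples_a, samples_b):
    outcome = {0: 1.0, 1: 0.5, 2: 0.0}[compare(sent_a, sent_b)]
    ratings["model_A"], ratings["model_B"] = elo_update(
        ratings["model_A"], ratings["model_B"], outcome
    )
print(ratings)
```

The same comparison scores can also serve as the training-time performance indicator mentioned in the abstract, for example by tracking the rating of intermediate checkpoints for early stopping.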
Related papers
- Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z) - A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation [14.064465097974836]
This paper proposes a novel approach to evaluate Counter Narrative (CN) generation using a Large Language Model (LLM) as an evaluator.
We show that traditional automatic metrics correlate poorly with human judgements and fail to capture the nuanced relationship between generated CNs and human perception.
arXiv Detail & Related papers (2024-06-21T15:11:33Z) - Is Crowdsourcing Breaking Your Bank? Cost-Effective Fine-Tuning of
Pre-trained Language Models with Proximal Policy Optimization [18.75866961339424]
ChatGPT has highlighted the potential of reinforcement learning from human feedback.
To reduce labor costs, we propose a self-supervised text ranking approach.
arXiv Detail & Related papers (2024-02-28T12:24:07Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the performance of the Llama 2 model by up to 15% points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Automatic Evaluation of Generative Models with Instruction Tuning [14.369719297698694]
A recent paradigm fine-tunes pre-trained language models to emulate human judgements for a particular task and evaluation criterion.
Inspired by the generalization ability of instruction-tuned models, we propose a learned metric based on instruction tuning.
arXiv Detail & Related papers (2023-10-30T23:00:52Z) - Learning Evaluation Models from Large Language Models for Sequence
Generation [44.22820310679188]
Large language models achieve state-of-the-art performance on sequence generation evaluation, but typically have a large number of parameters.
We propose ECT, an evaluation capability transfer method, to transfer the evaluation capability from LLMs to relatively lightweight language models.
Based on the proposed ECT, we learn various evaluation models from ChatGPT, and employ them as reward models to improve sequence generation models.
arXiv Detail & Related papers (2023-08-08T16:41:16Z) - How Far Can Camels Go? Exploring the State of Instruction Tuning on Open
Resources [117.6496550359768]
This work explores recent advances in instruction-tuning language models on a range of open instruction-following datasets.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets.
We evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities.
arXiv Detail & Related papers (2023-06-07T19:59:23Z) - Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z) - Explain, Edit, and Understand: Rethinking User Study Design for
Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z) - Knowledge-Grounded Dialogue Generation with Pre-trained Language Models [74.09352261943911]
We study knowledge-grounded dialogue generation with pre-trained language models.
We propose equipping response generation defined by a pre-trained language model with a knowledge selection module.
arXiv Detail & Related papers (2020-10-17T16:49:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.