Estimating Subjective Crowd-Evaluations as an Additional Objective to
Improve Natural Language Generation
- URL: http://arxiv.org/abs/2104.05224v1
- Date: Mon, 12 Apr 2021 06:33:16 GMT
- Title: Estimating Subjective Crowd-Evaluations as an Additional Objective to
Improve Natural Language Generation
- Authors: Jakob Nyberg, Ramesh Manuvinakurike, Maike Paetzel-Pr\"usmann
- Abstract summary: We use a crowd-authored dialogue corpus to fine-tune six different language generation models.
Two of these models incorporate multi-task learning and use subjective ratings of lines as part of an explicit learning goal.
A human evaluation of the generated dialogue lines reveals that utterances generated by the multi-tasking models were subjectively rated as the most typical, most moving the conversation forward, and least offensive.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human ratings are one of the most prevalent methods to evaluate the
performance of natural language processing algorithms. Similarly, it is common
to measure the quality of sentences generated by a natural language generation
model using human raters. In this paper, we argue for exploring the use of
subjective evaluations within the process of training language generation
models in a multi-task learning setting. As a case study, we use a
crowd-authored dialogue corpus to fine-tune six different language generation
models. Two of these models incorporate multi-task learning and use subjective
ratings of lines as part of an explicit learning goal. A human evaluation of
the generated dialogue lines reveals that utterances generated by the
multi-tasking models were subjectively rated as the most typical, most moving
the conversation forward, and least offensive. Based on these promising first
results, we discuss future research directions for incorporating subjective
human evaluations into language model training and to hence keep the human user
in the loop during the development process.
Related papers
- HLB: Benchmarking LLMs' Humanlikeness in Language Use [2.438748974410787]
We present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs)
We collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments.
Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels.
arXiv Detail & Related papers (2024-09-24T09:02:28Z) - DiverseDialogue: A Methodology for Designing Chatbots with Human-Like Diversity [5.388338680646657]
We show that GPT-4o mini, when used as simulated human participants, systematically differ from those between actual humans across multiple linguistic features.
We propose an approach that automatically generates prompts for user simulations by incorporating features derived from real human interactions.
Our method of prompt optimization, tailored to target specific linguistic features, shows significant improvements.
arXiv Detail & Related papers (2024-08-30T21:33:58Z) - DevBench: A multimodal developmental benchmark for language learning [0.34129029452670606]
We introduce DevBench, a benchmark for evaluating vision-language models on tasks and behavioral data.
We show that DevBench provides a benchmark for comparing models to human language development.
These comparisons highlight ways in which model and human language learning processes diverge.
arXiv Detail & Related papers (2024-06-14T17:49:41Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural
Language Generation [68.9440575276396]
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for
arXiv Detail & Related papers (2023-05-01T17:36:06Z) - Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z) - Training Language Models with Natural Language Feedback [51.36137482891037]
We learn from language feedback on model outputs using a three-step learning algorithm.
In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements.
Using only 100 samples of human-written feedback, our learning algorithm finetunes a GPT-3 model to roughly human-level summarization.
arXiv Detail & Related papers (2022-04-29T15:06:58Z) - A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect term, category, and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) on average performance by a large margins in few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z) - Knowledge-Grounded Dialogue Generation with Pre-trained Language Models [74.09352261943911]
We study knowledge-grounded dialogue generation with pre-trained language models.
We propose equipping response generation defined by a pre-trained language model with a knowledge selection module.
arXiv Detail & Related papers (2020-10-17T16:49:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.