Measuring Attribution in Natural Language Generation Models
- URL: http://arxiv.org/abs/2112.12870v1
- Date: Thu, 23 Dec 2021 22:33:20 GMT
- Title: Measuring Attribution in Natural Language Generation Models
- Authors: Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Michael Collins,
Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, David Reitter
- Abstract summary: We present a new evaluation framework entitled Attributable to Identified Sources (AIS) for assessing the output of natural language generation models.
We first define AIS and introduce a two-stage annotation pipeline for allowing annotators to appropriately evaluate model output.
- Score: 14.931889185122213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With recent improvements in natural language generation (NLG) models for
various applications, it has become imperative to have the means to identify
and evaluate whether NLG output is only sharing verifiable information about
the external world. In this work, we present a new evaluation framework
entitled Attributable to Identified Sources (AIS) for assessing the output of
natural language generation models, when such output pertains to the external
world. We first define AIS and introduce a two-stage annotation pipeline for
allowing annotators to appropriately evaluate model output according to AIS
guidelines. We empirically validate this approach on three generation datasets
(two in the conversational QA domain and one in summarization) via human
evaluation studies that suggest that AIS could serve as a common framework for
measuring whether model-generated statements are supported by underlying
sources. We release guidelines for the human evaluation studies.
Related papers
- Collective Constitutional AI: Aligning a Language Model with Public Input [20.95333081841239]
There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior.
We present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs.
We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input.
arXiv Detail & Related papers (2024-06-12T02:20:46Z)
- Lessons from the Trenches on Reproducible Evaluation of Language Models [60.522749986793094]
We draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers.
We present the Language Model Evaluation Harness (lm-eval), an open source library for independent, reproducible, and extensible evaluation of language models.
arXiv Detail & Related papers (2024-05-23T16:50:49Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Reranking for Natural Language Generation from Logical Forms: A Study based on Large Language Models [47.08364281023261]
Large language models (LLMs) have demonstrated impressive capabilities in natural language generation.
However, their output quality can be inconsistent, posing challenges for generating natural language from logical forms (LFs).
arXiv Detail & Related papers (2023-09-21T17:54:58Z)
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation [68.9440575276396]
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for
arXiv Detail & Related papers (2023-05-01T17:36:06Z)
- e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with NLEs.
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z)
- Learning to Compare for Better Training and Evaluation of Open Domain Natural Language Generation Models [23.62054164511058]
We propose to evaluate natural language generation models by learning to compare a pair of generated sentences by fine-tuning BERT.
While it can be trained in a fully self-supervised fashion, our model can be further fine-tuned with a small amount of human preference annotation.
arXiv Detail & Related papers (2020-02-12T15:52:21Z)
- Stochastic Natural Language Generation Using Dependency Information [0.7995360025953929]
This article presents a corpus-based model for generating natural language text.
Our model encodes dependency relations from training data through a feature set, then produces a new dependency tree for a given meaning representation.
We show that our model produces high-quality utterances in aspects of informativeness and naturalness as well as quality.
arXiv Detail & Related papers (2020-01-12T09:40:11Z)