TrueTeacher: Learning Factual Consistency Evaluation with Large Language
Models
- URL: http://arxiv.org/abs/2305.11171v3
- Date: Wed, 18 Oct 2023 19:16:18 GMT
- Title: TrueTeacher: Learning Factual Consistency Evaluation with Large Language
Models
- Authors: Zorik Gekhman and Jonathan Herzig and Roee Aharoni and Chen Elkind and
Idan Szpektor
- Abstract summary: We introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries.
Unlike prior work, TrueTeacher does not rely on human-written summaries, and is multilingual by nature.
- Score: 20.09470051458651
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Factual consistency evaluation is often conducted using Natural Language
Inference (NLI) models, yet these models exhibit limited success in evaluating
summaries. Previous work improved such models with synthetic training data.
However, the data is typically based on perturbed human-written summaries,
which often differ in their characteristics from real model-generated summaries
and have limited coverage of possible factual errors. Alternatively, large
language models (LLMs) have recently shown promising results in directly
evaluating generative tasks, but are too computationally expensive for
practical use. Motivated by these limitations, we introduce TrueTeacher, a
method for generating synthetic data by annotating diverse model-generated
summaries using an LLM. Unlike prior work, TrueTeacher does not rely on
human-written summaries, and is multilingual by nature. Experiments on the TRUE
benchmark show that a student model trained using our data substantially
outperforms both the state-of-the-art model with similar capacity and the LLM
teacher. In a systematic study, we compare TrueTeacher to existing synthetic
data generation methods and demonstrate its superiority and robustness to
domain-shift. We also show that our method generalizes to multilingual
scenarios. Lastly, we release our large-scale synthetic dataset (1.4M
examples), generated using TrueTeacher, and a checkpoint trained on this data.
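The data-generation loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Example`, `build_synthetic_dataset`, `first_sentence`, `unrelated_claim`, and `toy_teacher` are all hypothetical, and the summarizers and the LLM teacher are replaced by toy functions so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:
    document: str
    summary: str
    label: int  # 1 = factually consistent, 0 = inconsistent


def build_synthetic_dataset(
    documents: Iterable[str],
    summarizers: List[Callable[[str], str]],
    teacher: Callable[[str, str], int],
) -> List[Example]:
    """Summarize each document with several models, then let a teacher label each pair."""
    examples = []
    for doc in documents:
        for summarize in summarizers:
            summary = summarize(doc)       # model-generated, not human-written
            label = teacher(doc, summary)  # stand-in for the LLM annotation step
            examples.append(Example(doc, summary, label))
    return examples


# Toy stand-ins (assumptions for illustration only).
def first_sentence(doc: str) -> str:
    return doc.split(". ")[0].rstrip(".") + "."


def unrelated_claim(doc: str) -> str:
    return "Nothing happened."


def toy_teacher(doc: str, summary: str) -> int:
    # Crude substring check in place of a real LLM consistency judgment.
    return int(summary.rstrip(".") in doc)


docs = ["The cat sat on the mat. It then fell asleep."]
data = build_synthetic_dataset(docs, [first_sentence, unrelated_claim], toy_teacher)
for ex in data:
    print(ex.label, ex.summary)
```

The labeled pairs would then serve as training data for a smaller student classifier; that training step is omitted here.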
Related papers
- Relation-based Counterfactual Data Augmentation and Contrastive Learning for Robustifying Natural Language Inference Models [0.0]
We propose a method in which we use token-based and sentence-based augmentation methods to generate counterfactual sentence pairs.
We show that the proposed method can improve the performance and robustness of the NLI model.
(2024-10-28T03:43:25Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
(2024-10-22T06:43:28Z)
- Adapting Large Language Models for Content Moderation: Pitfalls in Data Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains.
In this paper, we show how to fine-tune an LLM that can be privately deployed for content moderation.
(2023-10-05T09:09:44Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
(2022-12-20T19:52:41Z)
- Correcting Diverse Factual Errors in Abstractive Summarization via Post-Editing and Language Model Infilling [56.70682379371534]
We show that our approach vastly outperforms prior methods in correcting erroneous summaries.
Our model -- FactEdit -- improves factuality scores by over 11 points on CNN/DM and over 31 points on XSum.
(2022-10-22T07:16:19Z)
- Falsesum: Generating Document-level NLI Examples for Recognizing Factual Inconsistency in Summarization [63.21819285337555]
We show that NLI models can be effective for this task when the training data is augmented with high-quality task-oriented examples.
We introduce Falsesum, a data generation pipeline leveraging a controllable text generation model to perturb human-annotated summaries.
We show that models trained on a Falsesum-augmented NLI dataset improve the state-of-the-art performance across four benchmarks for detecting factual inconsistency in summarization.
(2022-05-12T10:43:42Z)
- Evaluation of HTR models without Ground Truth Material [2.4792948967354236]
The evaluation of Handwritten Text Recognition models during their development is straightforward.
But the evaluation process becomes tricky as soon as we switch from development to application.
We show that lexicon-based evaluation can compete with lexicon-based methods.
(2022-01-17T01:26:09Z)
- On the Evaluation of Commit Message Generation Models: An Experimental Study [33.19314967188712]
Commit messages are natural language descriptions of code changes, which are important for program understanding and maintenance.
Various approaches utilizing generation or retrieval techniques have been proposed to automatically generate commit messages.
This paper conducts a systematic and in-depth analysis of the state-of-the-art models and datasets.
(2021-07-12T12:38:02Z)
- Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), which jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data.
Based on experimental results, neural semantic parsers that leverage GAP MODEL obtain new state-of-the-art results on both SPIDER and CRITERIA-TO-SQL benchmarks.
(2020-12-18T15:53:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.