Automating Text Naturalness Evaluation of NLG Systems
- URL: http://arxiv.org/abs/2006.13268v1
- Date: Tue, 23 Jun 2020 18:48:33 GMT
- Title: Automating Text Naturalness Evaluation of NLG Systems
- Authors: Erion Çano and Ondřej Bojar
- Abstract summary: We present an attempt to automate the evaluation of text naturalness.
Instead of relying on human participants for scoring or labeling the text samples, we propose to automate the process.
We analyze the text probability fractions and observe how they are influenced by the size of the generative and discriminative models involved in the process.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic methods and metrics that assess various quality criteria of
automatically generated texts are important for developing NLG systems because
they produce repeatable results and allow for a fast development cycle. We
present here an attempt to automate the evaluation of text naturalness, which is
a very important characteristic of natural language generation methods. Instead
of relying on human participants for scoring or labeling the text samples, we
propose to automate the process by using a human likeliness metric that we define
and a discrimination procedure based on the probability distributions of large
pretrained language models. We analyze the text probability fractions and
observe how they are influenced by the size of the generative and
discriminative models involved in the process. Based on our results, bigger
generators and larger pretrained discriminators are more appropriate for a
better evaluation of text naturalness. A comprehensive validation procedure
with human participants is required as a follow-up to check how well this
automatic evaluation scheme correlates with human judgments.
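As a rough sketch of how such a discrimination procedure might look in practice, the snippet below scores a text sample by the fraction of its tokens that a pretrained language model ranks among its top-k predictions, and flags the sample as human-like when that fraction exceeds a threshold. The choice of GPT-2, the top-k criterion, and the 0.8 threshold are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative sketch only: one plausible way to turn a pretrained LM's
# probability distributions into a per-sample "human-like" judgment.
# Model choice, top-k criterion, and threshold are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top_k_fraction(text: str, k: int = 100) -> float:
    """Fraction of tokens that the LM ranks within its top-k predictions."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    if ids.size(1) < 2:
        return 0.0
    with torch.no_grad():
        logits = model(ids).logits
    preds = logits[0, :-1, :]   # prediction for token t comes from position t-1
    targets = ids[0, 1:]
    # Rank of each observed token = number of vocabulary items scored higher.
    ranks = (preds > preds.gather(1, targets.unsqueeze(1))).sum(dim=1)
    return (ranks < k).float().mean().item()

def looks_human(text: str, threshold: float = 0.8) -> bool:
    """Flag a sample as human-like if most of its tokens are unsurprising to the LM."""
    return top_k_fraction(text) >= threshold
```

Swapping the checkpoint name (e.g., "gpt2-large") changes the size of the discriminating model, which is the kind of generator- and discriminator-size comparison the abstract describes.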
Related papers
- Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks [22.629816738693254]
We show that common NLG metrics, like BLEU, are not suitable for evaluating Synthetic Traffic Generation (STG).
We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts.
arXiv Detail & Related papers (2023-11-21T11:26:26Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
- Correction of Errors in Preference Ratings from Automated Metrics for Text Generation [4.661309379738428]
We propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics.
We show that our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics.
arXiv Detail & Related papers (2023-06-06T17:09:29Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, like BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based framework that provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality [123.97136358092585]
We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation.
Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS relative to human recordings at the sentence level.
arXiv Detail & Related papers (2022-05-09T16:57:35Z)
- Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z)
- Naturalness Evaluation of Natural Language Generation in Task-oriented Dialogues using BERT [6.1478669848771546]
This paper presents an automatic method to evaluate the naturalness of natural language generation in dialogue systems.
By fine-tuning the BERT model, our proposed naturalness evaluation method shows robust results and outperforms the baselines.
arXiv Detail & Related papers (2021-09-07T08:40:14Z)
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
- How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics [47.20761880464552]
Generative dialogue modeling is widely seen as a language modeling task.
The task demands that an agent have a complex natural language understanding of its input text to carry out a meaningful interaction with a user.
The automatic metrics in use evaluate the quality of the generated text as a proxy for the agent's holistic interaction.
arXiv Detail & Related papers (2020-08-24T13:28:35Z)
- Human or Machine: Automating Human Likeliness Evaluation of NLG Texts [0.0]
We propose to use a human likeliness score that shows the percentage of the output samples from a method that look as if they were written by a human (a minimal sketch of such a score follows this list).
As a follow-up, we plan to perform an empirical analysis of human-written and machine-generated texts to find the optimal setup of this evaluation approach.
arXiv Detail & Related papers (2020-06-05T00:57:52Z)
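The last related paper above describes a human likeliness score as the percentage of a method's output samples that look human-written. A minimal sketch of that aggregation, assuming a per-sample predicate such as the hypothetical looks_human() shown after the abstract, could be:

```python
# Minimal sketch: aggregate human likeliness score as the percentage of
# output samples judged human-like. Assumes a per-sample predicate such as
# the hypothetical looks_human() sketched after the abstract above.
from typing import Callable, List

def human_likeliness_score(samples: List[str],
                           looks_human: Callable[[str], bool]) -> float:
    """Percentage of generated samples that pass the human-likeness check."""
    if not samples:
        return 0.0
    return 100.0 * sum(looks_human(s) for s in samples) / len(samples)

# Example usage: compare two NLG systems by the share of outputs judged human-like.
# score_a = human_likeliness_score(outputs_system_a, looks_human)
```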
This list is automatically generated from the titles and abstracts of the papers on this site.