Human or Machine: Automating Human Likeliness Evaluation of NLG Texts
- URL: http://arxiv.org/abs/2006.03189v1
- Date: Fri, 5 Jun 2020 00:57:52 GMT
- Title: Human or Machine: Automating Human Likeliness Evaluation of NLG Texts
- Authors: Erion Çano and Ondřej Bojar
- Abstract summary: We propose to use a human likeliness score that shows the percentage of the output samples from a method that look as if they were written by a human.
As a follow-up, we plan to perform an empirical analysis of human-written and machine-generated texts to find the optimal setup of this evaluation approach.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic evaluation of various text quality criteria produced by data-driven
intelligent methods is very common and useful because it is cheap, fast, and
usually yields repeatable results. In this paper, we present an attempt to
automate the human likeliness evaluation of the output text samples coming from
natural language generation methods used to solve several tasks. We propose to
use a human likeliness score that shows the percentage of the output samples
from a method that look as if they were written by a human. Instead of having
human participants label or rate those samples, we completely automate the
process by using a discrimination procedure based on large pretrained language
models and their probability distributions. As a follow-up, we plan to perform an
empirical analysis of human-written and machine-generated texts to find the
optimal setup of this evaluation approach. A validation procedure involving
human participants will also check how the automatic evaluation correlates with
human judgments.
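
The abstract describes the discrimination procedure only at a high level. As an illustration, here is a minimal sketch of one way such a score could be computed: each generated sample is accepted as "human-like" if its perplexity under a large pretrained language model stays below a threshold, and the score is the percentage of accepted samples. The choice of GPT-2 and the threshold value are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (not the authors' exact procedure): accept a sample as
# "human-like" when its perplexity under a pretrained LM is below a threshold,
# and report the percentage of accepted samples as the human likeliness score.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the pretrained language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
    return torch.exp(loss).item()

def human_likeliness_score(samples, threshold=40.0):
    """Percentage of samples accepted as human-like by the perplexity test.

    The threshold is a hypothetical value; in practice it would need to be
    calibrated on held-out human-written texts, which is the kind of empirical
    analysis the abstract plans as follow-up work.
    """
    accepted = sum(perplexity(s) <= threshold for s in samples)
    return 100.0 * accepted / len(samples)

outputs = [
    "The committee approved the new budget on Friday.",
    "budget the approved Friday committee on new the .",
]
print(f"Human likeliness score: {human_likeliness_score(outputs):.1f}%")
```

A per-sample accept/reject decision like this matches the definition of the score as a percentage of output samples that pass as human-written, as opposed to a single corpus-level quality number.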
Related papers
- Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities [48.922660354417204]
We propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement.
In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions.
arXiv Detail & Related papers (2024-03-17T07:34:12Z)
- AutoEval Done Right: Using Synthetic Data for Model Evaluation [79.01454261157525]
We suggest efficient and statistically principled algorithms for this purpose.
These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.
arXiv Detail & Related papers (2024-03-09T02:47:11Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- Creating user stereotypes for persona development from qualitative data through semi-automatic subspace clustering [0.0]
We propose a method that employs the modelling of user stereotypes to automate part of the persona creation process.
Results show that manual techniques differ between human persona designers, leading to different results.
The proposed algorithm provides similar results based on parameter input, but is more rigorous and finds optimal clusters.
arXiv Detail & Related papers (2023-06-26T09:49:51Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- Correction of Errors in Preference Ratings from Automated Metrics for Text Generation [4.661309379738428]
We propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics.
We show that our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics.
arXiv Detail & Related papers (2023-06-06T17:09:29Z)
- Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation [35.8129864412223]
This paper proposes a standardized and well-defined human evaluation protocol.
We experimentally show that the current automatic measures are incompatible with human perception.
We provide insights for designing human evaluation experiments reliably and conclusively.
arXiv Detail & Related papers (2023-04-04T14:14:16Z)
- Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z)
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
- How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics [47.20761880464552]
Generative dialogue modeling is widely seen as a language modeling task.
The task demands that an agent have a complex natural language understanding of its input text to carry out a meaningful interaction with a user.
The automatic metrics used evaluate the quality of the generated text as a proxy for the agent's holistic interaction.
arXiv Detail & Related papers (2020-08-24T13:28:35Z)
- Automating Text Naturalness Evaluation of NLG Systems [0.0]
We present an attempt to automate the evaluation of text naturalness.
Instead of relying on human participants for scoring or labeling the text samples, we propose to automate the process.
We analyze the text probability fractions and observe how they are influenced by the size of the generative and discriminative models involved in the process; a sketch of one such per-token fraction appears after this list.
arXiv Detail & Related papers (2020-06-23T18:48:33Z)
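
The companion summary above mentions "text probability fractions" without defining them. A related, GLTR-style quantity is the fraction of a text's tokens that a pretrained language model ranks among its top-k next-token predictions. The sketch below computes that fraction; the interpretation as a top-k rank statistic, the model ("gpt2"), and k are assumptions rather than details from the paper.

```python
# Hypothetical sketch of a per-token "probability fraction": the share of
# tokens that the pretrained LM ranks inside its top-k next-token predictions.
# The top-k interpretation, the model, and k are assumptions for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top_k_fraction(text: str, k: int = 50) -> float:
    """Fraction of tokens the LM places in its top-k predictions for that position."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                  # shape (1, seq_len, vocab)
    preds = logits[0, :-1].topk(k, dim=-1).indices  # predictions for tokens 1..n-1
    targets = ids[0, 1:].unsqueeze(-1)              # actual tokens 1..n-1
    return (preds == targets).any(dim=-1).float().mean().item()

print(top_k_fraction("The weather in Prague was unusually warm this spring."))
```

Machine-generated text tends to stay inside the model's high-probability region, so such fractions tend to be higher for generated samples than for human-written ones, which is what makes per-token statistics of this kind usable as a discrimination signal.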