Human or Machine: Automating Human Likeliness Evaluation of NLG Texts
- URL: http://arxiv.org/abs/2006.03189v1
- Date: Fri, 5 Jun 2020 00:57:52 GMT
- Title: Human or Machine: Automating Human Likeliness Evaluation of NLG Texts
- Authors: Erion Çano and Ondřej Bojar
- Abstract summary: We propose to use a human likeliness score that shows the percentage of the output samples from a method that look as if they were written by a human.
As a follow-up, we plan to perform an empirical analysis of human-written and machine-generated texts to find the optimal setup of this evaluation approach.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic evaluation of various text quality criteria produced by data-driven
intelligent methods is very common and useful because it is cheap, fast, and
usually yields repeatable results. In this paper, we present an attempt to
automate the human likeliness evaluation of the output text samples coming from
natural language generation methods used to solve several tasks. We propose to
use a human likeliness score that shows the percentage of the output samples
from a method that look as if they were written by a human. Instead of having
human participants label or rate those samples, we completely automate the
process by using a discrimination procedure based on large pretrained language
models and their probability distributions. As a follow-up, we plan to perform an
empirical analysis of human-written and machine-generated texts to find the
optimal setup of this evaluation approach. A validation procedure involving
human participants will also check how the automatic evaluation correlates with
human judgments.
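
The abstract describes the discrimination procedure only at a high level. As an illustration, here is a minimal sketch of one way such a score could be computed: each generated sample is accepted as "human-like" if its perplexity under a large pretrained language model stays below a threshold, and the score is the percentage of accepted samples. The choice of GPT-2 and the threshold value are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (not the authors' exact procedure): accept a sample as
# "human-like" when its perplexity under a pretrained LM is below a threshold,
# and report the percentage of accepted samples as the human likeliness score.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the pretrained language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
    return torch.exp(loss).item()

def human_likeliness_score(samples, threshold=40.0):
    """Percentage of samples accepted as human-like by the perplexity test.

    The threshold is a hypothetical value; in practice it would need to be
    calibrated on held-out human-written texts, which is the kind of empirical
    analysis the abstract plans as follow-up work.
    """
    accepted = sum(perplexity(s) <= threshold for s in samples)
    return 100.0 * accepted / len(samples)

outputs = [
    "The committee approved the new budget on Friday.",
    "budget the approved Friday committee on new the .",
]
print(f"Human likeliness score: {human_likeliness_score(outputs):.1f}%")
```

A per-sample accept/reject decision like this matches the definition of the score as a percentage of output samples that pass as human-written, as opposed to a single corpus-level quality number.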
Related papers
- Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities [48.922660354417204]
We propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement.
In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions.
arXiv Detail & Related papers (2024-03-17T07:34:12Z)
- AutoEval Done Right: Using Synthetic Data for Model Evaluation [79.01454261157525]
We suggest efficient and statistically principled algorithms for this purpose.
These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.
arXiv Detail & Related papers (2024-03-09T02:47:11Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- Creating user stereotypes for persona development from qualitative data through semi-automatic subspace clustering [0.0]
We propose a method that employs the modelling of user stereotypes to automate part of the persona creation process.
Results show that manual techniques differ between human persona designers, leading to different results.
The proposed algorithm provides similar results based on parameter input, but is more rigorous and finds optimal clusters.
arXiv Detail & Related papers (2023-06-26T09:49:51Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- Correction of Errors in Preference Ratings from Automated Metrics for Text Generation [4.661309379738428]
We propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics.
We show that our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics.
arXiv Detail & Related papers (2023-06-06T17:09:29Z)
- Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation [35.8129864412223]
This paper proposes a standardized and well-defined human evaluation protocol.
We experimentally show that the current automatic measures are incompatible with human perception.
We provide insights for designing human evaluation experiments reliably and conclusively.
arXiv Detail & Related papers (2023-04-04T14:14:16Z)
- Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z)
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
- How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics [47.20761880464552]
Generative dialogue modeling is widely seen as a language modeling task.
The task demands that an agent have a complex natural language understanding of its input text to carry out a meaningful interaction with a user.
The automatic metrics used evaluate the quality of the generated text as a proxy for the agent's holistic interaction.
arXiv Detail & Related papers (2020-08-24T13:28:35Z)
- Automating Text Naturalness Evaluation of NLG Systems [0.0]
We present an attempt to automate the evaluation of text naturalness.
Instead of relying on human participants for scoring or labeling the text samples, we propose to automate the process.
We analyze the text probability fractions and observe how they are influenced by the size of the generative and discriminative models involved in the process; a sketch of one such per-token fraction appears after this list.
arXiv Detail & Related papers (2020-06-23T18:48:33Z)
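
The companion summary above mentions "text probability fractions" without defining them. A related, GLTR-style quantity is the fraction of a text's tokens that a pretrained language model ranks among its top-k next-token predictions. The sketch below computes that fraction; the interpretation as a top-k rank statistic, the model ("gpt2"), and k are assumptions rather than details from the paper.

```python
# Hypothetical sketch of a per-token "probability fraction": the share of
# tokens that the pretrained LM ranks inside its top-k next-token predictions.
# The top-k interpretation, the model, and k are assumptions for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top_k_fraction(text: str, k: int = 50) -> float:
    """Fraction of tokens the LM places in its top-k predictions for that position."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                  # shape (1, seq_len, vocab)
    preds = logits[0, :-1].topk(k, dim=-1).indices  # predictions for tokens 1..n-1
    targets = ids[0, 1:].unsqueeze(-1)              # actual tokens 1..n-1
    return (preds == targets).any(dim=-1).float().mean().item()

print(top_k_fraction("The weather in Prague was unusually warm this spring."))
```

Machine-generated text tends to stay inside the model's high-probability region, so such fractions tend to be higher for generated samples than for human-written ones, which is what makes per-token statistics of this kind usable as a discrimination signal.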