Text Style Transfer Evaluation Using Large Language Models
- URL: http://arxiv.org/abs/2308.13577v2
- Date: Sat, 23 Sep 2023 06:05:22 GMT
- Title: Text Style Transfer Evaluation Using Large Language Models
- Authors: Phil Ostheimer, Mayank Nagda, Marius Kloft, Sophie Fellenz
- Abstract summary: Large Language Models (LLMs) have shown their capacity to match and even exceed average human performance.
We compare the results of different LLMs in TST using multiple input prompts.
Our findings show a strong correlation between LLM judgments obtained via (even zero-shot) prompting and human evaluation, with LLMs often outperforming traditional automated metrics.
- Score: 24.64611983641699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating Text Style Transfer (TST) is a complex task due to its
multifaceted nature. The quality of the generated text is measured based on
challenging factors, such as style transfer accuracy, content preservation, and
overall fluency. While human evaluation is considered to be the gold standard
in TST assessment, it is costly and often hard to reproduce. Therefore,
automated metrics are prevalent in these domains. Nevertheless, it remains
unclear whether these automated metrics correlate with human evaluations.
Recent strides in Large Language Models (LLMs) have showcased their capacity to
match and even exceed average human performance across diverse, unseen tasks.
This suggests that LLMs could be a feasible alternative to human evaluation and
other automated metrics in TST evaluation. We compare the results of different
LLMs in TST using multiple input prompts. Our findings highlight a strong
correlation between (even zero-shot) prompting and human evaluation, showing
that LLMs often outperform traditional automated metrics. Furthermore, we
introduce the concept of prompt ensembling, demonstrating its ability to
enhance the robustness of TST evaluation. This research contributes to the
ongoing evaluation of LLMs in diverse tasks, offering insights into successful
outcomes and areas of limitation.
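The evaluation recipe described in the abstract, scoring a TST output by prompting an LLM directly and ensembling over several prompt wordings, can be illustrated with a short sketch. Everything concrete below (the prompt templates, the 1-5 rating scale, and the query_llm callable) is an illustrative assumption, not the paper's actual setup.

```python
import re
from statistics import mean
from typing import Callable, Optional

# Illustrative zero-shot prompt variants for one TST criterion (content preservation).
# The wordings and the 1-5 scale are assumptions for this sketch, not the paper's prompts.
PROMPT_TEMPLATES = [
    "On a scale from 1 to 5, how well does the rewritten sentence preserve the meaning "
    "of the original?\nOriginal: {src}\nRewritten: {hyp}\nAnswer with a single number.",
    "Rate the content preservation of the rewrite from 1 (meaning lost) to 5 "
    "(meaning fully preserved).\nOriginal: {src}\nRewrite: {hyp}\nScore:",
    "Output only an integer from 1 to 5 for how much of the source content the "
    "style-transferred sentence retains.\nSource: {src}\nTransferred: {hyp}",
]


def parse_score(reply: str) -> Optional[float]:
    """Pull the first digit in the 1-5 range out of the model's free-form reply."""
    match = re.search(r"[1-5]", reply)
    return float(match.group()) if match else None


def ensemble_score(src: str, hyp: str, query_llm: Callable[[str], str]) -> float:
    """Query the LLM once per prompt template and average the parsed ratings."""
    scores = []
    for template in PROMPT_TEMPLATES:
        reply = query_llm(template.format(src=src, hyp=hyp))
        score = parse_score(reply)
        if score is not None:
            scores.append(score)
    if not scores:
        raise ValueError("No prompt produced a parsable score.")
    return mean(scores)


if __name__ == "__main__":
    def dummy_llm(prompt: str) -> str:
        # Stand-in for a real LLM call (e.g. an API client); always answers "4".
        return "4"

    print(ensemble_score("the service was terrible.", "the service was wonderful.", dummy_llm))
```

Averaging the parsed ratings across several prompt phrasings is the prompt ensembling idea from the abstract: it reduces sensitivity to any single prompt wording, which the authors report makes the evaluation more robust.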
Related papers
- Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models [47.545382591646565]
Large Language Models (LLMs) have excelled at language understanding and generating human-level text.
LLMs are susceptible to adversarial attacks where malicious users prompt the model to generate undesirable text.
In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs.
arXiv Detail & Related papers (2024-08-07T17:11:34Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition [46.949604465227054]
We propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition.
MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs.
The pairwise comparison results are then aggregated into a global ranking using the Elo rating system (see the minimal Elo sketch after this list).
arXiv Detail & Related papers (2024-04-10T01:26:24Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks [22.629816738693254]
We show that common NLG metrics, like BLEU, are not suitable for evaluating Synthetic Traffic Generation (STG).
We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts.
arXiv Detail & Related papers (2023-11-21T11:26:26Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z) - Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z) - Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
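The MAD competition entry above aggregates pairwise comparison outcomes into a global ranking with the Elo rating system. The sketch below shows a standard Elo update over a list of pairwise results; the K-factor, the initial rating, and the toy match data are assumptions for illustration, not details from that paper.

```python
from collections import defaultdict

K = 32                   # assumed Elo K-factor (update step size), not taken from the paper
INITIAL_RATING = 1000.0  # assumed starting rating for every model


def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ranking(matches):
    """Aggregate pairwise outcomes into a ranking.

    matches: iterable of (model_a, model_b, score_a), where score_a is
    1 if model_a was preferred, 0 if model_b was preferred, 0.5 for a tie.
    """
    ratings = defaultdict(lambda: INITIAL_RATING)
    for a, b, score_a in matches:
        e_a = expected_score(ratings[a], ratings[b])
        ratings[a] += K * (score_a - e_a)
        ratings[b] += K * ((1.0 - score_a) - (1.0 - e_a))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy pairwise preferences between three hypothetical models.
    results = [
        ("model_a", "model_b", 1.0),
        ("model_b", "model_c", 0.5),
        ("model_a", "model_c", 1.0),
    ]
    for model, rating in elo_ranking(results):
        print(f"{model}: {rating:.1f}")
```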
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.