Text Style Transfer Evaluation Using Large Language Models
- URL: http://arxiv.org/abs/2308.13577v2
- Date: Sat, 23 Sep 2023 06:05:22 GMT
- Title: Text Style Transfer Evaluation Using Large Language Models
- Authors: Phil Ostheimer, Mayank Nagda, Marius Kloft, Sophie Fellenz
- Abstract summary: Large Language Models (LLMs) have shown their capacity to match and even exceed average human performance.
We compare the results of different LLMs in TST using multiple input prompts.
Our findings highlight a strong correlation between (even zero-shot) prompting and human evaluation, showing that LLMs often outperform traditional automated metrics.
- Score: 24.64611983641699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating Text Style Transfer (TST) is a complex task due to its
multifaceted nature. The quality of the generated text is measured based on
challenging factors, such as style transfer accuracy, content preservation, and
overall fluency. While human evaluation is considered to be the gold standard
in TST assessment, it is costly and often hard to reproduce. Therefore,
automated metrics are prevalent in these domains. Nevertheless, it remains
unclear whether these automated metrics correlate with human evaluations.
Recent strides in Large Language Models (LLMs) have showcased their capacity to
match and even exceed average human performance across diverse, unseen tasks.
This suggests that LLMs could be a feasible alternative to human evaluation and
other automated metrics in TST evaluation. We compare the results of different
LLMs in TST using multiple input prompts. Our findings highlight a strong
correlation between (even zero-shot) prompting and human evaluation, showing
that LLMs often outperform traditional automated metrics. Furthermore, we
introduce the concept of prompt ensembling, demonstrating its ability to
enhance the robustness of TST evaluation. This research contributes to the
ongoing evaluation of LLMs in diverse tasks, offering insights into successful
outcomes and areas of limitation.
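As a rough illustration of the prompt-ensembling idea described in the abstract, the sketch below rates a style-transferred sentence with several prompt phrasings and averages the numeric ratings. This is a minimal sketch, not the authors' code: the prompt wordings, the 1-5 rating scale, the mean aggregation, and the query_llm callable are illustrative assumptions.

# Minimal sketch (not the authors' implementation) of zero-shot LLM rating
# of a style-transferred sentence with several prompt variants, ensembled
# by averaging. `query_llm` is a hypothetical stand-in for any LLM API.
from statistics import mean
from typing import Callable

PROMPTS = [
    "On a scale of 1-5, how well does the rewrite match the target style?\n"
    "Source: {src}\nRewrite: {tgt}\nAnswer with a single number.",
    "Rate the style transfer quality of the rewrite from 1 (poor) to 5 (perfect).\n"
    "Source: {src}\nRewrite: {tgt}\nNumber only.",
    "Given the source and its rewrite, score style transfer accuracy from 1 to 5.\n"
    "Source: {src}\nRewrite: {tgt}\nReply with one digit.",
]

def ensemble_score(src: str, tgt: str, query_llm: Callable[[str], str]) -> float:
    """Average the numeric ratings returned for each prompt variant."""
    scores = []
    for template in PROMPTS:
        reply = query_llm(template.format(src=src, tgt=tgt))
        digits = [c for c in reply if c.isdigit()]
        if digits:  # keep only prompts that yield a parsable rating
            scores.append(int(digits[0]))
    return mean(scores) if scores else float("nan")

if __name__ == "__main__":
    # Toy stand-in "LLM" that always answers "4", just to make the sketch runnable.
    print(ensemble_score("the food was terrible",
                         "the food was delightful",
                         lambda prompt: "4"))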
Related papers
- Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics? [9.234136424254261]
Text Style Transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content.
Human evaluation is ideal but costly, as in other natural language processing (NLP) tasks.
In this paper, we examine both existing metrics and novel metrics from broader NLP tasks for TST evaluation.
arXiv Detail & Related papers (2025-02-07T07:39:17Z)
- Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks.
We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition [46.949604465227054]
We propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition.
MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs.
The pairwise comparison results are then aggregated into a global ranking using the Elo rating system (see the sketch after this list).
arXiv Detail & Related papers (2024-04-10T01:26:24Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not yet ready to replace human evaluators.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
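The Maximum Discrepancy competition entry above aggregates pairwise comparison results into a global ranking with the Elo rating system. The sketch below illustrates standard Elo updates over a handful of hypothetical win/draw/loss outcomes; it is not the paper's implementation, and the K factor of 32, the initial rating of 1500, and the model names are illustrative assumptions.

# Illustrative sketch (not the paper's code) of turning pairwise win/draw/loss
# outcomes between models into a global ranking via Elo rating updates.
from collections import defaultdict

K = 32  # standard Elo update factor; an assumption, not taken from the paper

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_rank(comparisons, initial=1500.0):
    """comparisons: iterable of (model_a, model_b, score_a), score_a in {1, 0.5, 0}."""
    ratings = defaultdict(lambda: initial)
    for a, b, score_a in comparisons:
        ea = expected(ratings[a], ratings[b])
        ratings[a] += K * (score_a - ea)
        ratings[b] += K * ((1 - score_a) - (1 - ea))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical pairwise outcomes, purely for illustration.
    games = [("model_a", "model_b", 1), ("model_b", "model_c", 0.5), ("model_a", "model_c", 1)]
    for model, rating in elo_rank(games):
        print(f"{model}: {rating:.1f}")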