Multidimensional Evaluation for Text Style Transfer Using ChatGPT
- URL: http://arxiv.org/abs/2304.13462v1
- Date: Wed, 26 Apr 2023 11:33:35 GMT
- Title: Multidimensional Evaluation for Text Style Transfer Using ChatGPT
- Authors: Huiyuan Lai, Antonio Toral, Malvina Nissim
- Abstract summary: We investigate the potential of ChatGPT as a multidimensional evaluator for the task of emphText Style Transfer
We test its performance on three commonly-used dimensions of text style transfer evaluation: style strength, content preservation, and fluency.
These preliminary results are expected to provide a first glimpse into the role of large language models in the multidimensional evaluation of stylized text generation.
- Score: 14.799109368073548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the potential of ChatGPT as a multidimensional evaluator for
the task of \emph{Text Style Transfer}, alongside, and in comparison to,
existing automatic metrics as well as human judgements. We focus on a zero-shot
setting, i.e. prompting ChatGPT with specific task instructions, and test its
performance on three commonly-used dimensions of text style transfer
evaluation: style strength, content preservation, and fluency. We perform a
comprehensive correlation analysis for two transfer directions (and overall) at
different levels. Compared to existing automatic metrics, ChatGPT achieves
competitive correlations with human judgments. These preliminary results are
expected to provide a first glimpse into the role of large language models in
the multidimensional evaluation of stylized text generation.
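As a concrete illustration of the zero-shot setting described above, the sketch below prompts ChatGPT once per evaluation dimension and correlates its scores with human ratings. The prompt wording, model name, and 1-5 scale are assumptions for illustration, not the authors' exact protocol.

```python
# Minimal sketch of zero-shot multidimensional evaluation with ChatGPT.
# Prompts, model name, and the 1-5 scale are illustrative assumptions,
# not the exact protocol used in the paper.
import re
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPTS = {
    "style":   "Rate from 1 (weak) to 5 (strong) how {style} this sentence is:\n{output}\nScore:",
    "content": "Rate from 1 to 5 how well the rewrite preserves the meaning of the source.\nSource: {source}\nRewrite: {output}\nScore:",
    "fluency": "Rate from 1 (disfluent) to 5 (fluent) the fluency of:\n{output}\nScore:",
}

def chatgpt_score(dimension, source, output, style="formal"):
    """Ask ChatGPT for one numeric score on a single evaluation dimension."""
    prompt = PROMPTS[dimension].format(source=source, output=output, style=style)
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    match = re.search(r"[1-5]", reply)  # extract the first in-range digit
    return int(match.group()) if match else None

def correlate(chatgpt_scores, human_scores):
    """Sentence-level correlation with human judgments, as in the analysis."""
    rho, p = spearmanr(chatgpt_scores, human_scores)
    return rho, p
```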
Related papers
- Unlocking Structure Measuring: Introducing PDD, an Automatic Metric for Positional Discourse Coherence [39.065349875944634]
We present a novel metric designed to quantify the discourse divergence between two long-form articles.
Our metric aligns more closely with human preferences and GPT-4 coherence evaluation, outperforming existing evaluation methods.
arXiv Detail & Related papers (2024-02-15T18:23:39Z)
- Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks [22.629816738693254]
We show that common NLG metrics, like BLEU, are not suitable for evaluating Synthetic Traffic Generation (STG).
We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts (an illustrative distribution-level sketch follows this entry).
arXiv Detail & Related papers (2023-11-21T11:26:26Z)
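As flagged in the entry above, a distribution-level comparison can be illustrated with something as simple as the Jensen-Shannon divergence between unigram distributions. This is a generic stand-in, not necessarily one of the metrics the paper proposes.

```python
# Illustrative distribution-level comparison of generated vs. real user texts:
# Jensen-Shannon divergence over unigram frequencies.
from collections import Counter
import math

def unigram_dist(texts):
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two unigram distributions."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0) + q.get(w, 0)) for w in vocab}
    def kl(a):
        return sum(a.get(w, 0) * math.log2(a.get(w, 0) / m[w])
                   for w in vocab if a.get(w, 0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

real = unigram_dist(["where is my order", "cancel my subscription"])
generated = unigram_dist(["track my order please", "cancel the subscription"])
print(js_divergence(real, generated))  # 0 = identical distributions, 1 = disjoint
```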
- Specializing Small Language Models towards Complex Style Transfer via Latent Attribute Pre-Training [29.143887057933327]
We introduce the concept of complex text style transfer tasks and construct complex text datasets based on two widely applicable scenarios.
Our dataset is the first large-scale dataset of its kind, with 700 rephrased sentences and 1,000 sentences from the game Genshin Impact.
arXiv Detail & Related papers (2023-09-19T21:01:40Z)
- Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text [48.36706154871577]
We introduce a novel dataset termed HPPT (ChatGPT-polished academic abstracts).
It diverges from extant corpora by comprising pairs of human-written and ChatGPT-polished abstracts instead of purely ChatGPT-generated texts.
We also propose the "Polish Ratio" method, an innovative measure of the degree of modification made by ChatGPT compared to the original human-written text (an edit-similarity sketch follows this entry).
arXiv Detail & Related papers (2023-07-21T06:38:37Z)
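The Polish Ratio entry above measures how much ChatGPT modified a human-written abstract. The paper defines its own measure; the edit-similarity proxy below only conveys the idea.

```python
# Rough proxy for a "degree of modification" score between a human-written
# abstract and its ChatGPT-polished version. The paper defines its own
# Polish Ratio; this SequenceMatcher-based version is only an illustration.
from difflib import SequenceMatcher

def modification_degree(original: str, polished: str) -> float:
    """0.0 = identical texts, 1.0 = completely rewritten."""
    similarity = SequenceMatcher(None, original, polished).ratio()
    return 1.0 - similarity

human = "We study evaluation metrics for text style transfer."
polished = "We investigate evaluation metrics for the task of text style transfer."
print(f"{modification_degree(human, polished):.2f}")
```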
- MetricPrompt: Prompting Model as a Relevance Metric for Few-shot Text Classification [65.51149771074944]
MetricPrompt eases verbalizer design difficulty by reformulating the few-shot text classification task into a text pair relevance estimation task (a generic pair-relevance sketch follows this entry).
We conduct experiments on three widely used text classification datasets across four few-shot settings.
Results show that MetricPrompt outperforms manual verbalizer and other automatic verbalizer design methods across all few-shot settings.
arXiv Detail & Related papers (2023-06-15T06:51:35Z)
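The MetricPrompt entry above recasts classification as scoring the relevance of text pairs. In the sketch below, sentence-embedding cosine similarity stands in for the paper's prompted relevance model; the model name and support examples are illustrative.

```python
# Generic sketch of classification-as-pair-relevance, in the spirit of
# MetricPrompt. The paper scores pairs with a prompted language model;
# here sentence-embedding cosine similarity is a stand-in relevance function.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

support = [  # few-shot labeled examples
    ("the plot was gripping from start to finish", "positive"),
    ("a dull, forgettable film", "negative"),
]

def classify(query: str) -> str:
    """Aggregate pairwise relevance per label and return the best label."""
    q_emb = model.encode(query, convert_to_tensor=True)
    scores = {}
    for text, label in support:
        rel = util.cos_sim(q_emb, model.encode(text, convert_to_tensor=True)).item()
        scores[label] = scores.get(label, 0.0) + rel
    return max(scores, key=scores.get)

print(classify("i loved every minute of it"))  # expected: positive
```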
- ChatGPT vs Human-authored Text: Insights into Controllable Text Summarization and Sentence Style Transfer [8.64514166615844]
We conduct a systematic inspection of ChatGPT's performance in two controllable generation tasks.
We evaluate the faithfulness of the generated text, and compare the model's performance with human-authored texts.
We observe that ChatGPT sometimes incorporates factual errors or hallucinations when adapting the text to suit a specific style.
arXiv Detail & Related papers (2023-06-13T14:21:35Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- GPT-Sentinel: Distinguishing Human and ChatGPT Generated Content [27.901155229342375]
We present a novel approach for detecting ChatGPT-generated vs. human-written text using language models (a minimal fine-tuning sketch follows this entry).
Our models achieved remarkable results, with an accuracy of over 97% on the test dataset, as evaluated through various metrics.
arXiv Detail & Related papers (2023-05-13T17:12:11Z)
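The GPT-Sentinel entry above trains language models as detectors. The minimal fine-tuning sketch below uses a RoBERTa sequence classifier; the model choice, toy data, and training arguments are assumptions rather than the paper's actual configuration.

```python
# Minimal sketch of a binary ChatGPT-vs-human text detector: fine-tune a
# RoBERTa sequence classifier. Setup is a generic assumption; the paper's
# actual models and data pipeline may differ.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
import datasets

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# label 0 = human-written, label 1 = ChatGPT-generated (toy placeholder data)
train = datasets.Dataset.from_dict({
    "text": ["an abstract written by a person", "an abstract produced by ChatGPT"],
    "label": [0, 1],
})
train = train.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length"))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="detector", num_train_epochs=3),
    train_dataset=train,
)
trainer.train()
```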
- Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study [63.27346930921658]
ChatGPT is capable of evaluating text quality effectively from various perspectives without references.
The Explicit Score, which utilizes ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable method among the three explored approaches.
arXiv Detail & Related papers (2023-04-03T05:29:58Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based evaluation framework that comprehensively compares generated and reference texts from both objective and subjective aspects (a compact multi-role prompting sketch follows this entry).
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
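The role-player entry above has an LLM judge summaries from several evaluator perspectives. The sketch below assigns one persona per criterion via system prompts; the personas, criteria, and model name are assumptions for illustration.

```python
# Sketch of multi-role LLM evaluation in the spirit of the role-player
# framework: one evaluator persona per criterion. Personas, criteria, and
# scale are illustrative assumptions, not the paper's configuration.
from openai import OpenAI

client = OpenAI()

ROLES = {
    "grammar": "You are a strict copy editor. Rate grammatical correctness 1-5.",
    "informativeness": "You are a busy reader. Rate informativeness 1-5.",
    "succinctness": "You are a headline writer. Rate succinctness 1-5.",
}

def role_play_scores(summary: str, reference: str) -> dict:
    """Collect one score per criterion, each judged under a different persona."""
    scores = {}
    for criterion, persona in ROLES.items():
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": persona},
                {"role": "user",
                 "content": f"Reference: {reference}\nSummary: {summary}\nScore (number only):"},
            ],
            temperature=0,
        ).choices[0].message.content
        scores[criterion] = reply.strip()
    return scores
```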
- Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation [85.32991360774447]
Natural language generation (NLG) spans a broad range of tasks, each of which serves specific objectives.
We propose a unifying perspective based on the nature of information change in NLG tasks.
We develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks (a rough alignment-based sketch follows this entry).
arXiv Detail & Related papers (2021-09-14T01:00:42Z)
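The CTC entry above frames NLG evaluation as measuring information alignment between texts. The sketch below approximates alignment with contextual token embeddings (average, over output tokens, of the max cosine similarity to any source token); the paper trains dedicated alignment models, so this is only a rough simplification.

```python
# Rough embedding-based information-alignment score, a simplification of the
# learned alignment models in the CTC framework.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(text: str) -> torch.Tensor:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        return enc(**inputs).last_hidden_state[0]  # (seq_len, hidden)

def alignment(output: str, source: str) -> float:
    """Mean over output tokens of max cosine similarity to any source token."""
    out_e = torch.nn.functional.normalize(token_embeddings(output), dim=-1)
    src_e = torch.nn.functional.normalize(token_embeddings(source), dim=-1)
    sim = out_e @ src_e.T  # pairwise cosine similarities
    return sim.max(dim=1).values.mean().item()

print(alignment("the cat sat", "a cat was sitting on the mat"))
```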
This list is automatically generated from the titles and abstracts of the papers on this site.