Related papers: Mind the Style Gap: Meta-Evaluation of Style and Attribute Transfer Metrics

URL: http://arxiv.org/abs/2502.15022v3
Date: Thu, 12 Jun 2025 08:58:45 GMT
Title: Mind the Style Gap: Meta-Evaluation of Style and Attribute Transfer Metrics
Authors: Amalie Brogaard Pauli, Isabelle Augenstein, Ira Assent
Abstract summary: This paper presents a large meta-evaluation of metrics for evaluating style and attribute transfer. We find that meta-evaluation studies on existing datasets lead to misleading conclusions about the suitability of metrics for content preservation. We introduce a new test set specifically designed for evaluating content preservation metrics for style transfer.
Score: 41.052284715017606
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) make it easy to rewrite a text in any style -- e.g. to make it more polite, persuasive, or more positive -- but evaluation thereof is not straightforward. A challenge lies in measuring content preservation: that content not attributable to style change is retained. This paper presents a large meta-evaluation of metrics for evaluating style and attribute transfer, focusing on content preservation. We find that meta-evaluation studies on existing datasets lead to misleading conclusions about the suitability of metrics for content preservation. Widely used metrics show a high correlation with human judgments despite being deemed unsuitable for the task -- because they do not abstract from style changes when evaluating content preservation. We show that the overly high correlations with human judgment stem from the nature of the test data. To address this issue, we introduce a new, challenging test set specifically designed for evaluating content preservation metrics for style transfer. Using this dataset, we demonstrate that suitable metrics for content preservation for style transfer indeed are style-aware. To support efficient evaluation, we propose a new style-aware method that utilises small language models, obtaining a higher alignment with human judgements than prompting a model of a similar size as an autorater.
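As an illustration of what a meta-evaluation of this kind involves, the sketch below computes a Spearman rank correlation between a metric's scores and human content-preservation ratings. It is a minimal sketch, not the authors' code: the abstract does not say which correlation statistic the paper uses, and the function names and toy numbers are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one human rating and one metric score per rewrite.
human_ratings = [1, 2, 3, 4, 5]
metric_scores = [0.1, 0.3, 0.2, 0.8, 0.9]

print(round(spearman(human_ratings, metric_scores), 3))  # prints 0.9
```

A high value here only says the metric ranks these particular examples as humans do. The abstract's point is that such a number can be inflated by the test data itself, so the same computation on a harder test set may rank metrics differently.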

Related papers

StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples [48.44036251656947]
Style representations aim to embed texts with similar writing styles closely and texts with different styles far apart, regardless of content. We introduce StyleDistance, a novel approach to training stronger content-independent style embeddings.
arXiv Detail & Related papers (2024-10-16T17:25:25Z)
LMStyle Benchmark: Evaluating Text Style Transfer for Chatbots [0.0]
LMStyle Benchmark is an evaluation framework applicable to chat-style text style transfer (C-TST). In addition to style strength metrics, LMStyle Benchmark considers a novel aspect of metrics called appropriateness. Our experiments demonstrate that the new evaluation methods have a higher correlation with human judgments in terms of appropriateness.
arXiv Detail & Related papers (2024-03-13T20:19:30Z)
Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged. In this paper, we study if there are any deficiencies in reference-free metrics. We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
Few-shot Image Generation via Style Adaptation and Content Preservation [60.08988307934977]
We introduce an image translation module to GAN transferring, where the module teaches the generator to separate style and content. Our method consistently surpasses the state-of-the-art methods in the few-shot setting.
arXiv Detail & Related papers (2023-11-30T01:16:53Z)
Prefix-Tuning Based Unsupervised Text Style Transfer [29.86587278794342]
Unsupervised text style transfer aims at training a generative model that can alter the style of the input sentence while preserving its content. In this paper, we employ powerful pre-trained large language models and present a new prefix-tuning-based method for unsupervised text style transfer.
arXiv Detail & Related papers (2023-10-23T06:13:08Z)
ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer [57.6482608202409]
Textual style transfer is the task of transforming stylistic properties of text while preserving meaning. We introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles. We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.
arXiv Detail & Related papers (2023-08-29T17:36:02Z)
APPLS: Evaluating Evaluation Metrics for Plain Language Summarization [18.379461020500525]
This study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for Plain Language Summarization (PLS). We identify four PLS criteria from previous work and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. Using APPLS, we assess the performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations.
arXiv Detail & Related papers (2023-05-23T17:59:19Z)
Measuring the Measuring Tools: An Automatic Evaluation of Semantic Metrics for Text Corpora [5.254054636427663]
The ability to compare the semantic similarity between text corpora is important in a variety of natural language processing applications. We propose a set of automatic and interpretable measures for assessing the characteristics of corpus-level semantic similarity metrics.
arXiv Detail & Related papers (2022-11-29T14:47:07Z)
SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations. We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences. Our results show that system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
Enhancing Content Preservation in Text Style Transfer Using Reverse Attention and Conditional Layer Normalization [15.444996697848266]
A common approach is to map a given sentence to a content representation that is free of style, and to feed that representation to a decoder with a target style. Previous style-filtering methods completely remove style-bearing tokens at the token level, which incurs a loss of content information. We propose to enhance content preservation by implicitly removing the style information of each token with reverse attention, and thereby retain the content.
arXiv Detail & Related papers (2021-08-01T12:54:46Z)
GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation. It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA). We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
Exploring Contextual Word-level Style Relevance for Unsupervised Style Transfer [60.07283363509065]
Unsupervised style transfer aims to change the style of an input sentence while preserving its original content. We propose a novel attentional sequence-to-sequence model that exploits the relevance of each output word to the target style. Experimental results show that our proposed model achieves state-of-the-art performance in terms of both transfer accuracy and content preservation.
arXiv Detail & Related papers (2020-05-05T10:24:28Z)
Politeness Transfer: A Tag and Generate Approach [167.9924201435888]
This paper introduces a new task of politeness transfer. It involves converting non-polite sentences to polite sentences while preserving the meaning. We design a tag and generate pipeline that identifies stylistic attributes and subsequently generates a sentence in the target style.
arXiv Detail & Related papers (2020-04-29T15:08:53Z)
Extending Text Informativeness Measures to Passage Interestingness Evaluation (Language Model vs. Word Embedding) [1.2998637003026272]
This paper defines the concept of Interestingness as a generalization of Informativeness. We then study the ability of state-of-the-art Informativeness measures to cope with this generalization. We prove that the CLEF-INEX Tweet Contextualization 2012 Logarithm Similarity measure provides the best results.
arXiv Detail & Related papers (2020-04-14T18:22:48Z)
Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer. In detail, the input is a set of structured records and a reference text for describing another recordset. The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.