Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification
- URL: http://arxiv.org/abs/2507.15557v1
- Date: Mon, 21 Jul 2025 12:38:07 GMT
- Title: Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification
- Authors: Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, Alexander Panchenko
- Abstract summary: We perform the first comprehensive multilingual study on the evaluation of text detoxification systems across nine languages. We assess the effectiveness of modern neural-based evaluation models alongside prompting-based LLM-as-a-judge approaches. Our findings provide a practical recipe for designing more reliable multilingual TST evaluation pipelines.
- Score: 66.69370876902222
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite recent progress in large language models (LLMs), evaluation of text generation tasks such as text style transfer (TST) remains a significant challenge. Recent studies (Dementieva et al., 2024; Pauli et al., 2025) revealed a substantial gap between automatic metrics and human judgments. Moreover, most prior work focuses exclusively on English, leaving multilingual TST evaluation largely unexplored. In this paper, we perform the first comprehensive multilingual study on the evaluation of text detoxification systems across nine languages: English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, and Amharic. Drawing inspiration from machine translation, we assess the effectiveness of modern neural-based evaluation models alongside prompting-based LLM-as-a-judge approaches. Our findings provide a practical recipe for designing more reliable multilingual TST evaluation pipelines for the text detoxification case.
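Detoxification evaluation is commonly decomposed into style accuracy (STA), content similarity (SIM), and fluency (FL), multiplied into a per-sentence joint score and averaged over a corpus. The sketch below illustrates that decomposition only; the three scorers are toy stand-ins for the neural models a real pipeline would use (a toxicity classifier, an embedding-similarity model, and a fluency/acceptability model), and the tiny lexicon and example sentences are made up for illustration.

```python
def style_accuracy(output: str) -> float:
    """Toy STA: 1.0 if no word from a tiny toxic lexicon appears (stand-in for a toxicity classifier)."""
    toxic_lexicon = {"idiot", "stupid", "trash"}
    words = {w.strip(".,!?").lower() for w in output.split()}
    return 0.0 if words & toxic_lexicon else 1.0

def content_similarity(source: str, output: str) -> float:
    """Toy SIM: word-level Jaccard overlap (stand-in for embedding similarity)."""
    a = {w.strip(".,!?").lower() for w in source.split()}
    b = {w.strip(".,!?").lower() for w in output.split()}
    return len(a & b) / len(a | b) if a | b else 0.0

def fluency(output: str) -> float:
    """Toy FL: fixed placeholder (stand-in for an LM-based acceptability score)."""
    return 1.0

def joint_score(source: str, output: str) -> float:
    """Per-sentence joint score J = STA * SIM * FL."""
    return style_accuracy(output) * content_similarity(source, output) * fluency(output)

src = "you are a stupid person"
hyp = "you are not a nice person"
print(round(joint_score(src, hyp), 3))  # prints 0.571
```

The multiplicative form means a detoxified output scores zero if it stays toxic (STA = 0) or shares no content with the source (SIM = 0), which is the intended behavior of such joint metrics.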
Related papers
- Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics? [9.234136424254261]
Text style transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Using human evaluation is ideal but costly, as is common in other natural language processing (NLP) tasks. In this paper, we examine both existing and novel metrics from broader NLP tasks for TST evaluation.
arXiv Detail & Related papers (2025-02-07T07:39:17Z)
- Multilingual and Explainable Text Detoxification with Parallel Corpora [58.83211571400692]
We extend the parallel text detoxification corpus to new languages. We conduct the first-of-its-kind automated, explainable analysis of the descriptive features of both toxic and non-toxic sentences. We then experiment with a novel text detoxification method inspired by the Chain-of-Thought reasoning approach.
arXiv Detail & Related papers (2024-12-16T12:08:59Z)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer language generation and instruction-following capabilities to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
- Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification [77.45995868988301]
Text detoxification is the task of transferring the style of text from toxic to neutral.
We present a large-scale study of strategies for cross-lingual text detoxification.
arXiv Detail & Related papers (2023-11-23T11:40:28Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Evaluating the Evaluation Metrics for Style Transfer: A Case Study in Multilingual Formality Transfer [11.259786293913606]
This work is the first multilingual evaluation of metrics in style transfer (ST).
We evaluate leading ST automatic metrics on the oft-researched task of formality style transfer.
We identify several models that correlate well with human judgments and are robust across languages.
arXiv Detail & Related papers (2021-10-20T17:21:09Z)
- Methods for Detoxification of Texts for the Russian Language [55.337471467610094]
We introduce the first study of automatic detoxification of Russian texts to combat offensive language.
We test two types of models: an unsupervised approach that performs local corrections and a supervised approach based on the pretrained GPT-2 language model.
The results show that the tested approaches can be successfully used for detoxification, although there is room for improvement.
arXiv Detail & Related papers (2021-05-19T10:37:44Z)
- XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation [93.80733419450225]
This paper analyzes the current state of cross-lingual transfer learning.
We extend XTREME to XTREME-R, which consists of an improved set of ten natural language understanding tasks.
arXiv Detail & Related papers (2021-04-15T12:26:12Z)
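Several of the papers above judge automatic metrics by how strongly they correlate with human judgments. A standard check is Spearman's rank correlation between a metric's scores and human ratings over the same outputs. The sketch below implements it from scratch for illustration (a real study would use a library such as SciPy); the two score lists are made-up numbers, not data from any of these papers.

```python
def ranks(xs):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

metric_scores = [0.91, 0.42, 0.77, 0.15, 0.60]  # hypothetical automatic metric
human_scores = [5, 2, 4, 1, 3]                  # hypothetical human ratings
print(round(spearman(metric_scores, human_scores), 3))  # prints 1.0
```

A rho near 1 means the metric orders outputs the same way humans do, which is what "correlate well with human judgments" operationally means in this line of work.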
This list is automatically generated from the titles and abstracts of the papers in this site.