Related papers: MultiParaDetox: Extending Text Detoxification with Parallel Data to New Languages

URL: http://arxiv.org/abs/2404.02037v1
Date: Tue, 2 Apr 2024 15:32:32 GMT
Title: MultiParaDetox: Extending Text Detoxification with Parallel Data to New Languages
Authors: Daryna Dementieva, Nikolay Babakov, Alexander Panchenko
Abstract summary: Text detoxification is a task where a text is paraphrased from a toxic surface form, e.g. featuring rude words, to the neutral register. Recent approaches for parallel text detoxification corpora collection -- ParaDetox and APPADIA -- were explored only in a monolingual setup. In this work, we aim to extend the ParaDetox pipeline to multiple languages, presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language.
Score: 71.50809576484288
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text detoxification is a textual style transfer (TST) task where a text is paraphrased from a toxic surface form, e.g. featuring rude words, to the neutral register. Recently, text detoxification methods have found applications in various tasks, such as detoxification of Large Language Models (LLMs) (Leong et al., 2023; He et al., 2024; Tang et al., 2023) and combating toxic speech in social networks (Deng et al., 2023; Mun et al., 2023; Agarwal et al., 2023). All these applications are extremely important for ensuring safe communication in the modern digital world. However, the previous approaches for parallel text detoxification corpora collection -- ParaDetox (Logacheva et al., 2022) and APPADIA (Atwell et al., 2022) -- were explored only in a monolingual setup. In this work, we aim to extend the ParaDetox pipeline to multiple languages, presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language. Then, we experiment with different text detoxification models -- from unsupervised baselines to LLMs and models fine-tuned on the presented parallel corpora -- showing the great benefit of having a parallel corpus for obtaining state-of-the-art text detoxification models for any language.

Related papers

Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification [66.69370876902222]
We perform the first comprehensive multilingual study on the evaluation of text detoxification systems across nine languages. We assess the effectiveness of modern neural-based evaluation models alongside prompting-based LLM-as-a-judge approaches. Our findings provide a practical recipe for designing more reliable multilingual TST evaluation pipelines.
arXiv Detail & Related papers (2025-07-21T12:38:07Z)
SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators [61.82799141938912]
Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. We introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset.
arXiv Detail & Related papers (2025-02-10T12:30:25Z)
Multilingual and Explainable Text Detoxification with Parallel Corpora [58.83211571400692]
We extend the parallel text detoxification corpus to new languages. We conduct a first-of-its-kind automated, explainable analysis of the descriptive features of both toxic and non-toxic sentences. We then experiment with a novel text detoxification method inspired by the Chain-of-Thoughts reasoning approach.
arXiv Detail & Related papers (2024-12-16T12:08:59Z)
Toxic Subword Pruning for Dialogue Response Generation on Large Language Models [51.713448010799986]
We propose Toxic Subword Pruning (ToxPrune) to prune the subwords contained in toxic words from the BPE vocabulary of trained LLMs. ToxPrune also noticeably improves the toxic language model NSFW-3B on the task of dialogue response generation.
arXiv Detail & Related papers (2024-10-05T13:30:33Z)
SmurfCat at PAN 2024 TextDetox: Alignment of Multilingual Transformers for Text Detoxification [41.94295877935867]
This paper presents the SmurfCat team's solution for the Multilingual Text Detoxification task in the PAN-2024 competition. Using data augmentation through machine translation and a special filtering procedure, we collected an additional multilingual parallel dataset for text detoxification. We fine-tuned several multilingual sequence-to-sequence models, such as mT0 and Aya, on a text detoxification task.
arXiv Detail & Related papers (2024-07-07T17:19:34Z)
Text Detoxification as Style Transfer in English and Hindi [1.183205689022649]
This paper focuses on text detoxification, i.e., automatically converting toxic text into non-toxic text. We present three approaches: knowledge transfer from a similar task, a multi-task learning approach, and a delete-and-reconstruct approach. Our results demonstrate that our approach effectively balances text detoxification with preserving the actual content and maintaining fluency.
arXiv Detail & Related papers (2024-02-12T16:30:41Z)
Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification [77.45995868988301]
Text detoxification is the task of transferring the style of text from toxic to neutral. We present a large-scale study of strategies for cross-lingual text detoxification.
arXiv Detail & Related papers (2023-11-23T11:40:28Z)
Exploring Cross-lingual Textual Style Transfer with Large Multilingual Language Models [78.12943085697283]
Detoxification is the task of generating text in a polite style while preserving the meaning and fluency of the original toxic text. This work investigates multilingual and cross-lingual detoxification and the behavior of large multilingual models in this setting.
arXiv Detail & Related papers (2022-06-05T20:02:30Z)
Russian Texts Detoxification with Levenshtein Editing [0.0]
We build a two-step tagging-based detoxification model using a parallel corpus of Russian texts. We achieve the best style transfer accuracy among all models in the RUSSE Detox shared task, surpassing larger sequence-to-sequence models.
arXiv Detail & Related papers (2022-04-28T16:58:17Z)
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages. We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)