SmurfCat at PAN 2024 TextDetox: Alignment of Multilingual Transformers for Text Detoxification
- URL: http://arxiv.org/abs/2407.05449v2
- Date: Wed, 10 Jul 2024 14:44:18 GMT
- Title: SmurfCat at PAN 2024 TextDetox: Alignment of Multilingual Transformers for Text Detoxification
- Authors: Elisei Rykov, Konstantin Zaytsev, Ivan Anisimov, Alexandr Voronin,
- Abstract summary: This paper presents the SmurfCat team's solution for the Multilingual Text Detoxification task in the PAN-2024 competition.
Using data augmentation through machine translation and a special filtering procedure, we collected an additional multilingual parallel dataset for text detoxification.
We fine-tuned several multilingual sequence-to-sequence models, such as mT0 and Aya, on a text detoxification task.
- Score: 41.94295877935867
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents the SmurfCat team's solution for the Multilingual Text Detoxification task in the PAN-2024 competition. Using data augmentation through machine translation and a special filtering procedure, we collected an additional multilingual parallel dataset for text detoxification. Using the obtained data, we fine-tuned several multilingual sequence-to-sequence models, such as mT0 and Aya, on the text detoxification task. We applied the ORPO alignment technique to the final model. Our final model has only 3.7 billion parameters and achieves state-of-the-art results for the Ukrainian language and near state-of-the-art results for other languages. In the competition, our team achieved first place in the automated evaluation with a score of 0.52 and second place in the final human evaluation with a score of 0.74.
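The abstract mentions applying ORPO (Odds Ratio Preference Optimization) alignment to the fine-tuned model. As a rough illustration of what that objective does, the sketch below computes ORPO's odds-ratio penalty from length-normalized log-probabilities of a preferred (detoxified) and a dispreferred (toxic) continuation. This is a minimal sketch of the published ORPO formulation, not the team's implementation; the weighting coefficient `lam` is an assumed placeholder value.

```python
import math

def orpo_penalty(avg_logp_chosen, avg_logp_rejected, lam=0.1):
    """Odds-ratio penalty of ORPO, added on top of the standard
    NLL loss of the chosen (detoxified) sequence.

    avg_logp_*: length-normalized log-probability of each sequence
    under the model, i.e. log P(y|x) / |y| (always negative).
    lam: penalty weight (an assumed value, not from the paper).
    """
    def log_odds(logp):
        p = math.exp(logp)
        return logp - math.log(1.0 - p)  # log(p / (1 - p))

    # Log odds ratio between the chosen and rejected continuations.
    ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -lam * log(sigmoid(ratio)): shrinks toward 0 as the model
    # assigns higher odds to the chosen (non-toxic) continuation.
    return -lam * math.log(1.0 / (1.0 + math.exp(-ratio)))

def orpo_loss(nll_chosen, avg_logp_chosen, avg_logp_rejected, lam=0.1):
    """Full ORPO objective: supervised NLL plus the odds-ratio term."""
    return nll_chosen + orpo_penalty(avg_logp_chosen, avg_logp_rejected, lam)
```

In practice this objective is minimized per batch during fine-tuning, so alignment needs no separate reference model, which matches the single-model setup described in the abstract.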
Related papers
- SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators [61.82799141938912]
Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets.
We introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset.
arXiv Detail & Related papers (2025-02-10T12:30:25Z)
- LuxVeri at GenAI Detection Task 1: Inverse Perplexity Weighted Ensemble for Robust Detection of AI-Generated Text across English and Multilingual Contexts [0.8495482945981923]
This paper presents a system developed for Task 1 of the COLING 2025 Workshop on Detecting AI-Generated Content.
Our approach utilizes an ensemble of models, with weights assigned according to each model's inverse perplexity, to enhance classification accuracy.
Our results demonstrate the effectiveness of inverse perplexity weighting in improving the robustness of machine-generated text detection across both monolingual and multilingual settings.
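The summary above describes weighting ensemble members by each model's inverse perplexity. A minimal sketch of that weighting scheme, under the assumption that each detector reports an average negative log-likelihood on the input (perplexity is then `exp(avg_nll)`); the function names are illustrative, not from the paper:

```python
import math

def inverse_perplexity_weights(avg_nlls):
    """Normalize inverse perplexities into ensemble weights.

    avg_nlls: per-model average negative log-likelihood of the input,
    so perplexity_i = exp(avg_nll_i) and weight_i is proportional
    to 1 / perplexity_i. Lower perplexity -> larger weight.
    """
    inv_ppl = [1.0 / math.exp(nll) for nll in avg_nlls]
    total = sum(inv_ppl)
    return [w / total for w in inv_ppl]

def ensemble_score(probs, weights):
    """Weighted average of per-model P(text is machine-generated)."""
    return sum(p * w for p, w in zip(probs, weights))
```

The effect is that a model which finds the input less surprising (lower perplexity, i.e. presumably better calibrated for that language) contributes more to the final classification score.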
arXiv Detail & Related papers (2025-01-21T06:32:32Z)
- Multilingual and Explainable Text Detoxification with Parallel Corpora [58.83211571400692]
We extend the parallel text detoxification corpus to new languages.
We conduct a first-of-its-kind automated, explainable analysis of the descriptive features of both toxic and non-toxic sentences.
We then experiment with a novel text detoxification method inspired by the Chain-of-Thought reasoning approach.
arXiv Detail & Related papers (2024-12-16T12:08:59Z)
- MultiParaDetox: Extending Text Detoxification with Parallel Data to New Languages [71.50809576484288]
Text detoxification is a task where a text is paraphrased from a toxic surface form, e.g. featuring rude words, into a neutral register.
Recent approaches for parallel text detoxification corpora collection -- ParaDetox and APPADIA -- were explored only in a monolingual setup.
In this work, we aim to extend the ParaDetox pipeline to multiple languages, presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language.
arXiv Detail & Related papers (2024-04-02T15:32:32Z)
- Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification [77.45995868988301]
Text detoxification is the task of transferring the style of text from toxic to neutral.
We present a large-scale study of strategies for cross-lingual text detoxification.
arXiv Detail & Related papers (2023-11-23T11:40:28Z)
- Tackling Low-Resourced Sign Language Translation: UPC at WMT-SLT 22 [4.382973957294345]
This paper describes the system developed at the Universitat Politecnica de Catalunya for the Workshop on Machine Translation 2022 Sign Language Translation Task.
We use a Transformer model implemented with the Fairseq modeling toolkit.
We experimented with the vocabulary size, data augmentation techniques, and pretraining the model with the PHOENIX-14T dataset.
arXiv Detail & Related papers (2022-12-02T12:42:24Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.