GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages
- URL: http://arxiv.org/abs/2510.01250v1
- Date: Wed, 24 Sep 2025 10:06:40 GMT
- Title: GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages
- Authors: Trung Duc Anh Dang, Ferdinando Pio D'Elia
- Abstract summary: We describe our submission to the PAN 2025 Multilingual Text Detoxification Challenge. We apply parameter-efficient LoRA SFT fine-tuning and prompting techniques such as few-shot and Chain-of-Thought. Our system ranks first on both high-resource and low-resource languages.
- Score: 32.22353317193898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As social-media platforms emerge and evolve faster than the regulations meant to oversee them, automated detoxification might serve as a timely tool for moderators to enforce safe discourse at scale. We here describe our submission to the PAN 2025 Multilingual Text Detoxification Challenge, which rewrites toxic single-sentence inputs into neutral paraphrases across 15 typologically diverse languages. Building on a 12B-parameter Gemma-3 multilingual transformer, we apply parameter-efficient LoRA SFT fine-tuning and prompting techniques like few-shot and Chain-of-Thought. Our multilingual training corpus combines 3,600 human-authored parallel pairs, 21,600 machine-translated synthetic pairs, and model-generated pairs filtered by Jaccard thresholds. At inference, inputs are enriched with three LaBSE-retrieved neighbors and explicit toxic-span annotations. Evaluated via Style Transfer Accuracy, LaBSE-based semantic preservation, and xCOMET fluency, our system ranks first on high-resource and low-resource languages. Ablations show +0.081 joint score increase from few-shot examples and +0.088 from basic CoT prompting. ANOVA analysis identifies language resource status as the strongest predictor of performance ($\eta^2$ = 0.667, p < 0.01).
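The abstract mentions filtering model-generated pairs by Jaccard thresholds. A minimal sketch of such a filter is below; the token-level similarity measure and the specific threshold values (`lower`, `upper`) are illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical Jaccard-threshold filter for synthetic detoxification pairs:
# keep a (toxic, detoxified) pair only when the two sentences overlap enough
# to preserve content but differ enough to plausibly remove toxicity.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def keep_pair(toxic: str, detoxified: str,
              lower: float = 0.3, upper: float = 0.9) -> bool:
    """Accept a pair only inside a similarity band: below `lower` the
    paraphrase likely drifts in meaning, above `upper` it likely copies
    the toxic input verbatim. Thresholds here are assumed values."""
    return lower <= jaccard(toxic, detoxified) <= upper

pairs = [
    ("you are a total idiot", "you are mistaken"),       # moderate overlap
    ("you are a total idiot", "you are a total idiot"),  # verbatim copy
    ("you are a total idiot", "nice weather today"),     # meaning drift
]
filtered = [p for p in pairs if keep_pair(*p)]
```

In this toy run only the first pair survives: the verbatim copy exceeds the upper bound and the unrelated sentence falls below the lower one.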
Related papers
- Text Detoxification in isiXhosa and Yorùbá: A Cross-Lingual Machine Learning Approach for Low-Resource African Languages [0.0]
Toxic language is one of the major barriers to safe online participation, yet robust mitigation tools are scarce for African languages. This study investigates automatic text detoxification (toxic-to-neutral rewriting) for two low-resource African languages, isiXhosa and Yorùbá.
arXiv Detail & Related papers (2026-01-09T08:28:58Z) - ylmmcl at Multilingual Text Detoxification 2025: Lexicon-Guided Detoxification and Classifier-Gated Rewriting [0.0]
In this work, we introduce our solution for the Multilingual Text Detoxification Task in the PAN-2025 competition for the ylmmcl team. Our approach departs from prior unsupervised or monolingual pipelines by leveraging explicit toxic word annotation via the multilingual_toxic_lexicon. Our model achieves the highest STA (0.922) from our previous attempts, and an average official J score of 0.612 for toxic inputs in both the development and test sets.
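The "classifier-gated rewriting" named above can be sketched as a generate-then-filter loop: candidate rewrites are proposed and a toxicity classifier gates which one is emitted. The generator and classifier below are stand-in stubs for illustration; the ylmmcl team's actual components are not described in this listing.

```python
# Sketch of classifier-gated rewriting: return the first candidate rewrite
# that the toxicity classifier accepts, falling back to the original text.

from typing import Callable, List

def classifier_gated_rewrite(
    text: str,
    generate: Callable[[str], List[str]],  # proposes candidate rewrites
    is_toxic: Callable[[str], bool],       # gating toxicity classifier
) -> str:
    """Emit the first non-toxic candidate; keep the input if none passes."""
    for candidate in generate(text):
        if not is_toxic(candidate):
            return candidate
    return text

# Toy stand-ins for demonstration only (a real system would use a trained
# rewriter model and a learned toxicity classifier).
BANNED = {"idiot", "stupid"}

def toy_generate(t: str) -> List[str]:
    # Candidate 1: drop lexicon-flagged words; candidate 2: the original.
    return [" ".join(w for w in t.split() if w not in BANNED), t]

def toy_is_toxic(t: str) -> bool:
    return any(w in BANNED for w in t.split())

out = classifier_gated_rewrite("that is a stupid idea", toy_generate, toy_is_toxic)
```

The gate makes the pipeline conservative: a rewrite is only surfaced when the classifier judges it non-toxic, otherwise the system degrades gracefully to the input.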
arXiv Detail & Related papers (2025-07-24T19:38:15Z) - Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification [66.69370876902222]
We perform the first comprehensive multilingual study on the evaluation of text detoxification systems across nine languages. We assess the effectiveness of modern neural-based evaluation models alongside prompting-based LLM-as-a-judge approaches. Our findings provide a practical recipe for designing more reliable multilingual TST evaluation pipelines.
arXiv Detail & Related papers (2025-07-21T12:38:07Z) - Multilingual and Explainable Text Detoxification with Parallel Corpora [58.83211571400692]
We extend the parallel text detoxification corpus to new languages. We conduct a first-of-its-kind automated, explainable analysis of the descriptive features of both toxic and non-toxic sentences. We then experiment with a novel text detoxification method inspired by the Chain-of-Thoughts reasoning approach.
arXiv Detail & Related papers (2024-12-16T12:08:59Z) - SmurfCat at PAN 2024 TextDetox: Alignment of Multilingual Transformers for Text Detoxification [41.94295877935867]
This paper presents a solution for the Multilingual Text Detoxification task in the PAN-2024 competition of the SmurfCat team.
Using data augmentation through machine translation and a special filtering procedure, we collected an additional multilingual parallel dataset for text detoxification.
We fine-tuned several multilingual sequence-to-sequence models, such as mT0 and Aya, on a text detoxification task.
arXiv Detail & Related papers (2024-07-07T17:19:34Z) - MultiParaDetox: Extending Text Detoxification with Parallel Data to New Languages [71.50809576484288]
Text detoxification is a task where a text is paraphrased from a toxic surface form, e.g. featuring rude words, to the neutral register.
Recent approaches for parallel text detoxification corpora collection -- ParaDetox and APPADIA -- were explored only in monolingual setup.
In this work, we aim to extend the ParaDetox pipeline to multiple languages, presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language.
arXiv Detail & Related papers (2024-04-02T15:32:32Z) - Wav2Gloss: Generating Interlinear Glossed Text from Speech [78.64412090339044]
We propose Wav2Gloss, a task in which four linguistic annotation components are extracted automatically from speech.
We provide various baselines to lay the groundwork for future research on Interlinear Glossed Text generation from speech.
arXiv Detail & Related papers (2024-03-19T21:45:29Z) - Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification [77.45995868988301]
Text detoxification is the task of transferring the style of text from toxic to neutral.
We present a large-scale study of strategies for cross-lingual text detoxification.
arXiv Detail & Related papers (2023-11-23T11:40:28Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)