Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification
- URL: http://arxiv.org/abs/2505.16722v2
- Date: Mon, 30 Jun 2025 22:55:54 GMT
- Title: Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification
- Authors: Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen
- Abstract summary: "Cross-lingual Detoxification" is a paradigm that mitigates toxicity in large language models. We analyze toxicity reduction in cross-distribution settings and investigate how mitigation impacts model performance on non-toxic tasks.
- Score: 31.7516400680833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification's effectiveness through 392 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at https://github.com/himanshubeniwal/Breaking-mBad.
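The paper's approach is supervised fine-tuning (SFT) for detoxification, with transfer evaluated across high- and low-resource languages and script families. As a rough, hedged illustration only, the sketch below fine-tunes a small multilingual causal LM on toy toxic-to-neutral rewrite pairs with Hugging Face Transformers; the model choice, prompt template, and data are assumptions for illustration, not the authors' released setup (see their repository for that).

```python
# Minimal SFT sketch for detoxification (illustrative only).
# Assumptions: model choice, prompt template, and the toy pairs are
# placeholders, not the setup from the Breaking-mBad repository.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "bigscience/bloom-560m"  # any small multilingual causal LM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Toy toxic -> neutral rewrite pairs in a single high-resource language.
pairs = [
    {"toxic": "You are an idiot for thinking that.",
     "neutral": "I see it differently, but let's discuss it."},
    {"toxic": "Shut up, nobody cares about your opinion.",
     "neutral": "Let's give someone else a chance to speak."},
]

def to_features(example):
    # Frame detoxification as a rewrite instruction the model completes.
    text = (f"Rewrite without toxicity: {example['toxic']}\n"
            f"Rewrite: {example['neutral']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=128)

train_ds = Dataset.from_list(pairs).map(
    to_features, remove_columns=["toxic", "neutral"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="detox-sft",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           logging_steps=1,
                           report_to="none"),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In the cross-lingual setting the paper studies, the training pairs would come from one language while evaluation prompts come from another language or script family, turning the same loop into a transfer experiment.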
Related papers
- GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace [62.68664365246247]
This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs).
We propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the FFN parameters.
arXiv Detail & Related papers (2025-05-20T08:29:11Z)
- Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation [6.781972039785424]
Recent generative large language models (LLMs) show remarkable performance in non-English languages.
When prompted in those languages, they tend to express higher harmful social biases and toxicity levels.
We investigate the impact of different finetuning methods on the model's bias and toxicity, as well as on its ability to produce fluent and diverse text.
arXiv Detail & Related papers (2024-12-18T17:05:08Z)
- PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models [27.996123856250065]
Existing toxicity benchmarks are overwhelmingly focused on English.
We introduce PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark of 425K naturally occurring prompts spanning 17 languages.
arXiv Detail & Related papers (2024-05-15T14:22:33Z)
- Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs).
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance.
arXiv Detail & Related papers (2024-03-21T15:18:30Z)
- From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models [10.807067327137855]
As language models embrace multilingual capabilities, it is crucial that our safety measures keep pace.
In the absence of sufficient annotated datasets across languages, we employ translated data to evaluate and enhance our mitigation techniques.
This allows us to examine the effects of translation quality and cross-lingual transfer on toxicity mitigation.
arXiv Detail & Related papers (2024-03-06T17:51:43Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs), combined with their impressive capabilities, may lead to new safety issues when they are exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting.
We propose a reinforcement learning (RL) based attack method to further induce implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification [77.45995868988301]
Text detoxification is the task of transferring the style of text from toxic to neutral.
We present a large-scale study of strategies for cross-lingual text detoxification.
arXiv Detail & Related papers (2023-11-23T11:40:28Z)
- Language Detoxification with Attribute-Discriminative Latent Space [59.167432249229584]
Transformer-based Language Models (LMs) have achieved impressive results on natural language understanding tasks.
They can also generate toxic text such as insults, threats, and profanity, limiting their real-world applications.
We propose an effective yet efficient method for language detoxification using an attribute-discriminative latent space.
arXiv Detail & Related papers (2022-10-19T06:54:42Z)
- Exploring Cross-lingual Textual Style Transfer with Large Multilingual Language Models [78.12943085697283]
Detoxification is the task of generating text in a polite style while preserving the meaning and fluency of the original toxic text.
This work investigates multilingual and cross-lingual detoxification and the behavior of large multilingual models in this setting.
arXiv Detail & Related papers (2022-06-05T20:02:30Z)
- Reward Modeling for Mitigating Toxicity in Transformer-based Language Models [0.0]
Transformer-based language models are able to generate fluent text and be efficiently adapted across various natural language generation tasks.
Language models pretrained on large unlabeled web text corpora have been shown to degenerate into toxic content and exhibit social biases.
We propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models.
arXiv Detail & Related papers (2022-02-19T19:26:22Z)
- Detoxifying Language Models Risks Marginalizing Minority Voices [40.918564746367586]
Language models (LMs) must be both safe and equitable to be responsibly deployed in practice.
Detoxification techniques have been proposed to mitigate toxic LM generations.
We show that current detoxification techniques hurt equity: they decrease the utility of LMs on language used by marginalized groups.
arXiv Detail & Related papers (2021-04-13T17:52:01Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
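Several of the entries above (RealToxicityPrompts, PolygloToxicityPrompts) revolve around prompted toxicity evaluation: feed a language model a set of prompts, sample continuations, and score them with a toxicity classifier. The sketch below is a minimal, hedged illustration of that loop; the prompts, the generator, and the Detoxify multilingual scorer are stand-ins, not either benchmark's official prompt set or evaluation pipeline.

```python
# Sketch of prompted toxic-degeneration evaluation (illustrative only).
# The prompts, generator, and scorer are stand-ins, not the official
# pipeline of RealToxicityPrompts or PolygloToxicityPrompts.
from detoxify import Detoxify
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")  # swap in any LM
scorer = Detoxify("multilingual")  # off-the-shelf toxicity classifier

prompts = [
    "I can't believe those people are so",
    "Honestly, the new neighbors are",
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=20, do_sample=True)
    continuation = out[0]["generated_text"][len(prompt):]  # score only the new text
    toxicity = scorer.predict(continuation)["toxicity"]
    print(f"{prompt!r} -> toxicity={toxicity:.3f}")
```

The benchmarks themselves aggregate such scores over many sampled continuations per prompt (e.g., expected maximum toxicity) rather than the single sample per prompt shown here.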