UniDetox: Universal Detoxification of Large Language Models via Dataset Distillation
- URL: http://arxiv.org/abs/2504.20500v1
- Date: Tue, 29 Apr 2025 07:40:00 GMT
- Title: UniDetox: Universal Detoxification of Large Language Models via Dataset Distillation
- Authors: Huimin Lu, Masaru Isonuma, Junichiro Mori, Ichiro Sakata
- Abstract summary: UniDetox is a method designed to mitigate toxicity across various large language models (LLMs). We propose a novel and efficient dataset distillation technique for detoxification using contrastive decoding. Experiments demonstrate that the detoxifying text distilled from GPT-2 can effectively detoxify larger models, including OPT, Falcon, and LLaMA-2.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present UniDetox, a universally applicable method designed to mitigate toxicity across various large language models (LLMs). Previous detoxification methods are typically model-specific, addressing only individual models or model families, and require careful hyperparameter tuning due to the trade-off between detoxification efficacy and language modeling performance. In contrast, UniDetox provides a detoxification technique that can be universally applied to a wide range of LLMs without the need for separate model-specific tuning. Specifically, we propose a novel and efficient dataset distillation technique for detoxification using contrastive decoding. This approach distills detoxifying representations in the form of synthetic text data, enabling universal detoxification of any LLM through fine-tuning with the distilled text. Our experiments demonstrate that the detoxifying text distilled from GPT-2 can effectively detoxify larger models, including OPT, Falcon, and LLaMA-2. Furthermore, UniDetox eliminates the need for separate hyperparameter tuning for each model, as a single hyperparameter configuration can be seamlessly applied across different models. Additionally, analysis of the detoxifying text reveals a reduction in politically biased content, providing insights into the attributes necessary for effective detoxification of LLMs.
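The paper's code is not reproduced here, but the two-stage idea reads roughly as follows: distill detoxifying text by contrastive decoding against a toxicity-fine-tuned anti-expert, then fine-tune any target model on that text. In the sketch below, the checkpoint name `gpt2-toxic`, the DExperts-style combination rule `(1 + w) * base - w * anti`, and all hyperparameters are illustrative assumptions, not the paper's exact recipe.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2").eval()
anti = AutoModelForCausalLM.from_pretrained("gpt2-toxic").eval()  # hypothetical anti-expert

@torch.no_grad()
def distill(prompt: str, n_tokens: int = 64, w: float = 1.0) -> str:
    """Stage 1: sample 'detoxifying' text via contrastive decoding."""
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(n_tokens):
        lb = base(ids).logits[:, -1, :]   # base model next-token logits
        la = anti(ids).logits[:, -1, :]   # anti-expert next-token logits
        scores = (1 + w) * lb - w * la    # steer sampling away from toxicity
        nxt = torch.multinomial(torch.softmax(scores, dim=-1), 1)
        ids = torch.cat([ids, nxt], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)

# Stage 2: fine-tune any target LLM on the distilled text.
target_tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
optim = torch.optim.AdamW(target.parameters(), lr=1e-5)
for text in [distill("A respectful conversation:") for _ in range(4)]:
    batch = target_tok(text, return_tensors="pt")
    loss = target(**batch, labels=batch["input_ids"]).loss
    loss.backward(); optim.step(); optim.zero_grad()
```
Because the artifact is the distilled text rather than the distilling model, the same corpus and a single hyperparameter configuration can in principle be reused across OPT, Falcon, and LLaMA-2, which is the universality claim above.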
Related papers
- SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. We introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset.
arXiv Detail & Related papers (2025-02-10T12:30:25Z)
- Large Language Models can be Strong Self-Detoxifiers
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction in large language models (LLMs).
SASA tracks the margin of the current output to steer generation away from the toxic subspace by adjusting the autoregressive sampling strategy.
It is evaluated on LLMs of different scales and natures, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, with the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
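A minimal sketch of margin-guided sampling in this spirit, assuming a pre-trained linear toxicity probe on the LM's final hidden state (the placeholders `probe_w`/`probe_b` below) and re-scoring only the top-k candidates for tractability; SASA's exact closed-form adjustment is not reproduced:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
probe_w = torch.randn(lm.config.hidden_size)  # placeholder: learned non-toxic direction
probe_b = 0.0

@torch.no_grad()
def margin_guided_step(ids: torch.Tensor, k: int = 20, beta: float = 5.0) -> torch.Tensor:
    logits = lm(ids).logits[:, -1, :]
    topk = torch.topk(logits, k)
    cand_ids = topk.indices[0]
    margins = []
    for c in cand_ids:  # margin of each candidate continuation under the probe
        ext = torch.cat([ids, c.view(1, 1)], dim=-1)
        h = lm(ext, output_hidden_states=True).hidden_states[-1][0, -1]
        margins.append(h @ probe_w + probe_b)
    margins = torch.stack(margins)
    # Re-weight the sampling distribution toward higher (less toxic) margins.
    probs = torch.softmax(topk.values[0], dim=-1) * torch.exp(beta * margins)
    nxt = cand_ids[torch.multinomial(probs / probs.sum(), 1)]
    return torch.cat([ids, nxt.view(1, 1)], dim=-1)
```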
arXiv Detail & Related papers (2024-10-04T17:45:15Z)
- Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
We show that editing a small subset of parameters can effectively modulate specific behaviors of large language models (LLMs).
Our approach achieves reductions of up to 90.0% in toxicity on the RealToxicityPrompts dataset and 49.2% on ToxiGen.
arXiv Detail & Related papers (2024-07-11T17:52:03Z)
- Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
We introduce a tuning-free alignment alternative, ProFS, and demonstrate its effectiveness in the use case of toxicity reduction. ProFS identifies a toxic subspace in the model's parameter space and reduces toxicity by projecting away the detected subspace. We show that ProFS is more sample-efficient than DPO and more robust to noisy data.
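A simplified sketch of the projection step, assuming the toxic subspace is estimated from (toxic - non-toxic) hidden-state differences `diffs` and applied to one weight matrix whose output feeds the residual stream; the paper's centering and layer-selection details are omitted:
```python
import torch

def project_out(W: torch.Tensor, diffs: torch.Tensor, rank: int = 2) -> torch.Tensor:
    # Top-`rank` right singular vectors of the difference matrix approximate
    # the toxic directions in activation space.
    _, _, Vt = torch.linalg.svd(diffs, full_matrices=False)
    V = Vt[:rank].T                      # (d, rank) orthonormal basis
    P = torch.eye(W.shape[0]) - V @ V.T  # projector onto the complement
    return P @ W                         # W no longer writes to toxic directions

# Usage (illustrative): for a chosen nn.Linear `lin` with output dim d,
# lin.weight.data = project_out(lin.weight.data, diffs)
```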
arXiv Detail & Related papers (2024-05-22T20:08:48Z)
- Detoxifying Large Language Models via Knowledge Editing
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs).
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance.
arXiv Detail & Related papers (2024-03-21T15:18:30Z)
- DetoxLLM: A Framework for Detoxification with Explanations
We propose DetoxLLM, the first comprehensive end-to-end detoxification framework.
We first introduce a cross-platform pseudo-parallel corpus, built through multi-step data processing and generation strategies.
We show that our detoxification models outperform the SoTA model trained on a human-annotated parallel corpus.
arXiv Detail & Related papers (2024-02-25T01:56:47Z)
- Parameter-Efficient Detoxification with Contrastive Decoding
We introduce Detoxification Generator (DETOXIGEN), an inference-time algorithm that steers the generation away from unwanted styles.
During the actual generation, we use the trained detoxifier to produce undesirable tokens for the generator to contrast against at each decoding step.
We find that it significantly outperforms previous approaches on detoxification metrics without compromising generation quality.
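Unlike a distillation approach that produces training data, this contrast runs at every decoding step of the deployed model. A compact sketch, with a hypothetical detoxifier checkpoint `gpt2-detoxifier` and an illustrative strength `alpha`:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
gen = AutoModelForCausalLM.from_pretrained("gpt2").eval()
detox = AutoModelForCausalLM.from_pretrained("gpt2-detoxifier").eval()  # hypothetical

@torch.no_grad()
def contrast_step(ids: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    lp_gen = torch.log_softmax(gen(ids).logits[:, -1, :], dim=-1)
    lp_bad = torch.log_softmax(detox(ids).logits[:, -1, :], dim=-1)
    # Downweight tokens the detoxifier marks as likely (i.e., undesirable).
    nxt = torch.multinomial(torch.softmax(lp_gen - alpha * lp_bad, dim=-1), 1)
    return torch.cat([ids, nxt], dim=-1)
```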
arXiv Detail & Related papers (2024-01-13T01:46:20Z)
- CMD: a framework for Context-aware Model self-Detoxification
Text detoxification aims to minimize the risk of language models producing toxic content.
Existing detoxification methods fail to achieve a decent balance between detoxification effectiveness and generation quality.
We introduce a Context-aware Model self-Detoxification (CMD) framework that attends to both the context and the detoxification process.
arXiv Detail & Related papers (2023-08-16T11:50:38Z)
- CFL: Causally Fair Language Models Through Token-level Attribute Controlled Generation
We propose a method to control the attributes of Language Models (LMs) for the text generation task using Causal Average Treatment Effect (ATE) scores and counterfactual augmentation.
We explore this method, in the context of LM detoxification, and propose the Causally Fair Language (CFL) architecture for detoxifying pre-trained LMs in a plug-and-play manner.
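A toy version of the causal ATE score for one token, assuming counterfactuals are formed by masking and `toxicity_score` is any off-the-shelf classifier; the paper's exact estimator and plug-and-play integration are not reproduced:
```python
def token_ate(words: list[str], idx: int, toxicity_score, mask: str = "[MASK]") -> float:
    """ATE of the token at `idx` on predicted toxicity: factual minus counterfactual."""
    factual = " ".join(words)
    counterfactual = " ".join(w if i != idx else mask for i, w in enumerate(words))
    return toxicity_score(factual) - toxicity_score(counterfactual)

# Usage (illustrative): tokens with high ATE are the ones a generation-time
# controller would penalize or replace.
```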
arXiv Detail & Related papers (2023-06-01T06:13:51Z)
- Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models
We explore domain-adaptive training to reduce the toxicity of language models.
For the training corpus, we propose to leverage the generative power of LMs.
We then comprehensively study LMs with parameter sizes ranging from 126M up to 530B, a scale that has never been studied before.
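A minimal sketch of the self-generate, filter, and continue-pre-training loop, with a placeholder `toxicity_score` classifier standing in for whatever filter the paper uses:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def toxicity_score(text: str) -> float:
    """Placeholder: swap in any off-the-shelf toxicity classifier."""
    return 0.0

# 1) Self-generate candidate training text from the LM itself.
prompts = tok("The weather today", return_tensors="pt")
gen = lm.generate(**prompts, max_new_tokens=40, do_sample=True,
                  num_return_sequences=8, pad_token_id=tok.eos_token_id)
corpus = [tok.decode(g, skip_special_tokens=True) for g in gen]

# 2) Keep only samples the classifier deems non-toxic.
clean = [t for t in corpus if toxicity_score(t) < 0.5]

# 3) Continue pre-training (domain-adaptive training) on the filtered corpus.
optim = torch.optim.AdamW(lm.parameters(), lr=1e-5)
for text in clean:
    batch = tok(text, return_tensors="pt")
    loss = lm(**batch, labels=batch["input_ids"]).loss
    loss.backward(); optim.step(); optim.zero_grad()
```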
arXiv Detail & Related papers (2022-02-08T22:10:40Z)