Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing
- URL: http://arxiv.org/abs/2510.01243v1
- Date: Wed, 24 Sep 2025 03:40:32 GMT
- Title: Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing
- Authors: Yisong Xiao, Aishan Liu, Siyuan Liang, Zonghao Ying, Xianglong Liu, Dacheng Tao
- Abstract summary: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content. We propose Autoregressive Reward Guided Representation Editing (ARGRE), a test-time detoxification framework. ARGRE explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content, necessitating detoxification strategies to ensure safe and responsible deployment. Test-time detoxification methods, which typically introduce static or dynamic interventions into LLM representations, offer a promising solution due to their flexibility and minimal invasiveness. However, current approaches often suffer from imprecise interventions, primarily due to their insufficient exploration of the transition space between toxic and non-toxic outputs. To address this challenge, we propose Autoregressive Reward Guided Representation Editing (ARGRE), a novel test-time detoxification framework that explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing. ARGRE identifies non-toxic semantic directions and interpolates between toxic and non-toxic representations to reveal fine-grained transition trajectories. These trajectories transform sparse toxicity annotations into dense training signals, enabling the construction of an autoregressive reward model that delivers stable and precise editing guidance. At inference, the reward model guides an adaptive two-step editing process to obtain detoxified representations: it first performs directional steering based on expected reward gaps to shift representations toward non-toxic regions, followed by lightweight gradient-based refinements. Extensive experiments across 8 widely used LLMs show that ARGRE significantly outperforms leading baselines in effectiveness (-62.21% toxicity) and efficiency (-47.58% inference time), while preserving the core capabilities of the original model with minimal degradation. Our code is available at the website.
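The trajectory construction and adaptive two-step edit described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: a fixed linear function stands in for the learned autoregressive reward model, and the names (`w`, `interpolate`, `edit`) are assumptions introduced for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
w = rng.normal(size=d)      # stand-in reward direction (higher = less toxic)

def reward(h):
    """Linear proxy for the learned autoregressive reward model."""
    return float(w @ h)

def reward_grad(h):
    """Gradient of the linear proxy reward is constant."""
    return w

def interpolate(h_toxic, h_nontoxic, steps=5):
    """Fine-grained transition trajectory between a toxic and a non-toxic
    representation; intermediate points turn sparse toxicity annotations
    into dense training signals for the reward model."""
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1 - a) * h_toxic + a * h_nontoxic for a in alphas]

def edit(h, h_nontoxic_ref, target_reward, lr=0.1, refine_steps=3):
    """Adaptive two-step edit: directional steering scaled by the expected
    reward gap, then lightweight gradient-based refinement."""
    direction = h_nontoxic_ref - h
    gap = target_reward - reward(h)
    scale = gap / (reward(h_nontoxic_ref) - reward(h) + 1e-8)
    h = h + np.clip(scale, 0.0, 1.0) * direction   # step 1: steering
    for _ in range(refine_steps):                  # step 2: refinement
        h = h + lr * reward_grad(h)
    return h
```

In this toy setup, editing a representation toward a higher-reward reference raises its proxy reward, mirroring how ARGRE shifts hidden states toward non-toxic regions before the fine-grained gradient refinement.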
Related papers
- Detoxifying LLMs via Representation Erasure-Based Preference Optimization [44.29978832356216]
Large language models (LLMs) trained on web-scale data can produce toxic outputs. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations. We propose Representation Erasure-based Preference Optimization (REPO), reformulating detoxification as a token-level preference problem.
arXiv Detail & Related papers (2026-02-24T22:51:06Z)
- Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention [6.808534332444413]
Large Language Models (LLMs) are powerful text generators. LLMs can produce toxic or harmful content even when given seemingly harmless prompts. This presents a serious safety challenge and can cause real-world harm.
arXiv Detail & Related papers (2026-02-06T11:33:17Z)
- Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models [14.566005698357747]
Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms. We introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect and correct toxic content. Our findings underscore the potential for truly self-regulated language models, paving the way for more responsible and ethically guided text generation systems.
arXiv Detail & Related papers (2026-01-16T21:01:26Z)
- Projecting Out the Malice: A Global Subspace Approach to LLM Detoxification [73.77171973106567]
Large language models (LLMs) exhibit exceptional performance but pose inherent risks of generating toxic content. Traditional methods fail to eliminate the underlying toxic regions in model parameters, leaving models vulnerable to adversarial attacks. We propose GLOSS, a lightweight method that mitigates toxicity by identifying a global toxic subspace in FFN parameters and eliminating it.
arXiv Detail & Related papers (2026-01-09T09:34:53Z)
- Evolving Prompts for Toxicity Search in Large Language Models [3.2729350470429783]
ToxSearch is an evolutionary framework that tests model safety by evolving prompts in a steady-state loop. We observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halved on most targets. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming.
arXiv Detail & Related papers (2025-11-16T07:47:31Z)
- Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective [104.09817371557476]
Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks. Their potential to generate harmful content has raised serious safety concerns. We introduce three novel multi-label benchmarks for toxicity detection.
arXiv Detail & Related papers (2025-10-16T06:50:33Z)
- Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization [23.328207651816957]
The dissemination of toxic content on social media poses a serious threat to online environments and public discourse. Existing approaches often struggle to simultaneously achieve strong detoxification performance, semantic preservation, and generalization to out-of-distribution data. We propose a two-stage training framework that jointly optimizes for data efficiency, semantic preservation, and model generalization.
arXiv Detail & Related papers (2025-06-23T05:48:10Z)
- Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing [49.85884082568318]
ToxEdit is a toxicity-aware knowledge editing approach. It dynamically detects toxic activation patterns during forward propagation. It then routes computations through adaptive inter-layer pathways to mitigate toxicity effectively.
arXiv Detail & Related papers (2025-05-28T12:37:06Z)
- Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction in large language models (LLMs).
SASA tracks the margin of the current output and steers generation away from the toxic subspace by adjusting the autoregressive sampling strategy.
It is evaluated on LLMs of different scales and natures, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, using the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
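The margin-adjusted sampling that SASA describes can be sketched as follows. This is an illustrative toy, not the paper's components: the linear toxic/non-toxic boundary `v` and the embedding table `E` are random stand-ins for the learned classifier and the model's token embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d = 10, 8
E = rng.normal(size=(vocab, d))   # stand-in token embedding table
v = rng.normal(size=d)            # stand-in toxic/non-toxic boundary normal

def adjusted_sampling(logits, context_vec, beta=5.0):
    """Shift next-token logits by each candidate token's margin to the
    toxic subspace, steering sampling toward non-toxic continuations."""
    margins = (context_vec + E) @ v        # margin if each token is appended
    shifted = logits + beta * margins
    p = np.exp(shifted - shifted.max())    # stable softmax
    return p / p.sum()

logits = rng.normal(size=vocab)
ctx = rng.normal(size=d)
p = adjusted_sampling(logits, ctx)
```

Because the adjustment happens purely at sampling time, no model weights are touched, which is what makes this family of methods lightweight.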
arXiv Detail & Related papers (2024-10-04T17:45:15Z)
- Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing [63.20133320524577]
We show that editing a small subset of parameters can effectively modulate specific behaviors of large language models (LLMs). Our approach achieves reductions of up to 90.0% in toxicity on the RealToxicityPrompts dataset and 49.2% on ToxiGen.
arXiv Detail & Related papers (2024-07-11T17:52:03Z) - DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion [16.989349884904943]
Current solutions involving finetuning or auxiliary models usually require extensive computational resources.
We propose DeStein, a novel method that detoxifies LMs by applying representation engineering in activation spaces with lower resource and time costs.
arXiv Detail & Related papers (2024-04-16T11:07:48Z) - Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs).
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance.
arXiv Detail & Related papers (2024-03-21T15:18:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.