Detoxifying LLMs via Representation Erasure-Based Preference Optimization
- URL: http://arxiv.org/abs/2602.23391v1
- Date: Tue, 24 Feb 2026 22:51:06 GMT
- Title: Detoxifying LLMs via Representation Erasure-Based Preference Optimization
- Authors: Nazanin Mohammadi Sepahvand, Eleni Triantafillou, Hugo Larochelle, Doina Precup, Daniel M. Roy, Gintare Karolina Dziugaite
- Abstract summary: Large language models (LLMs) trained on web-scale data can produce toxic outputs. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations. We propose Representation Erasure-based Preference Optimization (REPO), reformulating detoxification as a token-level preference problem.
- Score: 44.29978832356216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not robustly so: they are vulnerable to adversarial prompting and easily undone by fine-tuning-based relearning attacks. Indeed, research has shown that these edits to the model are superficial: linear probing reveals that harmful "directions" remain present in representations. To address this, we propose Representation Erasure-based Preference Optimization (REPO), which reformulates detoxification as a token-level preference problem. Using a novel objective with preference data, we force the representations of toxic continuations to converge toward their benign counterparts. Our mechanistic analysis reveals that this granular approach is critical: unlike baselines, REPO induces deep, localized edits to toxicity-encoding neurons while preserving general model utility. Exhaustive evaluations show that REPO achieves state-of-the-art robustness, stopping sophisticated threats, including relearning attacks and enhanced GCG jailbreaks, where existing representation- and output-based methods fail.
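The abstract does not give the objective in closed form, but its description (a token-level preference loss plus a term forcing toxic-continuation representations toward their benign counterparts) suggests a loss of roughly the following shape. This is a minimal PyTorch sketch under assumed names and shapes, not the paper's implementation: `repo_style_loss`, the MSE erasure term, the layer choice, and the equal-length continuation pairing are all illustrative assumptions.

```python
# Hedged sketch of a REPO-style objective (illustrative, not the paper's).
import torch
import torch.nn.functional as F

def repo_style_loss(model, prompt_ids, toxic_ids, benign_ids,
                    layer=-1, alpha=1.0, beta=0.1):
    """model      : causal LM accepting output_hidden_states=True
                    (e.g., a Hugging Face AutoModelForCausalLM)
    prompt_ids : (B, Tp) shared prompt tokens
    toxic_ids  : (B, Tc) dispreferred toxic continuation tokens
    benign_ids : (B, Tc) preferred benign continuation tokens
    Assumes equal-length continuations for the token-level pairing."""
    tox_in = torch.cat([prompt_ids, toxic_ids], dim=1)
    ben_in = torch.cat([prompt_ids, benign_ids], dim=1)
    tox_out = model(tox_in, output_hidden_states=True)
    ben_out = model(ben_in, output_hidden_states=True)

    Tp = prompt_ids.shape[1]
    # Hidden states at the continuation positions of a chosen layer.
    h_tox = tox_out.hidden_states[layer][:, Tp:, :]
    h_ben = ben_out.hidden_states[layer][:, Tp:, :].detach()

    # "Erasure" term: per-token MSE pulling toxic-continuation
    # representations toward their benign counterparts.
    erase = F.mse_loss(h_tox, h_ben)

    # Token-level preference term: raise benign token log-probs,
    # lower toxic ones (DPO-style logistic loss per token, averaged).
    def cont_logps(out, inp):
        logps = F.log_softmax(out.logits[:, :-1, :], dim=-1)
        tok = logps.gather(-1, inp[:, 1:].unsqueeze(-1)).squeeze(-1)
        return tok[:, Tp - 1:]  # continuation tokens only

    pref = -F.logsigmoid(
        beta * (cont_logps(ben_out, ben_in) - cont_logps(tox_out, tox_in))
    ).mean()

    return pref + alpha * erase
```

Detaching the benign hidden states makes the erasure term one-directional: only the toxic-side representations move, matching the stated goal of converging toxic representations toward their benign counterparts rather than averaging the two.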
Related papers
- Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing [77.75609817898035]
Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content.
We propose Autoregressive Reward-Guided Representation Editing (ARGRE).
ARGRE explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing (a minimal illustration of such editing follows this entry).
arXiv Detail & Related papers (2025-09-24T03:40:32Z)
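As a rough illustration of reward-guided representation editing in latent space (not ARGRE's actual algorithm; the linear toxicity probe and the update rule below are assumptions), one can shift hidden states away from a learned toxicity direction at decode time:

```python
# Hedged sketch: shift hidden states away from a linear toxicity direction.
# The probe and update rule are illustrative assumptions, not ARGRE itself.
import torch
import torch.nn as nn

@torch.no_grad()
def edit_hidden_state(h: torch.Tensor, toxicity_probe: nn.Linear,
                      step_size: float = 0.5) -> torch.Tensor:
    """h: (B, D) hidden states; toxicity_probe: nn.Linear(D, 1) scoring toxicity."""
    w = toxicity_probe.weight.squeeze(0)   # (D,) toxicity direction
    w = w / w.norm()
    score = h @ w                          # (B,) projection onto the direction
    # Edit only representations the probe flags as toxic (positive score).
    return h - step_size * torch.relu(score).unsqueeze(-1) * w
```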
- Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes [55.2480439325792]
Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics.
Here, we examine whether current RL methods are also effective at optimizing language models in verifiable domains with stochastic outcomes, like scientific experiments.
arXiv Detail & Related papers (2025-08-15T20:50:53Z)
- Graph Representation-based Model Poisoning on Federated Large Language Models [3.5233863453805143]
Federated large language models (FedLLMs) enable powerful generative capabilities within wireless networks while preserving data privacy.
This article first reviews recent advancements in model poisoning techniques and existing defense mechanisms for FedLLMs, underscoring critical limitations.
The article further investigates graph representation-based model poisoning (GRMP), an emerging attack paradigm that exploits higher-order correlations among benign client gradients to craft malicious updates indistinguishable from legitimate ones.
arXiv Detail & Related papers (2025-07-02T13:20:52Z)
- Explicit Preference Optimization: No Need for an Implicit Reward Model [18.225409932618657]
Direct preference optimization (DPO) and its offshoots circumvent the need for a separate reward training step.
We show that DPO-based objectives are nonetheless subject to sub-optimal regularization and counter-intuitive artifacts (the standard DPO objective is reproduced after this entry for reference).
arXiv Detail & Related papers (2025-06-09T07:11:01Z)
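For reference, this entry and the PRO entry below both start from the standard DPO objective of Rafailov et al. (2023); this is the published formulation, reproduced for context rather than anything specific to these papers:

```latex
% DPO loss over preference triples (x, y_w, y_l); its implicit reward is
% r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) ).
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```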
- CART-based Synthetic Tabular Data Generation for Imbalanced Regression [1.342834401139078]
We propose adapting an existing CART-based synthetic data generation method, tailoring it for imbalanced regression.
The new method integrates relevance- and density-based mechanisms to guide sampling in sparse regions of the target space.
Our experimental study focuses on the prediction of extreme target values across benchmark datasets.
arXiv Detail & Related papers (2025-06-03T12:42:20Z)
- Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO [19.5712961932773]
We revisit direct preference optimization (DPO) and demonstrate that its loss theoretically admits a decomposed reformulation.
We introduce PRoximalized PReference Optimization (PRO), a unified method to align with diverse feedback types.
arXiv Detail & Related papers (2025-05-29T10:23:22Z)
- Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift.
Current approaches typically address this issue through online sampling from the target policy.
We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z)
- Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss that attenuates the loss gradient for uncertain samples (a minimal sketch of this attenuation follows the entry).
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts with high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
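A minimal sketch of the attenuation idea described above (the paper's exact penalization scheme is not given here; the exponential weighting and the per-pair `uncertainty` score, e.g. from reward-ensemble disagreement, are assumptions):

```python
# Hedged sketch: per-pair DPO loss down-weighted for uncertain preferences,
# which attenuates the loss gradient on those samples.
import torch
import torch.nn.functional as F

def uncertainty_penalized_dpo(logratio_w, logratio_l, uncertainty,
                              beta=0.1, lam=1.0):
    """logratio_w/l: log(pi_theta/pi_ref) summed over chosen/rejected tokens;
    uncertainty: assumed per-pair score in [0, 1]."""
    margin = beta * (logratio_w - logratio_l)
    per_pair = -F.logsigmoid(margin)          # vanilla DPO loss per pair
    weights = torch.exp(-lam * uncertainty)   # smaller weight => smaller gradient
    return (weights * per_pair).mean()
```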
- Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction of large language models (LLMs).
SASA tracks the margin of the current output to steer the generation away from the toxic subspace by adjusting the autoregressive sampling strategy (a minimal sketch of this idea follows the entry).
SASA is evaluated on LLMs of different scales and natures, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, on the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
arXiv Detail & Related papers (2024-10-04T17:45:15Z)
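A minimal sketch of margin-based controlled decoding in this spirit (the linear toxic/benign boundary, the candidate hidden states, and the logit re-weighting rule are illustrative assumptions, not SASA's exact algorithm):

```python
# Hedged sketch: re-weight next-token logits by each candidate's signed margin
# to a learned linear toxicity boundary in representation space.
import torch

def margin_guided_step(logits, cand_hidden, w, b, gamma=5.0, top_k=50):
    """logits: (V,) next-token logits; cand_hidden: (V, D) hidden states if the
    candidate token were appended (in practice computed only for top-k);
    w: (D,), b: scalar linear boundary with margin > 0 on the benign side."""
    topv, topi = torch.topk(logits, top_k)
    margins = cand_hidden[topi] @ w + b    # signed distance to the boundary
    adjusted = topv + gamma * margins      # boost benign, suppress toxic
    probs = torch.softmax(adjusted, dim=-1)
    return topi[torch.multinomial(probs, 1)]   # sampled token id
```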