Projecting Out the Malice: A Global Subspace Approach to LLM Detoxification
- URL: http://arxiv.org/abs/2601.06226v1
- Date: Fri, 09 Jan 2026 09:34:53 GMT
- Title: Projecting Out the Malice: A Global Subspace Approach to LLM Detoxification
- Authors: Zenghao Duan, Zhiyi Yin, Zhichao Shi, Liang Pang, Shaoling Jing, Zihe Huang, Jiayi Wu, Yu Yan, Jingcheng Deng, Huawei Shen, Xueqi Cheng
- Abstract summary: Large language models (LLMs) exhibit exceptional performance but pose inherent risks of generating toxic content. Traditional methods fail to eliminate underlying toxic regions in parameters, leaving models vulnerable to adversarial attacks. We propose GLOSS, a lightweight method that mitigates toxicity by identifying and eliminating the global toxic subspace from FFN parameters.
- Score: 73.77171973106567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) exhibit exceptional performance but pose inherent risks of generating toxic content, restricting their safe deployment. While traditional methods (e.g., alignment) adjust output preferences, they fail to eliminate underlying toxic regions in parameters, leaving models vulnerable to adversarial attacks. Prior mechanistic studies characterize toxic regions as "toxic vectors" or "layer-wise subspaces", yet our analysis identifies critical limitations: i) removed toxic vectors can be reconstructed via linear combinations of non-toxic vectors, demanding that the entire toxic subspace be targeted; ii) contrastive objectives over limited samples inject noise into layer-wise subspaces, hindering stable extraction. These findings highlight the challenge of identifying a robust toxic subspace and removing it. We therefore propose GLOSS (GLobal tOxic Subspace Suppression), a lightweight method that mitigates toxicity by identifying and eliminating this global subspace from FFN parameters. Experiments on LLMs (e.g., Qwen3) show GLOSS achieves SOTA detoxification while preserving general capabilities, without requiring large-scale retraining. WARNING: This paper contains content which is toxic in nature.
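The abstract describes GLOSS only at a high level: find a global toxic subspace shared across FFN layers and project it out of the weights. As a rough illustration of that kind of subspace suppression (not the authors' exact procedure; the shapes, `rank`, and the SVD-based basis extraction are assumptions), a minimal PyTorch sketch:

```python
import torch

def suppress_toxic_subspace(W: torch.Tensor,
                            toxic_dirs: torch.Tensor,
                            rank: int = 8) -> torch.Tensor:
    """Project an estimated global toxic subspace out of an FFN weight matrix.

    W:          (d_model, d_ffn) FFN down-projection weight (illustrative shape).
    toxic_dirs: (n, d_model) candidate toxic directions, e.g. FFN value vectors
                that write toxic features into the residual stream.
    rank:       assumed dimensionality of the global toxic subspace.
    """
    # A low-rank orthonormal basis for the span of the toxic directions:
    # targeting the whole subspace, not individual "toxic vectors".
    _, _, Vh = torch.linalg.svd(toxic_dirs.float(), full_matrices=False)
    V = Vh[:rank].T                                   # (d_model, rank)

    # Orthogonal projector onto the complement of that subspace.
    P = torch.eye(W.shape[0], dtype=W.dtype) - (V @ V.T).to(W.dtype)

    # Each column of the edited weight loses its component along the
    # estimated toxic directions; everything orthogonal is untouched.
    return P @ W
```

Because the projector removes the entire span rather than individual vectors, a removed direction cannot be rebuilt from linear combinations of the remaining ones, which is exactly the failure mode the abstract attributes to vector-level removal.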
Related papers
- Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention [6.808534332444413]
Large Language Models (LLMs) are powerful text generators, yet they can produce toxic or harmful content even when given seemingly harmless prompts. This presents a serious safety challenge and can cause real-world harm.
arXiv Detail & Related papers (2026-02-06T11:33:17Z)
- Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing [77.75609817898035]
Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content. We propose Autoregressive Reward Guided Representation Editing (ARGRE). ARGRE explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing (a rough sketch follows this entry).
arXiv Detail & Related papers (2025-09-24T03:40:32Z)
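Only the framing survives in this one-line summary, so the following is a generic sketch of reward-guided representation editing rather than ARGRE's published algorithm (the reward model, step count, and step size are all assumptions):

```python
import torch

def reward_guided_edit(hidden: torch.Tensor,
                       reward_model: torch.nn.Module,
                       steps: int = 3,
                       lr: float = 0.1) -> torch.Tensor:
    """Edit a latent representation toward a higher (non-toxic) reward.

    hidden:       (d,) hidden state at the editing site.
    reward_model: maps a (d,) representation to a scalar non-toxicity score.
    """
    h = hidden.detach().clone().requires_grad_(True)
    for _ in range(steps):
        reward = reward_model(h)
        # Ascend the reward surface in representation space; small steps
        # keep the edit close to the original state for stability.
        grad = torch.autograd.grad(reward, h)[0]
        h = (h + lr * grad).detach().requires_grad_(True)
    return h.detach()
```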
- GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace [62.68664365246247]
This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs). We propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the FFN parameters.
arXiv Detail & Related papers (2025-05-20T08:29:11Z)
- Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction in large language models (LLMs).
SASA tracks the margin of the current output to the toxic subspace and steers generation away from it by adjusting the autoregressive sampling strategy (see the sketch after this entry).
SASA is evaluated on LLMs of different scales and types, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, with the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
arXiv Detail & Related papers (2024-10-04T17:45:15Z)
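A minimal sketch of margin-steered decoding in that spirit (the proxy for the next-step representation, the linear boundary `w_toxic`, and the `beta` weighting are assumptions, not SASA's published formulation):

```python
import torch
import torch.nn.functional as F

def margin_steered_step(logits: torch.Tensor,
                        hidden: torch.Tensor,
                        token_emb: torch.Tensor,
                        w_toxic: torch.Tensor,
                        beta: float = 5.0) -> torch.Tensor:
    """One margin-steered sampling step (illustrative only).

    logits:    (vocab,) raw next-token logits from the LM.
    hidden:    (d,) current context representation.
    token_emb: (vocab, d) output embeddings, a cheap proxy for the
               representation after emitting each candidate token.
    w_toxic:   (d,) normal vector of a toxic/non-toxic linear boundary
               learned offline in the same representation space.
    """
    # Signed margin for each candidate continuation: positive means
    # farther from the toxic side of the boundary, negative means toward it.
    margins = (hidden.unsqueeze(0) + token_emb) @ w_toxic   # (vocab,)

    # Boost tokens that move away from the toxic region, then renormalize.
    return F.log_softmax(logits + beta * margins, dim=-1)
```

Sampling then proceeds from the adjusted distribution; because only the decoding step changes, the base model's weights stay frozen, matching the "lightweight controlled decoding" framing.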
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when they are exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting.
We propose a reinforcement learning (RL) based attacking method to further induce implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- Simple Text Detoxification by Identifying a Linear Toxic Subspace in Language Model Embeddings [8.720903734757627]
Large pre-trained language models are often trained on large volumes of internet data, some of which may contain toxic or abusive language.
Current methods aim to prevent toxic features from appearing in generated text.
We hypothesize the existence of a low-dimensional toxic subspace in the latent space of pre-trained language models (one way to estimate such a subspace is sketched after this entry).
arXiv Detail & Related papers (2021-12-15T18:54:34Z)
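A plausible concrete reading of that hypothesis (a sketch under assumed inputs: `toxic_embs` and `clean_embs` are precomputed sentence embeddings, and the centering and rank choices are illustrative, not the paper's exact pipeline):

```python
import numpy as np

def estimate_toxic_subspace(toxic_embs: np.ndarray,
                            clean_embs: np.ndarray,
                            k: int = 4) -> np.ndarray:
    """Return an orthonormal basis (d, k) of an estimated toxic subspace.

    toxic_embs: (n_t, d) embeddings of toxic sentences.
    clean_embs: (n_c, d) embeddings of non-toxic sentences.
    """
    # Center toxic embeddings on the non-toxic mean so generic-language
    # structure shared by both populations cancels out.
    centered = toxic_embs - clean_embs.mean(axis=0, keepdims=True)
    # Top-k right singular vectors span the dominant toxic directions.
    _, _, vh = np.linalg.svd(centered, full_matrices=False)
    return vh[:k].T

def detoxify(embedding: np.ndarray, basis: np.ndarray) -> np.ndarray:
    # Remove the embedding's component inside the toxic subspace.
    return embedding - basis @ (basis.T @ embedding)
```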
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language, which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration (the benchmark's headline metrics are sketched after this entry).
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
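For reference, the benchmark's two headline numbers can be computed from per-generation classifier scores; a small sketch (the score lists are assumed to come from an external toxicity classifier such as Perspective API, with k sampled continuations per prompt):

```python
import numpy as np

def expected_max_toxicity(scores_per_prompt: list[list[float]]) -> float:
    """Mean over prompts of the maximum toxicity score among the
    k sampled continuations for that prompt."""
    return float(np.mean([max(s) for s in scores_per_prompt]))

def toxicity_probability(scores_per_prompt: list[list[float]],
                         threshold: float = 0.5) -> float:
    """Fraction of prompts for which at least one continuation
    scores at or above the toxicity threshold."""
    return float(np.mean([max(s) >= threshold for s in scores_per_prompt]))
```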