Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation
- URL: http://arxiv.org/abs/2509.16660v1
- Date: Sat, 20 Sep 2025 12:21:52 GMT
- Title: Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation
- Authors: Zuhair Hasan Shaik, Abdullah Mazhar, Aseem Srivastava, Md Shad Akhtar
- Abstract summary: We investigate the stability of neuron-level toxicity indicators, the advantages of structural (layer-wise) representations, and the interpretability of mechanisms driving toxic generation. We propose a novel principled intervention technique, EigenShift, based on eigen-decomposition of the language model's final output layer.
- Score: 12.58703387927632
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models have demonstrated impressive fluency across diverse tasks, yet their tendency to produce toxic content remains a critical challenge for AI safety and public trust. Existing toxicity mitigation approaches primarily manipulate individual neuron activations, but these methods suffer from instability, context dependence, and often compromise the model's core language abilities. To address these shortcomings, we investigate three key questions: the stability of neuron-level toxicity indicators, the advantages of structural (layer-wise) representations, and the interpretability of mechanisms driving toxic generation. Through extensive experiments on Jigsaw and ToxiCN datasets, we show that aggregated layer-wise features provide more robust signals than single neurons. Moreover, we observe conceptual limitations in prior works that conflate toxicity detection experts and generation experts within neuron-based interventions. To mitigate this, we propose a novel principled intervention technique, EigenShift, based on eigen-decomposition of the language model's final output layer. This method selectively targets generation-aligned components, enabling precise toxicity suppression without impairing linguistic competence. Our method requires no additional training or fine-tuning, incurs minimal computational cost, and is grounded in rigorous theoretical analysis.
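To make the idea concrete, here is a minimal sketch of an EigenShift-style intervention, assuming the toxicity direction has already been estimated from toxic vs. non-toxic activations; an SVD stands in for the paper's eigen-decomposition (the output matrix is rectangular), and the component-selection rule is illustrative rather than the authors' exact method.

```python
# Sketch of an EigenShift-style edit to the final output (unembedding)
# layer. Assumptions: `toxic_dir` is estimated elsewhere; SVD stands in
# for the paper's eigen-decomposition; the selection rule is illustrative.
import torch

def eigenshift(W_out: torch.Tensor, toxic_dir: torch.Tensor,
               k: int = 10, alpha: float = 0.0) -> torch.Tensor:
    """W_out: (vocab_size, d_model) output matrix; toxic_dir: (d_model,)
    unit vector; k: components to suppress; alpha: 0.0 removes them."""
    U, S, Vh = torch.linalg.svd(W_out, full_matrices=False)
    align = (Vh @ toxic_dir).abs()        # alignment of each component
    top = torch.topk(align, k).indices    # most toxicity-aligned components
    S = S.clone()
    S[top] *= alpha                       # attenuate only those components
    return U @ torch.diag(S) @ Vh         # reassembled, edited layer
```

Because only the final output layer is rewritten, this kind of edit is consistent with the abstract's claims of requiring no retraining or fine-tuning and incurring minimal computational cost.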
Related papers
- Unveiling Covert Toxicity in Multimodal Data via Toxicity Association Graphs: A Graph-Based Metric and Interpretable Detection Framework [58.01529356381494]
We propose a novel detection framework based on Toxicity Association Graphs (TAGs). We introduce the first quantifiable metric for hidden toxicity, the Multimodal Toxicity Covertness (MTC). Our approach enables precise identification of covert toxicity while preserving full interpretability of the decision-making process.
arXiv Detail & Related papers (2026-02-03T08:54:25Z)
- Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models [14.566005698357747]
Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms. We introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect and correct toxic content. Our findings underscore the potential for truly self-regulated language models, paving the way for more responsible and ethically guided text generation systems.
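A detect-then-rewrite loop of this kind can be sketched as follows; `generate` is a hypothetical stand-in for any LLM completion call, and the prompts and stopping criterion are illustrative, not the paper's.

```python
# Sketch of a self-reflective detoxification loop. `generate` is a
# hypothetical LLM completion callable; prompts are illustrative.
def self_detoxify(generate, text: str, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        verdict = generate(
            "Does the following text contain toxic content? "
            f"Answer yes or no.\n\n{text}"
        )
        if verdict.strip().lower().startswith("no"):
            return text  # the model judges its own output as non-toxic
        text = generate(
            "Rewrite the following text to remove toxic content while "
            f"preserving its meaning:\n\n{text}"
        )
    return text
```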
arXiv Detail & Related papers (2026-01-16T21:01:26Z)
- Projecting Out the Malice: A Global Subspace Approach to LLM Detoxification [73.77171973106567]
Large language models (LLMs) exhibit exceptional performance but pose inherent risks of generating toxic content. Traditional methods fail to eliminate underlying toxic regions in parameters, leaving models vulnerable to adversarial attacks. We propose GLOSS, a lightweight method that mitigates toxicity by identifying and eliminating a global toxic subspace from FFN parameters.
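Mechanically, removing a subspace from weight matrices is a linear projection, as in the sketch below; the subspace basis is assumed given, and how GLOSS actually identifies it is not shown here.

```python
# Sketch of subspace removal from an FFN weight matrix: apply the
# complementary projector (I - P) on the input side. Estimating
# `toxic_basis` (the step GLOSS contributes) is assumed done elsewhere.
import torch

def project_out(W_ffn: torch.Tensor, toxic_basis: torch.Tensor) -> torch.Tensor:
    """W_ffn: (d_out, d_in) weight matrix;
    toxic_basis: (r, d_in) orthonormal basis of the toxic subspace."""
    P = toxic_basis.T @ toxic_basis  # projector onto the toxic subspace
    return W_ffn - W_ffn @ P         # equivalent to W_ffn @ (I - P)
```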
arXiv Detail & Related papers (2026-01-09T09:34:53Z)
- Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective [104.09817371557476]
Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks. Their potential to generate harmful content has raised serious safety concerns. We introduce three novel multi-label benchmarks for toxicity detection.
arXiv Detail & Related papers (2025-10-16T06:50:33Z)
- Toxicity in Online Platforms and AI Systems: A Survey of Needs, Challenges, Mitigations, and Future Directions [12.73085307172367]
The evolution of digital communication systems and the design of online platforms have inadvertently facilitated the subconscious propagation of toxic behavior. This survey develops a comprehensive taxonomy of toxicity from various perspectives. It presents a holistic approach to explaining toxicity by understanding the context and environment that society faces in the Artificial Intelligence era.
arXiv Detail & Related papers (2025-09-29T21:55:23Z)
- <think> So let's replace this phrase with insult... </think> Lessons learned from generation of toxic texts with LLMs [60.169913160819]
This paper explores the possibility of using synthetic toxic data as an alternative to human-generated data for training models for detoxification. Experiments show that models fine-tuned on synthetic data consistently perform worse than those trained on human data. The root cause is identified as a critical lexical diversity gap: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuances and variety of human toxicity.
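That diversity gap can be quantified with something as simple as a type-token ratio over each corpus, as in the illustrative snippet below (not the paper's exact analysis):

```python
# Illustrative lexical-diversity measure: a lower type-token ratio for
# synthetic data would reflect the small, repetitive insult vocabulary
# the paper identifies. Not the paper's exact metric.
def type_token_ratio(texts: list[str]) -> float:
    tokens = [w for t in texts for w in t.lower().split()]
    return len(set(tokens)) / max(len(tokens), 1)
```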
arXiv Detail & Related papers (2025-09-10T07:48:24Z)
- Redefining Toxicity: An Objective and Context-Aware Approach for Stress-Level-Based Detection [1.9424018922013224]
Most toxicity detection models treat toxicity as an intrinsic property of text, overlooking the role of context in shaping its impact. We reconceptualise toxicity as a socially emergent stress signal. We introduce a new framework for toxicity detection, including a formal definition and metric, and validate our approach on a novel dataset.
arXiv Detail & Related papers (2025-03-20T12:09:01Z)
- Aligned Probing: Relating Toxic Behavior and Model Internals [66.49887503194101]
We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs) with their internal representations. Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives on toxicity for the first time. Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers.
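A stripped-down version of layer-wise probing is sketched below: fit a linear probe per layer on pooled hidden states against toxicity scores and compare fits across layers. The probe family and pooling are assumptions; the paper's aligned-probing framework goes further.

```python
# Sketch of layer-wise toxicity probing with a plain linear probe
# (the paper's aligned-probing framework is richer than this).
import numpy as np
from sklearn.linear_model import Ridge

def layer_probe_r2(hidden_states: np.ndarray, toxicity: np.ndarray) -> float:
    """hidden_states: (n_texts, d_model) pooled states from one layer;
    toxicity: (n_texts,) scalar toxicity scores for the same texts."""
    probe = Ridge(alpha=1.0).fit(hidden_states, toxicity)
    return probe.score(hidden_states, toxicity)  # in-sample R^2 for the layer
```

Comparing this score layer by layer is one way to surface the "encoded particularly in lower layers" pattern the summary describes.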
arXiv Detail & Related papers (2025-03-17T17:23:50Z)
- Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs).
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance.
arXiv Detail & Related papers (2024-03-21T15:18:30Z)
- Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models [21.341749351654453]
The generation of toxic content by large language models (LLMs) remains a critical challenge for the safe deployment of language technology. We propose a novel framework for implicit knowledge editing and controlled text generation by fine-tuning LLMs with a prototype-based contrastive perplexity objective.
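One reading of such an objective, sketched below in a hinge form with hypothetical shapes and margin, is that it lowers the model's negative log-likelihood on non-toxic continuations while raising it on toxic ones; the paper's prototype-based formulation is not reproduced exactly.

```python
# Sketch of a contrastive perplexity-style loss in hinge form: prefer the
# non-toxic continuation by at least `margin` nats. The margin form and
# shapes are assumptions, not the paper's exact objective.
import torch.nn.functional as F

def contrastive_ppl_loss(logits_pos, labels_pos, logits_neg, labels_neg,
                         margin: float = 1.0):
    """logits_*: (batch, seq, vocab); labels_*: (batch, seq) token ids."""
    nll_pos = F.cross_entropy(logits_pos.transpose(1, 2), labels_pos)
    nll_neg = F.cross_entropy(logits_neg.transpose(1, 2), labels_neg)
    return F.relu(margin + nll_pos - nll_neg)  # 0 once the gap exceeds margin
```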
arXiv Detail & Related papers (2024-01-16T16:49:39Z)
- Toxicity Detection with Generative Prompt-based Inference [3.9741109244650823]
It is a long-known risk that language models (LMs), once trained on a corpus containing undesirable content, can manifest biases and toxicity.
In this work, we explore the generative variant of zero-shot prompt-based toxicity detection with comprehensive trials on prompt engineering.
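The core generative zero-shot recipe can be sketched as follows: pose detection as a question and compare the model's next-token preference for "Yes" versus "No". The prompt wording and the GPT-2 default are illustrative; the paper runs comprehensive trials over exactly such choices.

```python
# Sketch of zero-shot prompt-based toxicity detection: compare next-token
# scores for " Yes" vs. " No". Prompt wording and model are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def toxicity_probability(text: str, model_name: str = "gpt2") -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = f'Text: "{text}"\nQuestion: Is this text toxic? Answer:'
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # next-token distribution
    yes_no = torch.stack([logits[tok.encode(" Yes")[0]],
                          logits[tok.encode(" No")[0]]])
    return torch.softmax(yes_no, dim=0)[0].item()  # P("Yes" | prompt)
```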
arXiv Detail & Related papers (2022-05-24T22:44:43Z)
- Detoxifying Language Models with a Toxic Corpus [16.7345472998388]
We propose to use a toxic corpus as an additional resource to reduce toxicity.
Our results show that a toxic corpus can indeed substantially reduce the toxicity of the language generation process.
arXiv Detail & Related papers (2022-04-30T18:25:18Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language, which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
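The evaluation protocol can be sketched as below: sample several continuations per prompt and report the expected maximum toxicity across samples. The paper scores continuations with the Perspective API; the open-source Detoxify classifier is substituted here as an assumption.

```python
# Sketch of a RealToxicityPrompts-style evaluation: expected maximum
# toxicity over sampled continuations. Detoxify substitutes for the
# Perspective API scorer used in the paper.
from detoxify import Detoxify
from transformers import pipeline

def expected_max_toxicity(prompts: list[str], n_samples: int = 5) -> float:
    gen = pipeline("text-generation", model="gpt2")
    scorer = Detoxify("original")
    maxima = []
    for p in prompts:
        outs = gen(p, do_sample=True, num_return_sequences=n_samples,
                   max_new_tokens=20, return_full_text=False)
        scores = [scorer.predict(o["generated_text"])["toxicity"] for o in outs]
        maxima.append(max(scores))
    return sum(maxima) / len(maxima)
```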
arXiv Detail & Related papers (2020-09-24T03:17:19Z)