Concept-Based Interpretability for Toxicity Detection
- URL: http://arxiv.org/abs/2511.16689v1
- Date: Sat, 15 Nov 2025 14:53:23 GMT
- Title: Concept-Based Interpretability for Toxicity Detection
- Authors: Samarth Garg, Deeksha Varshney, Divya Singh
- Abstract summary: Disproportionate attribution of concepts towards the target class often results in classification errors. Our work introduces an interpretability technique based on the Concept Gradient (CG) method. We propose the curation of a Targeted Lexicon Set of toxic words that contribute to misclassifications in text classification models.
- Score: 3.66086430438787
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The rise of social networks has not only facilitated communication but also allowed the spread of harmful content. Although significant advances have been made in detecting toxic language in textual data, the exploration of concept-based explanations in toxicity detection remains limited. In this study, we leverage various subtype attributes present in toxicity detection datasets, such as obscene, threat, insult, identity attack, and sexual explicit, as concepts that serve as strong indicators of whether language is toxic. However, disproportionate attribution of concepts towards the target class often results in classification errors. Our work introduces an interpretability technique based on the Concept Gradient (CG) method, which provides a more causal interpretation by measuring how changes in concepts directly affect the model's output. This extends traditional gradient-based methods in machine learning, which often focus solely on input features. We propose the curation of a Targeted Lexicon Set, which captures toxic words that contribute to misclassifications in text classification models. To assess the significance of these lexicon sets in misclassification, we compute Word-Concept Alignment (WCA) scores, which quantify the extent to which these words lead to errors due to over-attribution to toxic concepts. Finally, we introduce a lexicon-free augmentation strategy that generates toxic samples excluding the predefined toxic lexicon sets. This approach allows us to examine whether over-attribution persists when explicit lexical overlap is removed, providing insight into the model's attribution to broader toxic language patterns.
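The paper's implementation is not included here, so the following is a minimal sketch, assuming a hypothetical concept-bottleneck-style classifier in PyTorch: an encoder embedding feeds a concept head for the five subtype attributes, the toxicity logit is read off the concept activations, and the Concept Gradient is the gradient of the toxicity output with respect to those activations. The model class, concept list, and the gradient-times-activation attribution rule are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of Concept Gradient (CG) attribution over toxicity
# subtype concepts. Architecture and attribution rule are assumptions
# for illustration; the paper does not publish its implementation.
import torch
import torch.nn as nn

CONCEPTS = ["obscene", "threat", "insult", "identity_attack", "sexual_explicit"]

class ConceptToxicityModel(nn.Module):
    """Toy model: sentence embedding -> subtype concept scores -> toxicity logit."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.concept_head = nn.Linear(dim, len(CONCEPTS))
        self.classifier = nn.Linear(len(CONCEPTS), 1)

    def forward(self, h: torch.Tensor):
        c = torch.sigmoid(self.concept_head(h))  # concept activations in [0, 1]
        y = self.classifier(c)                   # toxicity logit
        return y, c

def concept_attribution(model: nn.Module, h: torch.Tensor) -> torch.Tensor:
    """dy/dc scaled by activation: how much each concept pushes the output."""
    y, c = model(h)
    cg = torch.autograd.grad(y.sum(), c)[0]      # Concept Gradient, (batch, |C|)
    return cg * c                                # gradient x activation

model = ConceptToxicityModel()
h = torch.randn(1, 768)                          # stand-in sentence embedding
for name, score in zip(CONCEPTS, concept_attribution(model, h)[0].tolist()):
    print(f"{name:16s} {score:+.3f}")
```

On top of such per-example attributions, a Word-Concept Alignment-style score could be approximated by aggregating concept attributions over occurrences of each Targeted Lexicon Set word in misclassified versus correctly classified examples; the paper's exact WCA formula is not reproduced here.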
Related papers
- Unveiling Covert Toxicity in Multimodal Data via Toxicity Association Graphs: A Graph-Based Metric and Interpretable Detection Framework [58.01529356381494]
We propose a novel detection framework based on Toxicity Association Graphs (TAGs).
We introduce the first quantifiable metric for hidden toxicity, the Multimodal Toxicity Covertness (MTC).
Our approach enables precise identification of covert toxicity while preserving full interpretability of the decision-making process.
arXiv Detail & Related papers (2026-02-03T08:54:25Z)
- Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective [104.09817371557476]
Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks.
Their potential to generate harmful content has raised serious safety concerns.
We introduce three novel multi-label benchmarks for toxicity detection.
arXiv Detail & Related papers (2025-10-16T06:50:33Z)
- Mimicking How Humans Interpret Out-of-Context Sentences Through Controlled Toxicity Decoding [1.4963011898406866]
This paper aims to simulate how readers perceive content with varying toxicity levels by generating diverse interpretations of out-of-context sentences.
By modeling toxicity, we can anticipate misunderstandings and reveal hidden toxic meanings.
arXiv Detail & Related papers (2025-03-11T08:16:31Z)
- ToXCL: A Unified Framework for Toxic Speech Detection and Explanation [3.803993344850168]
ToXCL is a unified framework for the detection and explanation of implicit toxic speech.
ToXCL achieves new state-of-the-art effectiveness, and outperforms baselines significantly.
arXiv Detail & Related papers (2024-03-25T12:21:38Z)
- Language Model Classifier Aligns Better with Physician Word Sensitivity than XGBoost on Readmission Prediction [86.15787587540132]
We introduce sensitivity score, a metric that scrutinizes models' behaviors at the vocabulary level.
Our experiments compare the decision-making logic of clinicians and classifiers based on rank correlations of sensitivity scores.
arXiv Detail & Related papers (2022-11-13T23:59:11Z)
- Toxicity Detection with Generative Prompt-based Inference [3.9741109244650823]
It is a long-known risk that language models (LMs), once trained on corpora containing undesirable content, can manifest biases and toxicity.
In this work, we explore the generative variant of zero-shot prompt-based toxicity detection with comprehensive trials on prompt engineering.
arXiv Detail & Related papers (2022-05-24T22:44:43Z)
- LexSubCon: Integrating Knowledge from Lexical Resources into Contextual Embeddings for Lexical Substitution [76.615287796753]
We introduce LexSubCon, an end-to-end lexical substitution framework based on contextual embedding models.
This is achieved by combining contextual information with knowledge from structured lexical resources.
Our experiments show that LexSubCon outperforms previous state-of-the-art methods on LS07 and CoInCo benchmark datasets.
arXiv Detail & Related papers (2021-07-11T21:25:56Z)
- Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields lower false positive rate in both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.