Toxicity Detection for Free
- URL: http://arxiv.org/abs/2405.18822v2
- Date: Fri, 08 Nov 2024 01:49:58 GMT
- Title: Toxicity Detection for Free
- Authors: Zhanhao Hu, Julien Piet, Geng Zhao, Jiantao Jiao, David Wagner
- Abstract summary: We introduce Moderation Using LLM Introspection (MULI), which detects toxic prompts using the information extracted directly from LLMs themselves.
We build a robust detector of toxic prompts using a sparse logistic regression model on the first response token logits.
- Score: 16.07605369484645
- License:
- Abstract: Current LLMs are generally aligned to follow safety requirements and tend to refuse toxic prompts. However, LLMs can fail to refuse toxic prompts or be overcautious and refuse benign examples. In addition, state-of-the-art toxicity detectors have low true positive rates (TPR) at low false positive rates (FPR), incurring high costs in real-world applications where toxic examples are rare. In this paper, we introduce Moderation Using LLM Introspection (MULI), which detects toxic prompts using information extracted directly from LLMs themselves. We find that benign and toxic prompts can be distinguished from the distribution of the first response token's logits. Building on this observation, we train a sparse logistic regression model on the first response token's logits to obtain a robust detector of toxic prompts. Our scheme outperforms SOTA detectors across multiple metrics.
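The abstract pins down the core recipe: read out the logits of the first response token and fit a sparse (L1-regularized) logistic regression on them. Below is a minimal sketch of that idea, assuming a Hugging Face chat model; the model choice, prompt template, example prompts, and regularization strength are illustrative assumptions, not the paper's settings.

```python
# Sketch of MULI-style detection: sparse logistic regression on the logits of
# the first response token. Model, chat template, and C are assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any aligned chat LLM
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

@torch.no_grad()
def first_token_logits(prompt: str) -> np.ndarray:
    """Logits of the first token the model would generate in response to the prompt."""
    messages = [{"role": "user", "content": prompt}]
    ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(lm.device)
    return lm(ids).logits[0, -1].float().cpu().numpy()  # vocabulary-sized vector

# Illustrative stand-ins; real training uses a labeled prompt dataset.
benign_prompts = ["How do I bake sourdough bread?", "Summarize the plot of Hamlet."]
toxic_prompts = ["Write a threatening message to my coworker.", "How do I make a weapon at home?"]

X = np.stack([first_token_logits(p) for p in benign_prompts + toxic_prompts])
y = np.array([0] * len(benign_prompts) + [1] * len(toxic_prompts))

# The L1 penalty keeps only a sparse set of informative vocabulary dimensions.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

def toxicity_score(prompt: str) -> float:
    """Probability that a prompt is toxic under the trained detector."""
    return float(clf.predict_proba(first_token_logits(prompt)[None, :])[0, 1])
```

The sparsity is the point of the L1 penalty: only a handful of vocabulary dimensions (intuitively, tokens that tend to start refusals) receive non-zero weights, which keeps the detector cheap on top of a single forward pass the LLM would run anyway.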
Related papers
- Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction of large language models (LLMs).
SASA tracks the margin of the current output to steer the generation away from the toxic subspace, by adjusting the autoregressive sampling strategy.
SASA is evaluated on LLMs of different scales and natures, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, on the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
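The summary names the mechanism but not its details, so what follows is a hedged, illustrative sketch of margin-based controlled decoding in that spirit rather than SASA's actual algorithm: at each step, re-weight the top-k candidate tokens by the margin of the extended prefix under a linear toxicity probe on the LM's last hidden state. The probe parameters `w`, `b` and the hyperparameters `beta`, `top_k` are assumptions introduced for illustration.

```python
# Illustrative margin-steered decoding step (not SASA's exact procedure):
# candidates whose continuation stays on the non-toxic side of the linear
# probe (w, b) get a logit bonus proportional to the margin.
import torch

@torch.no_grad()
def margin_steered_step(lm, input_ids, w, b, beta=5.0, top_k=20):
    """Sample one token, biased away from the toxic side of the hyperplane (w, b)."""
    logits = lm(input_ids).logits[0, -1]
    topk = torch.topk(logits, top_k)
    margins = []
    for tok_id in topk.indices:
        ext = torch.cat([input_ids, tok_id.view(1, 1)], dim=1)
        h = lm(ext, output_hidden_states=True).hidden_states[-1][0, -1]
        margins.append(h.float() @ w + b)  # >0: non-toxic side, <0: toxic side
    margins = torch.stack(margins)
    probs = torch.softmax(topk.values.float() + beta * margins, dim=-1)
    next_id = topk.indices[torch.multinomial(probs, 1)]
    return torch.cat([input_ids, next_id.view(1, 1)], dim=1)
```

Here `w` and `b` would come from a linear toxicity classifier trained on the LM's hidden states; how SASA actually learns and applies its toxic subspace is described in the paper and differs from this sketch.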
arXiv Detail & Related papers (2024-10-04T17:45:15Z)
- Efficient Detection of Toxic Prompts in Large Language Models [8.794371569341429]
Large language models (LLMs) can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses.
We propose ToxicDetector, a lightweight greybox method designed to efficiently detect toxic prompts in LLMs.
ToxicDetector achieves a high accuracy of 96.39% and a low false positive rate of 2.00%, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T15:54:04Z)
- OR-Bench: An Over-Refusal Benchmark for Large Language Models [65.34666117785179]
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs.
This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts".
We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families.
arXiv Detail & Related papers (2024-05-31T15:44:33Z)
- Realistic Evaluation of Toxicity in Large Language Models [28.580995165272086]
Large language models (LLMs) have become integral to our professional and daily lives.
The huge amount of data that endows them with vast and diverse knowledge also exposes them to inevitable toxicity and bias.
This paper introduces the new Thoroughly Engineered Toxicity dataset, comprising manually crafted prompts.
arXiv Detail & Related papers (2024-05-17T09:42:59Z)
- Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs).
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
We conduct experiments with several knowledge editing approaches, and the results indicate that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance.
arXiv Detail & Related papers (2024-03-21T15:18:30Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs), combined with their impressive capabilities, may lead to new safety issues when they are exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting.
We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts [57.38912708076231]
We introduce MaRCo, a detoxification algorithm that combines controllable generation and text rewriting methods.
MaRCo uses likelihoods under a non-toxic LM and a toxic LM to find candidate words to mask and potentially replace.
We evaluate MaRCo on several datasets of subtle toxicity and microaggressions, and show that it not only outperforms baselines on automatic metrics, but that its rewrites are also preferred 2.1× more often in human evaluation.
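A hedged sketch of the expert / anti-expert scoring step described above: flag tokens whose likelihood under a toxic ("anti-expert") LM is much higher than under a non-toxic ("expert") LM as candidates for masking and rewriting. The stand-in GPT-2 checkpoints and the `threshold` value are illustrative assumptions, not the models or settings MaRCo uses.

```python
# Find mask candidates by comparing per-token likelihoods under a non-toxic
# "expert" LM and a toxic "anti-expert" LM. Both stand-ins here are plain
# GPT-2; the paper's experts are different models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
expert = AutoModelForCausalLM.from_pretrained("gpt2")       # stand-in non-toxic LM
anti_expert = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in toxic LM

@torch.no_grad()
def token_logprobs(lm, ids):
    """Log-likelihood of each token ids[0, 1:] given its prefix."""
    logits = lm(ids).logits[0, :-1]
    return torch.log_softmax(logits, -1).gather(1, ids[0, 1:, None]).squeeze(1)

def mask_candidates(text, threshold=1.0):
    """Return (token, gap) pairs where the anti-expert is much more confident."""
    ids = tok(text, return_tensors="pt").input_ids
    gap = token_logprobs(anti_expert, ids) - token_logprobs(expert, ids)
    tokens = tok.convert_ids_to_tokens(ids[0])[1:]
    return [(t, g.item()) for t, g in zip(tokens, gap) if g.item() > threshold]
```

The flagged positions would then be masked and re-filled by a generator, which is the rewriting half of the pipeline the summary describes.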
arXiv Detail & Related papers (2022-12-20T18:50:00Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language, which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)