Enhancing LLM-based Hatred and Toxicity Detection with Meta-Toxic Knowledge Graph
- URL: http://arxiv.org/abs/2412.15268v2
- Date: Tue, 24 Dec 2024 04:38:57 GMT
- Title: Enhancing LLM-based Hatred and Toxicity Detection with Meta-Toxic Knowledge Graph
- Authors: Yibo Zhao, Jiapeng Zhu, Can Xu, Xiang Li
- Abstract summary: The absence of domain-specific toxic knowledge leads to false negatives.
The excessive sensitivity of Large Language Models to toxic speech results in false positives.
We propose a novel method called MetaTox, leveraging graph search on a meta-toxic knowledge graph to enhance hatred and toxicity detection.
- Score: 36.07351851458233
- License:
- Abstract: The rapid growth of social media platforms has raised significant concerns regarding online content toxicity. When Large Language Models (LLMs) are used for toxicity detection, two key challenges emerge: 1) the absence of domain-specific toxic knowledge leads to false negatives; 2) the excessive sensitivity of LLMs to toxic speech results in false positives, limiting freedom of speech. To address these issues, we propose a novel method called MetaTox, leveraging graph search on a meta-toxic knowledge graph to enhance hatred and toxicity detection. First, we construct a comprehensive meta-toxic knowledge graph by utilizing LLMs to extract toxic information through a three-step pipeline, with toxic benchmark datasets serving as corpora. Second, we query the graph via retrieval and ranking processes to supplement accurate, relevant toxic knowledge. Extensive experiments and in-depth case studies across multiple datasets demonstrate that our MetaTox significantly decreases the false positive rate while boosting overall toxicity detection performance. Our code will be available soon.
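The abstract describes the query stage only at a high level (retrieval plus ranking over the graph, then supplementing the detection prompt with the retrieved knowledge). The snippet below is a minimal Python sketch of that retrieve-and-rank pattern, not MetaTox's actual pipeline: the toy triples, the token-overlap ranking, and the helper names (Triple, retrieve, build_prompt) are all assumptions made for illustration.
```python
from dataclasses import dataclass

# Toy triple store standing in for the meta-toxic knowledge graph.
@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str

TOXIC_KG = [
    Triple("dog whistle", "signals", "coded hate speech"),
    Triple("slur reclamation", "may_indicate", "in-group usage, not toxicity"),
    Triple("dehumanizing metaphor", "targets", "protected group"),
]

def retrieve(post: str, kg: list[Triple], top_k: int = 2) -> list[Triple]:
    """Retrieve and rank triples by naive token overlap with the post."""
    tokens = set(post.lower().split())
    scored = []
    for t in kg:
        overlap = len(tokens & set(f"{t.head} {t.relation} {t.tail}".lower().split()))
        if overlap:
            scored.append((overlap, t))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [t for _, t in scored[:top_k]]

def build_prompt(post: str, evidence: list[Triple]) -> str:
    """Assemble a detection prompt that supplements the post with retrieved knowledge."""
    lines = [f"({t.head}, {t.relation}, {t.tail})" for t in evidence]
    knowledge = "\n".join(lines) if lines else "(no relevant toxic knowledge found)"
    return (
        "Relevant toxic knowledge:\n" + knowledge +
        f"\n\nPost: {post}\nIs this post toxic? Answer yes or no with a brief rationale."
    )

if __name__ == "__main__":
    post = "That metaphor is a classic dog whistle"
    print(build_prompt(post, retrieve(post, TOXIC_KG)))
```
In the paper, the assembled prompt would be passed to the detecting LLM; the retrieved triples supply the domain-specific toxic knowledge whose absence otherwise causes false negatives.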
Related papers
- Efficient Detection of Toxic Prompts in Large Language Models [8.794371569341429]
Large language models (LLMs) can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses.
We propose ToxicDetector, a lightweight greybox method designed to efficiently detect toxic prompts in LLMs.
ToxicDetector achieves a high accuracy of 96.39% and a low false positive rate of 2.00%, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T15:54:04Z)
- Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs).
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance.
arXiv Detail & Related papers (2024-03-21T15:18:30Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting.
We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content [13.600755614321493]
We investigate how we can use large language models (LLMs) to tackle the problem of toxic content online.
We focus on three tasks: 1) Toxicity Classification, 2) Toxic Span Detection, and 3) Detoxification.
We find that prompt learning achieves around 10% improvement in the toxicity classification task compared to the baselines.
arXiv Detail & Related papers (2023-08-10T14:14:13Z)
- Facilitating Fine-grained Detection of Chinese Toxic Language: Hierarchical Taxonomy, Resources, and Benchmarks [18.44630180661091]
Existing datasets lack fine-grained annotation of toxic types and expressions.
It is crucial to introduce lexical knowledge to detect the toxicity of posts.
In this paper, we facilitate the fine-grained detection of Chinese toxic language.
arXiv Detail & Related papers (2023-05-08T03:50:38Z)
- Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts [57.38912708076231]
We introduce MaRCo, a detoxification algorithm that combines controllable generation and text rewriting methods.
MaRCo uses likelihoods under a non-toxic LM and a toxic LM to find candidate words to mask and potentially replace.
We evaluate our method on several subtle toxicity and microaggression datasets, and show that it not only outperforms baselines on automatic metrics, but MaRCo's rewrites are also preferred 2.1 times more often in human evaluation.
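As a rough illustration of the expert/anti-expert idea summarized above, the sketch below scores each token under a stand-in "non-toxic" scorer and a stand-in "toxic" scorer, and masks tokens the toxic scorer prefers by a wide margin. The toy lexicon, threshold, and function names are assumptions made so the sketch runs without model checkpoints; MaRCo itself uses full language-model likelihoods, and a subsequent rewriting step (not shown) would replace the masked words.
```python
import math

# Stand-in per-token scorers. In MaRCo these roles are played by a non-toxic
# "expert" LM and a toxic "anti-expert" LM; a toy lexicon is used here instead.
TOXIC_LEXICON = {"idiot", "trash"}

def logp_nontoxic(token: str) -> float:
    return math.log(0.01 if token in TOXIC_LEXICON else 0.2)

def logp_toxic(token: str) -> float:
    return math.log(0.3 if token in TOXIC_LEXICON else 0.1)

def mask_candidates(text: str, threshold: float = 1.0) -> list[str]:
    """Mask tokens whose toxic-vs-non-toxic log-likelihood gap exceeds the threshold."""
    masked = []
    for tok in text.split():
        gap = logp_toxic(tok.lower()) - logp_nontoxic(tok.lower())
        masked.append("[MASK]" if gap > threshold else tok)
    return masked

if __name__ == "__main__":
    # The masked text would then be handed to a generation model for infilling.
    print(" ".join(mask_candidates("your idea is trash and you are an idiot")))
```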
arXiv Detail & Related papers (2022-12-20T18:50:00Z)
- Toxicity Detection can be Sensitive to the Conversational Context [64.28043776806213]
We construct and publicly release a dataset of 10,000 posts with two kinds of toxicity labels.
We introduce a new task, context sensitivity estimation, which aims to identify posts whose perceived toxicity changes if the context is also considered.
arXiv Detail & Related papers (2021-11-19T13:57:26Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)