LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification
- URL: http://arxiv.org/abs/2506.01484v2
- Date: Fri, 06 Jun 2025 06:45:27 GMT
- Title: LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification
- Authors: Shuzhou Yuan, Ercong Nie, Lukas Kouba, Ashish Yashwanth Kangen, Helmut Schmid, Hinrich Schütze, Michael Färber
- Abstract summary: High-quality parallel datasets for detoxification, especially for hate speech, remain scarce due to the cost and sensitivity of human annotation. We propose a novel LLM-in-the-loop pipeline leveraging GPT-4o-mini for automated detoxification. We release ParaDeHate as a benchmark of over 8K hate/non-hate text pairs and evaluate a wide range of baseline methods. Experimental results show that models such as BART, fine-tuned on ParaDeHate, achieve better performance in style accuracy, content preservation, and fluency.
- Score: 44.86106619757571
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detoxification, the task of rewriting harmful language into non-toxic text, has become increasingly important amid the growing prevalence of toxic content online. However, high-quality parallel datasets for detoxification, especially for hate speech, remain scarce due to the cost and sensitivity of human annotation. In this paper, we propose a novel LLM-in-the-loop pipeline leveraging GPT-4o-mini for automated detoxification. We first replicate the ParaDetox pipeline by replacing human annotators with an LLM and show that the LLM performs comparably to human annotation. Building on this, we construct ParaDeHate, a large-scale parallel dataset specifically for hate speech detoxification. We release ParaDeHate as a benchmark of over 8K hate/non-hate text pairs and evaluate a wide range of baseline methods. Experimental results show that models such as BART, fine-tuned on ParaDeHate, achieve better performance in style accuracy, content preservation, and fluency, demonstrating the effectiveness of LLM-generated detoxification text as a scalable alternative to human annotation.
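The core of this pipeline is replacing ParaDetox's human rewriting step with a single LLM call per toxic input, followed by automatic filtering. Below is a minimal sketch of that rewriting step, assuming an OpenAI-compatible client; the prompt wording, decoding settings, and filtering criteria are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the LLM-in-the-loop rewriting step (GPT-4o-mini is the
# model named in the abstract; everything else here is assumed).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DETOX_PROMPT = (
    "Rewrite the following text so that it is non-toxic while preserving "
    "the original meaning as closely as possible. Return only the rewrite.\n\n"
    "Text: {text}"
)

def detoxify(text: str, model: str = "gpt-4o-mini") -> str:
    """One rewriting pass: ask the model for a non-toxic paraphrase."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": DETOX_PROMPT.format(text=text)}],
        temperature=0.0,  # deterministic rewrites aid reproducibility
    )
    return response.choices[0].message.content.strip()
```

In the full pipeline, each (toxic, rewrite) pair would additionally pass toxicity and content-preservation checks before entering the dataset; the resulting parallel pairs can then be used for standard seq2seq fine-tuning of models such as BART.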
Related papers
- Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing [49.85884082568318]
ToxEdit is a toxicity-aware knowledge editing approach. It dynamically detects toxic activation patterns during forward propagation, then routes computations through adaptive inter-layer pathways to mitigate toxicity effectively.
arXiv Detail & Related papers (2025-05-28T12:37:06Z) - SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators [61.82799141938912]
Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. We introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset.
arXiv Detail & Related papers (2025-02-10T12:30:25Z) - Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data [14.5729517924905]
This work presents a fine-tuning method that uses only non-parallel data to turn large language models (LLMs) into detoxification rewriters.
Experiments indicate that the SRO-fine-tuned LLM achieves performance comparable to state-of-the-art models in style accuracy, content similarity, and fluency.
arXiv Detail & Related papers (2024-10-27T00:39:54Z) - Toxic Subword Pruning for Dialogue Response Generation on Large Language Models [51.713448010799986]
We propose Toxic Subword Pruning (ToxPrune), which prunes the subwords contained in toxic words from the BPE vocabulary of trained LLMs.
ToxPrune also clearly improves the toxic language model NSFW-3B on the task of dialogue response generation (a rough decoding-time analogue is sketched after this list).
arXiv Detail & Related papers (2024-10-05T13:30:33Z) - Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models [26.474136481185724]
We propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost.
FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt.
We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels.
arXiv Detail & Related papers (2024-02-23T09:04:48Z) - CMD: a framework for Context-aware Model self-Detoxification [22.842468869653818]
Text detoxification aims to minimize the risk of language models producing toxic content.
Existing detoxification methods fail to achieve a decent balance between detoxification effectiveness and generation quality.
We introduce a Context-aware Model self-Detoxification (CMD) framework that attends to both the context and the detoxification process.
arXiv Detail & Related papers (2023-08-16T11:50:38Z) - Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts [57.38912708076231]
We introduce MaRCo, a detoxification algorithm that combines controllable generation and text rewriting methods.
MaRCo uses likelihoods under a non-toxic LM and a toxic LM to find candidate words to mask and potentially replace.
We evaluate our method on several subtle toxicity and microaggression datasets and show that it not only outperforms baselines on automatic metrics, but that MaRCo's rewrites are also preferred 2.1× more often in human evaluation.
arXiv Detail & Related papers (2022-12-20T18:50:00Z) - Language Detoxification with Attribute-Discriminative Latent Space [59.167432249229584]
Transformer-based Language Models (LMs) have achieved impressive results on natural language understanding tasks.
They can also generate toxic text such as insults, threats, and profanity, limiting their real-world applications.
We propose an effective yet efficient method for language detoxification using an attribute-discriminative latent space.
arXiv Detail & Related papers (2022-10-19T06:54:42Z)
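The ToxPrune entry above removes toxic subwords from a trained LLM's BPE vocabulary. As a rough, hedged analogue, the sketch below only bans the tokenizations of an illustrative toxic word list at decoding time via Hugging Face's bad_words_ids; the model and word list are placeholders, and this is not the paper's method.

```python
# Decoding-time approximation of subword pruning: ban toxic subword
# sequences during generation rather than removing them from the vocabulary.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not one evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

toxic_words = ["idiot", "stupid"]  # illustrative list only
# BPE tokenizers encode "word" and " word" differently, so ban both variants.
bad_words_ids = [
    tokenizer(variant, add_special_tokens=False).input_ids
    for w in toxic_words
    for variant in (w, " " + w)
]

inputs = tokenizer("The driver in front of me is such a", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    bad_words_ids=bad_words_ids,  # decoding-time ban, not true vocab pruning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```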