So let's replace this phrase with insult... Lessons learned from generation of toxic texts with LLMs
- URL: http://arxiv.org/abs/2509.08358v1
- Date: Wed, 10 Sep 2025 07:48:24 GMT
- Title: <think> So let's replace this phrase with insult... </think> Lessons learned from generation of toxic texts with LLMs
- Authors: Sergey Pletenev, Daniil Moskovskiy, Alexander Panchenko
- Abstract summary: This paper explores the possibility of using synthetic toxic data as an alternative to human-generated data for training models for detoxification. Experiments show that models fine-tuned on synthetic data consistently perform worse than those trained on human data. The root cause is identified as a critical lexical diversity gap: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuances and variety of human toxicity.
- Score: 60.169913160819
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Modern Large Language Models (LLMs) are excellent at generating synthetic data. However, their performance in sensitive domains such as text detoxification has not received proper attention from the scientific community. This paper explores the possibility of using LLM-generated synthetic toxic data as an alternative to human-generated data for training models for detoxification. Using activation-patched Llama 3 and Qwen models, we generated synthetic toxic counterparts for neutral texts from the ParaDetox and SST-2 datasets. Our experiments show that models fine-tuned on synthetic data consistently perform worse than those trained on human data, with a drop in performance of up to 30% in joint metrics. The root cause is identified as a critical lexical diversity gap: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuances and variety of human toxicity. These findings highlight the limitations of current LLMs in this domain and emphasize the continued importance of diverse, human-annotated data for building robust detoxification systems.
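The lexical diversity gap described above is the kind of property that can be quantified with a distinct-n statistic (the fraction of n-grams in a corpus that are unique). Below is a minimal sketch of such a measurement; the whitespace tokenization, the tiny corpora, and the variable names are illustrative assumptions, not the authors' evaluation code.

```python
# Minimal sketch: compare lexical diversity (distinct-n) of a human-written
# toxic corpus vs. an LLM-generated one. Higher values = more diverse.
from collections import Counter


def distinct_n(texts: list[str], n: int = 1) -> float:
    """Fraction of unique n-grams among all n-grams in the corpus."""
    ngrams = Counter()
    for text in texts:
        tokens = text.lower().split()  # whitespace tokenization, for simplicity
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0


# Hypothetical stand-ins: in the paper's setting these would be human toxic
# texts (e.g. ParaDetox sources) and the LLM-generated toxic counterparts.
human_toxic = ["you are a complete idiot", "what a pathetic excuse for advice"]
synthetic_toxic = ["you are an idiot", "you are such a stupid idiot"]

for name, corpus in [("human", human_toxic), ("synthetic", synthetic_toxic)]:
    print(name, round(distinct_n(corpus, 1), 3), round(distinct_n(corpus, 2), 3))
```

A repetitive insult vocabulary shows up as depressed distinct-1/distinct-2 scores on the synthetic side, which is the pattern the abstract reports at corpus scale.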
Related papers
- Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models [14.566005698357747]
Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms. We introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect and correct toxic content. Our findings underscore the potential for truly self-regulated language models, paving the way for more responsible and ethically guided text generation systems.
arXiv Detail & Related papers (2026-01-16T21:01:26Z)
- Something Just Like TRuST : Toxicity Recognition of Span and Target [2.4169078025984825]
This paper introduces TRuST, a comprehensive dataset designed to improve toxicity detection. We benchmark state-of-the-art large language models (LLMs) on toxicity detection, target group identification, and toxic span extraction. We find that fine-tuned models consistently outperform zero-shot and few-shot prompting, though performance remains low for certain social groups.
arXiv Detail & Related papers (2025-06-02T23:48:16Z)
- LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification [44.86106619757571]
High-quality parallel datasets for detoxification, especially for hate speech, remain scarce due to the cost and sensitivity of human annotation. We propose a novel LLM-in-the-loop pipeline leveraging GPT-4o-mini for automated detoxification. We release ParaDeHate as a benchmark of over 8K hate/non-hate text pairs and evaluate a wide range of baseline methods. Experimental results show that models such as BART, fine-tuned on ParaDeHate, achieve better performance in style accuracy, content preservation, and fluency.
arXiv Detail & Related papers (2025-06-02T09:45:05Z)
- GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace [62.68664365246247]
This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs). We propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the FFN parameters (the core projection step is sketched after this list).
arXiv Detail & Related papers (2025-05-20T08:29:11Z)
- SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators [61.82799141938912]
Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. We introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset.
arXiv Detail & Related papers (2025-02-10T12:30:25Z)
- Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled-decoding algorithm for toxicity reduction in large language models (LLMs). SASA tracks the margin of the current output to steer the generation away from the toxic subspace by adjusting the autoregressive sampling strategy (a toy sketch of this idea appears after this list). It is evaluated on LLMs of different scales and natures, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, with the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
arXiv Detail & Related papers (2024-10-04T17:45:15Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs), combined with their impressive capabilities, may lead to new safety issues when they are exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via zero-shot prompting alone.
We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language, which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
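The GloSS entry above describes removing a toxic subspace from FFN parameters. The sketch below shows only the core projection step, under two simplifying assumptions that are mine rather than the paper's: the "subspace" is a single unit-norm direction v, and the weight matrix is a random stand-in.

```python
# Toy sketch (not the GloSS implementation): project a hypothetical toxic
# direction v out of an FFN down-projection matrix, so the layer can no
# longer write output components along v.
import numpy as np

rng = np.random.default_rng(0)

d_ffn, d_model = 16, 8
W_down = rng.normal(size=(d_ffn, d_model))  # stand-in FFN down-projection
v = rng.normal(size=d_model)                # hypothetical toxic direction
v /= np.linalg.norm(v)                      # unit norm, so projection is v v^T

# Rank-1 removal: subtract each row's component along v.
W_clean = W_down - np.outer(W_down @ v, v)

print("max residual along v:", np.abs(W_clean @ v).max())  # ~0 after removal
```

The same idea extends to a k-dimensional subspace with an orthonormal basis V by subtracting W V V^T; GloSS additionally has to identify that subspace, which this sketch takes as given.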
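The SASA entry describes margin-based controlled decoding; here is the toy sketch referenced there. It replaces SASA's learned toxicity classifier with a random linear probe over random token embeddings, so it only illustrates the shape of the idea: before sampling, shift logits away from tokens that fall on the toxic side of a linear boundary.

```python
# Toy sketch (not the authors' SASA code): margin-steered sampling with a
# made-up linear probe standing in for a learned toxicity classifier.
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden = 10, 8
token_embeddings = rng.normal(size=(vocab_size, hidden))  # stand-in embeddings
probe_w = rng.normal(size=hidden)  # linear probe: positive margin = toxic side


def steered_sample(logits: np.ndarray, beta: float = 5.0) -> int:
    """Sample one token after down-weighting toxic-side tokens.

    beta controls steering strength; tokens with margin <= 0 are untouched.
    """
    margins = token_embeddings @ probe_w           # per-token toxicity margin
    adjusted = logits - beta * np.maximum(margins, 0.0)
    probs = np.exp(adjusted - adjusted.max())      # stable softmax
    probs /= probs.sum()
    return int(rng.choice(vocab_size, p=probs))


print("sampled token id:", steered_sample(rng.normal(size=vocab_size)))
```

The real method conditions the margin on the partial generation and works in the model's own representation space; this sketch keeps only the sampling adjustment.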