Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon
- URL: http://arxiv.org/abs/2505.22184v2
- Date: Thu, 05 Jun 2025 04:47:25 GMT
- Title: Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon
- Authors: Xuchen Ma, Jianxiang Yu, Wenming Shao, Bo Pang, Xiang Li
- Abstract summary: Social media platforms have experienced a significant rise in toxic content, including abusive language and discriminatory remarks. Existing methods are mostly designed for English texts, while Chinese cloaked toxicity unveiling has not been solved yet. We propose C$^2$TU, a novel training-free and prompt-free method for Chinese cloaked toxic content unveiling.
- Score: 10.538492229433409
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social media platforms have experienced a significant rise in toxic content, including abusive language and discriminatory remarks, presenting growing challenges for content moderation. Some users evade censorship by deliberately disguising toxic words through homophonic cloaking, which necessitates the task of unveiling cloaked toxicity. Existing methods are mostly designed for English texts, while Chinese cloaked toxicity unveiling remains unsolved. To tackle the issue, we propose C$^2$TU, a novel training-free and prompt-free method for Chinese cloaked toxic content unveiling. It first employs substring matching to identify candidate toxic words based on a Chinese homophone graph and toxic lexicon. Then it filters out non-toxic candidates and restores cloaked words to their toxic originals. Specifically, we develop two model variants for filtering, based on BERT and LLMs, respectively. For LLMs, we address the auto-regressive limitation in computing word occurrence probabilities and utilize the full semantic context of a text sequence to reveal cloaked toxic words. Extensive experiments demonstrate that C$^2$TU achieves superior performance on two Chinese toxicity datasets. In particular, our method outperforms the best competitor by up to 71% in F1 score and 35% in accuracy. Our code and data are available at https://github.com/XDxc-cuber/C2TU-Chinese-cloaked-toxicity-unveiling.
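The abstract describes a two-stage pipeline: homophone-aware substring matching to propose candidates, followed by model-based filtering. Below is a minimal sketch of the candidate-matching stage; the homophone groups, the lexicon entry (a harmless placeholder standing in for a real toxic word), and all names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of cloaked-toxicity candidate extraction via
# substring matching over a homophone graph and a toxic lexicon.
# All data below are toy placeholders, NOT the authors' code or data.

# Toy homophone groups: characters sharing a pronunciation.
# A real system would derive these from a pinyin dictionary.
HOMOPHONE_GROUPS = [
    {"苹", "平", "评"},  # píng
    {"果", "裹"},        # guǒ
]

# Placeholder "toxic" lexicon entry ("苹果" = apple, harmless stand-in).
TOXIC_LEXICON = {"苹果"}

# Map each character to the set of characters it may stand in for.
equiv = {}
for group in HOMOPHONE_GROUPS:
    for ch in group:
        equiv[ch] = group

def chars_match(observed: str, target: str) -> bool:
    """True if `observed` could be a homophonic cloak of `target`."""
    return observed == target or target in equiv.get(observed, ())

def candidate_spans(text: str):
    """Substring matching: yield (start, end, lexicon_word) for every
    span that maps, character by character, onto a lexicon word."""
    for word in TOXIC_LEXICON:
        n = len(word)
        for i in range(len(text) - n + 1):
            span = text[i:i + n]
            if all(chars_match(s, t) for s, t in zip(span, word)):
                yield i, i + n, word

print(list(candidate_spans("这个平果真差")))
# -> [(2, 4, '苹果')]  (a cloaked candidate for the filtering stage)
```

The filtering stage would then score each surviving candidate in context (for example, by comparing sequence probabilities with the cloaked span versus its lexicon original under BERT or an LLM) and keep only spans the model judges toxic; per the abstract, the LLM variant works around the auto-regressive restriction so that context on both sides of the span informs the decision.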
Related papers
- Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings [48.841514684592426]
We highlight the multimodal nature of the Chinese language as a key challenge for deploying language models in toxic Chinese detection. First, we propose a taxonomy of 3 perturbation strategies and 8 specific approaches in toxic Chinese content. Then, we curate a dataset based on this taxonomy and benchmark 9 SOTA LLMs (from both the US and China) to assess whether they can detect perturbed toxic Chinese text.
arXiv Detail & Related papers (2025-05-30T08:32:45Z)
- Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites [39.3555146467512]
ToxiRewriteCN is the first Chinese dataset explicitly designed to preserve sentiment polarity. It comprises 1,556 carefully annotated triplets, each containing a toxic sentence, a sentiment-aligned non-toxic rewrite, and labeled toxic spans. It covers five real-world scenarios: standard expressions, emoji-induced and homophonic toxicity, as well as single-turn and multi-turn dialogues.
arXiv Detail & Related papers (2025-05-21T09:27:18Z)
- Multilingual and Explainable Text Detoxification with Parallel Corpora [58.83211571400692]
We extend a parallel text detoxification corpus to new languages. We conduct a first-of-its-kind automated, explainable analysis of the descriptive features of both toxic and non-toxic sentences. We then experiment with a novel text detoxification method inspired by the Chain-of-Thought reasoning approach.
arXiv Detail & Related papers (2024-12-16T12:08:59Z)
- Toxic Subword Pruning for Dialogue Response Generation on Large Language Models [51.713448010799986]
We propose Toxic Subword Pruning (ToxPrune) to prune subwords contained in toxic words from the BPE vocabulary of trained LLMs.
ToxPrune simultaneously and noticeably improves the toxic language model NSFW-3B on the task of dialogue response generation; a rough sketch of the subword-pruning idea appears after this list.
arXiv Detail & Related papers (2024-10-05T13:30:33Z)
- FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts [13.470734853274587]
Large language models (LLMs) are increasingly popular but are also prone to generating biased, toxic, or harmful language.
We create and release FrenchToxicityPrompts, a dataset of 50K naturally occurring French prompts.
We evaluate 14 models from four prevalent open-source LLM families against our dataset to assess their potential toxicity.
arXiv Detail & Related papers (2024-06-25T14:02:11Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting.
We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- Facilitating Fine-grained Detection of Chinese Toxic Language: Hierarchical Taxonomy, Resources, and Benchmarks [18.44630180661091]
Existing datasets lack fine-grained annotation of toxic types and expressions.
It is crucial to introduce lexical knowledge to detect the toxicity of posts.
In this paper, we facilitate the fine-grained detection of Chinese toxic language.
arXiv Detail & Related papers (2023-05-08T03:50:38Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study the output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
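As a rough illustration of the subword-pruning idea behind ToxPrune (referenced in the list above), one way to approximate removing toxic subwords from a trained model's BPE vocabulary is to ban their token ids at decode time. The model name, the placeholder word list, and the exact pruning criterion below are assumptions for demonstration, not the paper's method.

```python
# Sketch only: approximating toxic-subword pruning by banning token ids
# during generation. Model, word list, and criterion are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not the one used in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.generation_config.pad_token_id = tok.eos_token_id

toxic_words = ["idiot", "stupid"]  # placeholder lexicon

# Collect every vocabulary id whose subword occurs in a toxic word,
# including the leading-space variants GPT-2's BPE distinguishes.
banned_ids = sorted({
    tid
    for word in toxic_words
    for form in (word, " " + word)
    for tid in tok(form, add_special_tokens=False)["input_ids"]
})

# Banning these ids everywhere approximates deleting them from BPE.
inputs = tok("You are such an", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=20,
    bad_words_ids=[[tid] for tid in banned_ids],
    do_sample=False,
)
print(tok.decode(out[0], skip_special_tokens=True))
```

Banning single-token ids at decode time is a blunt approximation; actually pruning BPE subwords, as the paper's title suggests, would change tokenization itself rather than only constraining decoding.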