Something Just Like TRuST : Toxicity Recognition of Span and Target
- URL: http://arxiv.org/abs/2506.02326v1
- Date: Mon, 02 Jun 2025 23:48:16 GMT
- Title: Something Just Like TRuST : Toxicity Recognition of Span and Target
- Authors: Berk Atil, Namrata Sureddy, Rebecca J. Passonneau
- Abstract summary: This paper introduces TRuST, a comprehensive dataset designed to improve toxicity detection. We benchmark state-of-the-art large language models (LLMs) on toxicity detection, target group identification, and toxic span extraction. We find that fine-tuned models consistently outperform zero-shot and few-shot prompting, though performance remains low for certain social groups.
- Score: 2.4169078025984825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Toxicity in online content, including content generated by language models, has become a critical concern due to its potential for negative psychological and social impact. This paper introduces TRuST, a comprehensive dataset designed to improve toxicity detection that merges existing datasets and has labels for toxicity, target social group, and toxic spans. It includes a diverse range of target groups such as ethnicity, gender, religion, disability, and politics, with both human/machine-annotated and human/machine-generated data. We benchmark state-of-the-art large language models (LLMs) on toxicity detection, target group identification, and toxic span extraction. We find that fine-tuned models consistently outperform zero-shot and few-shot prompting, though performance remains low for certain social groups. Further, reasoning capabilities do not significantly improve performance, indicating that LLMs have weak social reasoning skills.
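To make the three benchmark tasks concrete, here is a minimal sketch of what a TRuST-style record and a zero-shot prompt might look like. The field names, the example record, the prompt wording, and the span-level F1 helper are illustrative assumptions, not the paper's actual schema, data, or evaluation code.

```python
# Minimal sketch of the three TRuST benchmark tasks: toxicity detection,
# target group identification, and toxic span extraction.
# Field names, label sets, and the prompt template are illustrative
# assumptions, not the paper's actual schema.
from typing import TypedDict

class ToxicityRecord(TypedDict):
    text: str
    is_toxic: bool                       # task 1: toxicity detection
    target_groups: list[str]             # task 2: target group identification
    toxic_spans: list[tuple[int, int]]   # task 3: character-offset spans

example: ToxicityRecord = {
    "text": "People like them should not be allowed to vote.",
    "is_toxic": True,
    "target_groups": ["politics"],
    "toxic_spans": [(0, 48)],
}

def zero_shot_prompt(text: str) -> str:
    """Build a zero-shot prompt covering all three tasks."""
    return (
        "Label the following text.\n"
        "1. Is it toxic? (yes/no)\n"
        "2. Which social group, if any, does it target "
        "(e.g., ethnicity, gender, religion, disability, politics)?\n"
        "3. Quote the exact toxic span(s), if any.\n\n"
        f"Text: {text}"
    )

def span_f1(pred: set[int], gold: set[int]) -> float:
    """Character-level F1 for toxic span extraction."""
    if not pred and not gold:
        return 1.0
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(zero_shot_prompt(example["text"]))
    gold = {i for s, e in example["toxic_spans"] for i in range(s, e)}
    print(span_f1(gold, gold))  # 1.0 on a perfect prediction
```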
Related papers
- ModelCitizens: Representing Community Voices in Online Safety [25.030166169972418]
We introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K toxicity annotations across diverse identity groups.
Our findings highlight the importance of community-informed annotation and modeling for inclusive content moderation.
arXiv Detail & Related papers (2025-07-07T20:15:18Z)
- GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace [62.68664365246247]
This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs).
We propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the FFN parameters.
arXiv Detail & Related papers (2025-05-20T08:29:11Z)
- Aligned Probing: Relating Toxic Behavior and Model Internals [66.49887503194101]
We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs) with their internal representations.
Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives on toxicity for the first time.
Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers. A minimal probing sketch follows this entry.
arXiv Detail & Related papers (2025-03-17T17:23:50Z)
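To illustrate the generic idea of probing hidden states for toxicity, here is a minimal sketch, assuming a small Hugging Face model (distilgpt2 is only a stand-in, not a model from the paper) and toy labels; the aligned-probing framework itself is more involved, so this only shows fitting a linear probe on one layer's activations.

```python
# Generic probing sketch: fit a linear classifier on layer activations to
# predict whether the input text is toxic. Model choice, layer index, and the
# toy labels are illustrative assumptions, not the paper's actual setup.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModel.from_pretrained("distilgpt2", output_hidden_states=True)
model.eval()

texts = ["You are wonderful.", "You people are all idiots."]  # toy data
labels = [0, 1]  # 0 = non-toxic, 1 = toxic (hypothetical gold labels)
LAYER = 2        # probe an early layer, since lower layers are of interest

def layer_features(text: str) -> torch.Tensor:
    """Mean-pool one layer's hidden states into a fixed-size feature vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[LAYER].mean(dim=1).squeeze(0)

features = torch.stack([layer_features(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print(probe.predict(features))  # how well this layer linearly separates toxicity
```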
- Analyzing Toxicity in Open Source Software Communications Using Psycholinguistics and Moral Foundations Theory [5.03553492616371]
This paper investigates a machine learning-based approach for the automatic detection of toxic communications in Open Source Software (OSS).
We leverage psycholinguistic lexicons and Moral Foundations Theory to analyze toxicity in two types of OSS communication channels: issue comments and code reviews.
Using moral values as features is more effective than linguistic cues, yielding an F1-measure of 67.50% in identifying toxic instances in code review data and 64.83% in issue comments. A lexicon-feature sketch follows this entry.
arXiv Detail & Related papers (2024-12-17T17:52:00Z)
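As a rough illustration of using moral-foundations-style lexicon counts as classifier features, a minimal sketch follows; the tiny lexicon, example texts, and labels are invented for illustration and are not the paper's lexicons or data.

```python
# Sketch: count lexicon hits per moral foundation and feed the counts to a
# simple classifier. The lexicon and labeled examples are toy placeholders.
import re
from sklearn.linear_model import LogisticRegression

# Hypothetical mini-lexicon keyed by moral foundation (not the real MFD).
MORAL_LEXICON = {
    "care_harm": {"hurt", "attack", "abuse"},
    "fairness_cheating": {"unfair", "cheat", "biased"},
    "authority_subversion": {"obey", "disrespect", "rules"},
}

def moral_features(text: str) -> list[int]:
    """One count per foundation: how many lexicon words appear in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [len(tokens & words) for words in MORAL_LEXICON.values()]

# Toy issue-comment examples with hypothetical toxicity labels.
texts = [
    "This patch is fine, thanks for the quick fix.",
    "Stop being so biased and learn to respect the rules.",
    "You always attack reviewers instead of fixing your code.",
    "Looks good to me, merging now.",
]
labels = [0, 1, 1, 0]

X = [moral_features(t) for t in texts]
clf = LogisticRegression().fit(X, labels)
print(clf.predict([moral_features("Why do you cheat and hurt contributors?")]))
```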
- Can LLMs Recognize Toxicity? A Structured Investigation Framework and Toxicity Metric [16.423707276483178]
We introduce a robust metric grounded in Large Language Models (LLMs) to flexibly measure toxicity according to a given definition.
Our results demonstrate outstanding performance in measuring toxicity within verified factors, improving on conventional metrics by 12 points in F1 score. A prompting sketch follows this entry.
arXiv Detail & Related papers (2024-02-10T07:55:27Z)
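Below is a minimal sketch of what a definition-conditioned, LLM-based toxicity metric might look like; the prompt wording, the 0-5 scale, and the `call_llm` stub are assumptions for illustration and do not reflect the paper's actual framework.

```python
# Sketch of an LLM-as-judge toxicity score conditioned on an explicit
# definition. call_llm is a placeholder for whatever chat/completion API is
# available; the prompt and scale are illustrative choices.

TOXICITY_DEFINITION = (
    "Text is toxic if it is rude, disrespectful, or likely to make someone "
    "leave a discussion, including insults, threats, and identity attacks."
)

def build_judge_prompt(text: str, definition: str = TOXICITY_DEFINITION) -> str:
    return (
        f"Definition of toxicity: {definition}\n\n"
        f"Text: {text}\n\n"
        "On a scale from 0 (not toxic) to 5 (extremely toxic), rate the text "
        "according to the definition above. Answer with a single integer."
    )

def call_llm(prompt: str) -> str:
    """Placeholder: plug in a real LLM client here (e.g., a local model)."""
    raise NotImplementedError("wire this up to an actual LLM endpoint")

def toxicity_score(text: str) -> float:
    """Parse the judge's integer rating and normalize it to [0, 1]."""
    reply = call_llm(build_judge_prompt(text))
    digits = [c for c in reply if c.isdigit()]
    return min(int(digits[0]), 5) / 5.0 if digits else 0.0

if __name__ == "__main__":
    print(build_judge_prompt("You people never understand anything."))
```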
- Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models [21.341749351654453]
The generation of toxic content by large language models (LLMs) remains a critical challenge for the safe deployment of language technology.
We propose a novel framework for implicit knowledge editing and controlled text generation by fine-tuning LLMs with a prototype-based contrastive perplexity objective.
arXiv Detail & Related papers (2024-01-16T16:49:39Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs), combined with their impressive capabilities, may lead to new safety issues when they are exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting.
We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection [33.715318646717385]
ToxiGen is a large-scale dataset of 274k toxic and benign statements about 13 minority groups.
Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale.
We find that 94.5% of toxic examples are labeled as hate speech by human annotators.
arXiv Detail & Related papers (2022-03-17T17:57:56Z)
- Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields a lower false positive rate for both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis [4.251937086394346]
State-of-the-art BERT models were able to achieve a 76% macro-F1 score using monolingual data in the binary case.
We show that large-scale monolingual data is still needed to create more accurate models.
arXiv Detail & Related papers (2020-10-09T13:05:19Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language, which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. A minimal evaluation sketch follows this entry.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
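As a rough sketch of how toxic degeneration is typically quantified, the code below samples several continuations per prompt and reports the worst-case toxicity among them; the `generate` and `toxicity` stubs stand in for a real language model and a real toxicity scorer, and the aggregation shown is a plain illustration, not necessarily the paper's exact formulation.

```python
# Sketch: estimate how toxic a model's continuations get for a set of prompts.
# generate() and toxicity() are placeholders for a real LM and a real scorer.
import random
from statistics import mean

def generate(prompt: str, k: int = 5) -> list[str]:
    """Placeholder: sample k continuations from a language model."""
    return [f"{prompt} ... continuation {i}" for i in range(k)]

def toxicity(text: str) -> float:
    """Placeholder: return a toxicity probability in [0, 1] from a classifier."""
    return random.random()  # replace with a real toxicity classifier

def max_toxicity_per_prompt(prompts: list[str], k: int = 5) -> list[float]:
    """For each prompt, the worst-case toxicity over k sampled continuations."""
    return [max(toxicity(c) for c in generate(p, k)) for p in prompts]

if __name__ == "__main__":
    prompts = ["So I told him that", "The protesters started to"]
    worst = max_toxicity_per_prompt(prompts)
    print(f"mean worst-case toxicity over prompts: {mean(worst):.2f}")
```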
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.