On the definition of toxicity in NLP
- URL: http://arxiv.org/abs/2310.02357v3
- Date: Thu, 19 Oct 2023 18:38:09 GMT
- Title: On the definition of toxicity in NLP
- Authors: Sergey Berezin, Reza Farahbakhsh, Noel Crespi
- Abstract summary: This work suggests a new, stress-level-based definition of toxicity designed to be objective and context-aware.
Alongside it, we also describe possible ways of applying this new definition to dataset creation and model training.
- Score: 2.1830650692803863
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The fundamental problem in the toxicity detection task lies in the fact that toxicity itself is ill-defined. This forces us to rely on subjective and vague data when training models, which yields non-robust and inaccurate results: garbage in, garbage out.
This work suggests a new, stress-level-based definition of toxicity designed to be objective and context-aware. Alongside it, we also describe possible ways of applying this new definition to dataset creation and model training.
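As an illustration only (the abstract does not specify a concrete schema), below is a minimal sketch of how a stress-level-based label could be recorded during dataset creation: each annotation stores the annotator's self-reported stress level together with the context shown to them, and the final score aggregates those levels instead of a binary toxic/non-toxic vote. The class names, 0-10 scale, mean aggregation, and threshold are hypothetical assumptions, not the authors' specification.

```python
# Hypothetical sketch of a stress-level-based toxicity label.
# Schema, scale, aggregation rule, and threshold are illustrative assumptions.
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class StressAnnotation:
    annotator_id: str
    stress_level: float  # self-reported stress on an assumed 0-10 scale
    context: str         # the conversational context shown to the annotator


@dataclass
class Sample:
    text: str
    annotations: List[StressAnnotation]

    def toxicity_score(self) -> float:
        """Aggregate stress levels into a continuous, context-aware score."""
        return mean(a.stress_level for a in self.annotations)

    def is_toxic(self, threshold: float = 5.0) -> bool:
        """Binarize only if a downstream model needs hard labels."""
        return self.toxicity_score() >= threshold


# Usage example
sample = Sample(
    text="You people never get anything right.",
    annotations=[
        StressAnnotation("a1", 6.5, context="heated political thread"),
        StressAnnotation("a2", 4.0, context="heated political thread"),
    ],
)
print(sample.toxicity_score())  # 5.25
print(sample.is_toxic())        # True
```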
Related papers
- Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks have been proven to be vulnerable to data poisoning attacks.
Detecting poisoned samples in a mixed dataset is both valuable and challenging.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z)
- ToXCL: A Unified Framework for Toxic Speech Detection and Explanation [3.803993344850168]
ToXCL is a unified framework for the detection and explanation of implicit toxic speech.
ToXCL achieves new state-of-the-art effectiveness and significantly outperforms the baselines.
arXiv Detail & Related papers (2024-03-25T12:21:38Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs), combined with their impressive capabilities, may lead to new safety issues when they are exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting.
We propose a reinforcement learning (RL) based attack method to further induce implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- CMD: a framework for Context-aware Model self-Detoxification [22.842468869653818]
Text detoxification aims to minimize the risk of language models producing toxic content.
Existing detoxification methods fail to achieve a decent balance between detoxification effectiveness and generation quality.
We introduce a Context-aware Model self-Detoxification (CMD) framework that attends to both the context and the detoxification process.
arXiv Detail & Related papers (2023-08-16T11:50:38Z)
- Toxicity Inspector: A Framework to Evaluate Ground Truth in Toxicity Detection Through Feedback [0.0]
This paper introduces a toxicity inspector framework that incorporates a human-in-the-loop pipeline.
It aims to enhance the reliability of toxicity benchmark datasets by centering the evaluator's values through an iterative feedback cycle.
arXiv Detail & Related papers (2023-05-11T11:56:42Z)
- Detoxifying Language Models with a Toxic Corpus [16.7345472998388]
We propose to use a toxic corpus as an additional resource for reducing toxicity.
Our results show that a toxic corpus can indeed help to substantially reduce the toxicity of the language generation process.
arXiv Detail & Related papers (2022-04-30T18:25:18Z)
- Toxicity Detection can be Sensitive to the Conversational Context [64.28043776806213]
We construct and publicly release a dataset of 10,000 posts with two kinds of toxicity labels.
We introduce a new task, context sensitivity estimation, which aims to identify posts whose perceived toxicity changes if the context is also considered.
arXiv Detail & Related papers (2021-11-19T13:57:26Z)
- Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields a lower false positive rate on both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- ToxCCIn: Toxic Content Classification with Interpretability [16.153683223016973]
Explanations are important for tasks like offensive language or toxicity detection on social media.
We propose a technique to improve the interpretability of transformer models, based on a simple and powerful assumption.
We find this approach effective; it can produce explanations that exceed the quality of those provided by Logistic Regression analysis.
arXiv Detail & Related papers (2021-03-01T22:17:10Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.