Redefining Toxicity: An Objective and Context-Aware Approach for Stress-Level-Based Detection
- URL: http://arxiv.org/abs/2503.16072v1
- Date: Thu, 20 Mar 2025 12:09:01 GMT
- Title: Redefining Toxicity: An Objective and Context-Aware Approach for Stress-Level-Based Detection
- Authors: Sergey Berezin, Reza Farahbakhsh, Noel Crespi
- Abstract summary: This study introduces a novel, objective, and context-aware framework for toxicity detection. We propose a new definition, metric, and training approach as parts of our framework and demonstrate its effectiveness.
- Score: 1.9424018922013224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The fundamental problem of toxicity detection lies in the fact that the term "toxicity" is ill-defined. Such uncertainty causes researchers to rely on subjective and vague data during model training, which leads to non-robust and inaccurate results, following the 'garbage in, garbage out' paradigm. This study introduces a novel, objective, and context-aware framework for toxicity detection, leveraging stress levels as a key determinant of toxicity. We propose a new definition, a metric, and a training approach as parts of our framework and demonstrate their effectiveness using a dataset we collected.
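To make the framing concrete, here is a minimal sketch of what a stress-level-based detector could look like: toxicity is scored by regressing the stress level a message induces in readers and thresholding the prediction. The tiny dataset, the TF-IDF/ridge pipeline, and the 0.5 threshold are illustrative assumptions, not the paper's actual definition, metric, or training setup.

```python
# Minimal sketch (assumptions throughout): toxicity as predicted reader stress.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical annotations: each message is labeled with the stress level
# (0.0 = calm, 1.0 = highly stressed) it induced in readers.
texts = ["have a great day", "you are worthless and everyone knows it"]
stress = [0.05, 0.92]

# Regress a continuous stress level instead of a vague binary "toxic" label.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge())
model.fit(texts, stress)

def is_toxic(message: str, threshold: float = 0.5) -> bool:
    """Flag a message when its predicted induced stress crosses the threshold."""
    return float(model.predict([message])[0]) >= threshold
```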
Related papers
- Aligned Probing: Relating Toxic Behavior and Model Internals [66.49887503194101]
We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs) with their internal representations.
Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives on toxicity for the first time.
Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers.
arXiv Detail & Related papers (2025-03-17T17:23:50Z)
- Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks have been shown to be vulnerable to data poisoning attacks.
Detecting poisoned samples within a mixed dataset is both valuable and challenging.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z)
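As a rough illustration of the idea above (not the paper's specific algorithm), an iterative filter can repeatedly train a model and discard the samples it fits suspiciously easily, on the assumption that unlearnable examples act as low-loss shortcuts; the loss proxy, drop fraction, and round count below are all hypothetical choices.

```python
# Generic iterative-filtering sketch; the selection criterion and stopping
# rule here are simplifying assumptions, not the paper's exact procedure.
import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_filter(X, y, rounds=3, drop_frac=0.1):
    keep = np.arange(len(X))  # indices of samples still considered clean
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=200).fit(X[keep], y[keep])
        probs = clf.predict_proba(X[keep])
        # Per-sample log-loss (y assumed to hold integer class labels);
        # unlearnable examples tend to be fit abnormally easily.
        loss = -np.log(probs[np.arange(len(keep)), y[keep]] + 1e-12)
        n_drop = max(1, int(drop_frac * len(keep)))
        keep = keep[np.argsort(loss)[n_drop:]]  # drop the lowest-loss samples
    return keep
```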
- ToXCL: A Unified Framework for Toxic Speech Detection and Explanation [3.803993344850168]
ToXCL is a unified framework for the detection and explanation of implicit toxic speech.
It achieves new state-of-the-art effectiveness and significantly outperforms baselines.
arXiv Detail & Related papers (2024-03-25T12:21:38Z)
- Can LLMs Recognize Toxicity? A Structured Investigation Framework and Toxicity Metric [16.423707276483178]
We introduce a robust metric grounded in Large Language Models (LLMs) to flexibly measure toxicity according to a given definition.
Our results demonstrate outstanding performance in measuring toxicity within verified factors, improving on conventional metrics by 12 points in the F1 score.
arXiv Detail & Related papers (2024-02-10T07:55:27Z)
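A minimal sketch of the prompting pattern such a definition-grounded metric implies: give the model an explicit definition and ask for a scalar rating. The prompt wording, the 0-10 scale, and `call_llm` are placeholders, not the paper's actual template or API.

```python
# Definition-grounded toxicity scoring via an LLM (illustrative sketch).
def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client; plug in a real API here."""
    raise NotImplementedError

def llm_toxicity_score(text: str, definition: str) -> float:
    prompt = (
        f"Definition of toxicity: {definition}\n\n"
        "Using only this definition, rate the toxicity of the text below "
        "from 0 (not toxic) to 10 (extremely toxic). Answer with one number.\n\n"
        f"Text: {text}"
    )
    return float(call_llm(prompt).strip()) / 10.0  # normalize to [0, 1]
```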
- On the definition of toxicity in NLP [2.1830650692803863]
This work suggests a new, stress-level-based definition of toxicity designed to be objective and context-aware.
Alongside it, we describe possible ways of applying this new definition to dataset creation and model training.
arXiv Detail & Related papers (2023-10-03T18:32:34Z)
- On Practical Aspects of Aggregation Defenses against Data Poisoning Attacks [58.718697580177356]
Attacks on deep learning models with malicious training samples are known as data poisoning.
Recent advances in defense strategies against data poisoning have highlighted the effectiveness of aggregation schemes in achieving certified poisoning robustness.
Here we focus on Deep Partition Aggregation, a representative aggregation defense, and assess its practical aspects, including efficiency, performance, and robustness.
arXiv Detail & Related papers (2023-06-28T17:59:35Z)
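For readers unfamiliar with the defense discussed above, the core of Deep Partition Aggregation is easy to sketch: split the training set into disjoint shards, train one base model per shard, and predict by majority vote, so any single poisoned sample can corrupt at most one base model. The base learner and index-based partitioning below are simplifications of the actual method.

```python
# Sketch of partition-and-vote aggregation in the style of Deep Partition
# Aggregation; assumes each shard contains examples of every class.
from collections import Counter
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_dpa(X, y, k=5):
    models = []
    for shard in range(k):
        idx = np.arange(len(X)) % k == shard  # deterministic disjoint shards
        models.append(LogisticRegression(max_iter=200).fit(X[idx], y[idx]))
    return models

def predict_dpa(models, x):
    votes = Counter(int(m.predict([x])[0]) for m in models)
    return votes.most_common(1)[0][0]  # majority vote across base models
```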
- Toxicity Inspector: A Framework to Evaluate Ground Truth in Toxicity Detection Through Feedback [0.0]
This paper introduces a toxicity inspector framework that incorporates a human-in-the-loop pipeline.
It aims to enhance the reliability of toxicity benchmark datasets by centering the evaluator's values through an iterative feedback cycle.
arXiv Detail & Related papers (2023-05-11T11:56:42Z)
- Analysis and Detectability of Offline Data Poisoning Attacks on Linear Dynamical Systems [0.30458514384586405]
We study how poisoning impacts the least-squares estimate through the lens of statistical testing.
We propose a stealthy data poisoning attack on the least-squares estimator that can escape classical statistical tests.
arXiv Detail & Related papers (2022-11-16T10:01:03Z)
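For intuition about the setting above, the object under attack can be written down directly; the formulation below is the standard least-squares estimator with an additive poisoning perturbation, not the paper's specific stealthy construction.

```latex
% Ordinary least squares and its poisoned counterpart: the attacker picks
% (\Delta X, \Delta y) to bias the estimate while keeping residual statistics
% consistent with the assumed noise model, evading classical tests.
\hat{\theta} = (X^\top X)^{-1} X^\top y,
\qquad
\hat{\theta}_{\mathrm{poisoned}}
  = \big((X+\Delta X)^\top (X+\Delta X)\big)^{-1} (X+\Delta X)^\top (y+\Delta y)
```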
- Detoxifying Language Models with a Toxic Corpus [16.7345472998388]
We propose to use a toxic corpus as an additional resource for reducing toxicity.
Our results show that a toxic corpus can indeed help to substantially reduce the toxicity of the language generation process.
arXiv Detail & Related papers (2022-04-30T18:25:18Z)
- Handling Bias in Toxic Speech Detection: A Survey [26.176340438312376]
We look at proposed methods for evaluating and mitigating bias in toxic speech detection.
A case study introduces the concept of bias shift due to knowledge-based bias mitigation.
The survey concludes with an overview of the critical challenges, research gaps, and future directions.
arXiv Detail & Related papers (2022-01-26T10:38:36Z)
- Toxicity Detection can be Sensitive to the Conversational Context [64.28043776806213]
We construct and publicly release a dataset of 10,000 posts with two kinds of toxicity labels.
We introduce a new task, context sensitivity estimation, which aims to identify posts whose perceived toxicity changes if the context is also considered.
arXiv Detail & Related papers (2021-11-19T13:57:26Z)
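The context-sensitivity task introduced above can be sketched directly from its description: score a post with and without its conversational context and measure the shift. `toxicity` below is a placeholder for any scorer, and the absolute-difference measure is an illustrative choice, not the paper's exact formulation.

```python
# Context-sensitivity sketch: how much does perceived toxicity change when the
# parent comment is included? `toxicity` stands in for any [0, 1] scorer.
def toxicity(text: str) -> float:
    raise NotImplementedError("plug in a toxicity classifier")

def context_sensitivity(post: str, parent: str) -> float:
    score_alone = toxicity(post)
    score_in_context = toxicity(parent + "\n" + post)
    return abs(score_in_context - score_alone)  # large shift = context-sensitive
```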
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
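In the spirit of the evaluation above, a prompt-based degeneration test can be sketched as follows: sample several continuations per prompt and report the expected maximum toxicity across prompts. `generate` and `toxicity` are placeholders, and the sample count is an arbitrary choice, not the benchmark's official protocol.

```python
# Prompt-conditioned toxic-degeneration sketch: expected maximum toxicity
# over sampled continuations. Both callables are placeholders.
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in a language model")

def toxicity(text: str) -> float:
    raise NotImplementedError("plug in a toxicity scorer returning [0, 1]")

def expected_max_toxicity(prompts, samples_per_prompt=25):
    per_prompt = [
        max(toxicity(generate(p)) for _ in range(samples_per_prompt))
        for p in prompts
    ]
    return sum(per_prompt) / len(per_prompt)
```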