Fortifying Toxic Speech Detectors Against Veiled Toxicity
- URL: http://arxiv.org/abs/2010.03154v1
- Date: Wed, 7 Oct 2020 04:43:48 GMT
- Title: Fortifying Toxic Speech Detectors Against Veiled Toxicity
- Authors: Xiaochuang Han, Yulia Tsvetkov
- Abstract summary: We propose a framework aimed at fortifying existing toxic speech detectors without a large labeled corpus of veiled toxicity.
Just a handful of probing examples are used to surface orders of magnitude more disguised offenses.
- Score: 38.20984369410193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern toxic speech detectors are incompetent in recognizing disguised
offensive language, such as adversarial attacks that deliberately avoid known
toxic lexicons, or manifestations of implicit bias. Building a large annotated
dataset for such veiled toxicity can be very expensive. In this work, we
propose a framework aimed at fortifying existing toxic speech detectors without
a large labeled corpus of veiled toxicity. Just a handful of probing examples
are used to surface orders of magnitude more disguised offenses. We augment the
toxic speech detector's training data with these discovered offensive examples,
thereby making it more robust to veiled toxicity while preserving its utility
in detecting overt toxicity.
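The abstract outlines a probe-and-augment pipeline: a handful of labeled probing examples surface many more disguised offenses from an unlabeled pool, and the surfaced examples are added to the detector's training data. The sketch below is a minimal illustration of that pipeline under an assumed similarity-based surfacing step; the abstract does not specify the surfacing mechanism, and the encoder choice, function names, and top-k cutoff here are illustrative, not the authors' implementation.

```python
# Illustrative sketch of the probe-and-augment idea; the similarity-based
# surfacing step and all names below are assumptions, not the paper's method.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed encoder choice


def surface_veiled_offenses(probe_texts, unlabeled_texts, top_k=1000):
    """Rank unlabeled posts by cosine similarity to a few veiled-toxicity probes."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    probe_vecs = encoder.encode(probe_texts, normalize_embeddings=True)
    pool_vecs = encoder.encode(unlabeled_texts, normalize_embeddings=True)
    # Score each candidate by its best match against any probing example.
    scores = (pool_vecs @ probe_vecs.T).max(axis=1)
    top_idx = np.argsort(-scores)[:top_k]
    return [unlabeled_texts[i] for i in top_idx]


def augment_training_data(train_texts, train_labels, surfaced_texts):
    """Append surfaced candidates as toxic (label 1) examples before retraining."""
    return train_texts + surfaced_texts, train_labels + [1] * len(surfaced_texts)
```

Retraining the detector on the augmented set is what the abstract credits with improving robustness to veiled toxicity while preserving performance on overt toxicity.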
Related papers
- Towards Building a Robust Toxicity Predictor [13.162016701556725]
This paper presents a novel adversarial attack, ToxicTrap, that introduces small word-level perturbations to fool SOTA text classifiers into predicting toxic text samples as benign.
Two novel goal function designs allow ToxicTrap to identify weaknesses in both multiclass and multilabel toxic language detectors.
arXiv Detail & Related papers (2024-04-09T22:56:05Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via zero-shot prompting alone.
We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- Facilitating Fine-grained Detection of Chinese Toxic Language: Hierarchical Taxonomy, Resources, and Benchmarks [18.44630180661091]
Existing datasets lack fine-grained annotation of toxic types and expressions.
It is crucial to introduce lexical knowledge to detect the toxicity of posts.
In this paper, we facilitate the fine-grained detection of Chinese toxic language.
arXiv Detail & Related papers (2023-05-08T03:50:38Z)
- Robust Conversational Agents against Imperceptible Toxicity Triggers [29.71051151620196]
We propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency.
We then propose a defense mechanism against such attacks which not only mitigates the attack but also attempts to maintain the conversational flow.
arXiv Detail & Related papers (2022-05-05T01:48:39Z)
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection [33.715318646717385]
ToxiGen is a large-scale dataset of 274k toxic and benign statements about 13 minority groups.
Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale.
We find that 94.5% of toxic examples are labeled as hate speech by human annotators.
arXiv Detail & Related papers (2022-03-17T17:57:56Z)
- Toxicity Detection can be Sensitive to the Conversational Context [64.28043776806213]
We construct and publicly release a dataset of 10,000 posts with two kinds of toxicity labels.
We introduce a new task, context sensitivity estimation, which aims to identify posts whose perceived toxicity changes if the context is also considered.
arXiv Detail & Related papers (2021-11-19T13:57:26Z)
- Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields a lower false positive rate on both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)