Probing LLMs for hate speech detection: strengths and vulnerabilities
- URL: http://arxiv.org/abs/2310.12860v2
- Date: Sat, 28 Oct 2023 05:07:31 GMT
- Title: Probing LLMs for hate speech detection: strengths and vulnerabilities
- Authors: Sarthak Roy, Ashish Harshavardhan, Animesh Mukherjee and Punyajoy Saha
- Abstract summary: We utilise different prompt variations and input information to evaluate large language models in a zero-shot setting.
We select three large language models (GPT-3.5, text-davinci and Flan-T5) and three datasets - HateXplain, implicit hate and ToxicSpans.
We find that, on average, including the target information in the pipeline improves model performance substantially.
- Score: 8.626059038321724
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, efforts have been made by social media platforms as well as
researchers to detect hateful or toxic language using large language models.
However, none of these works aim to use explanations, additional context and
victim community information in the detection process. We utilise different
prompt variations and input information to evaluate large language models in a
zero-shot setting (without adding any in-context examples). We select three large
language models (GPT-3.5, text-davinci and Flan-T5) and three datasets -
HateXplain, implicit hate and ToxicSpans. We find that, on average, including the
target information in the pipeline improves the model performance substantially
(~20-30%) over the baseline across the datasets. Adding the
rationales/explanations to the pipeline also has a considerable effect (~10-20%)
over the baseline across the datasets. In addition, we provide a typology of
the error cases where these large language models fail to (i) classify and (ii)
explain the reason for the decisions they take. Such vulnerable points
automatically constitute 'jailbreak' prompts for these models, and industry-scale
safeguard techniques need to be developed to make the models robust
against such prompts.
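A minimal sketch of the zero-shot prompting pipeline the paper evaluates, assuming the OpenAI chat API; the prompt wording, the classify helper and the gpt-3.5-turbo model id are illustrative assumptions, not the authors' exact setup.

```python
# Hedged sketch: zero-shot hate speech classification with optional target
# community and rationale inputs, the two additions the paper finds helpful.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify(post: str, target: str | None = None, rationale: str | None = None) -> str:
    prompt = f'Classify the following post as "hateful", "offensive" or "normal".\nPost: {post}\n'
    if target is not None:       # target-community information (~20-30% gain reported)
        prompt += f"Target community: {target}\n"
    if rationale is not None:    # rationale/explanation text (~10-20% gain reported)
        prompt += f"Rationale: {rationale}\n"
    prompt += "Answer with a single word."
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()
```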
Related papers
- Identifying and Mitigating Model Failures through Few-shot CLIP-aided
Diffusion Generation [65.268245109828]
We propose an end-to-end framework to generate text descriptions of failure modes associated with spurious correlations.
These descriptions can be used to generate synthetic data using generative models, such as diffusion models.
Our experiments have shown remarkable improvements in accuracy (~21%) on hard sub-populations.
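A hedged sketch of the generation stage under stated assumptions: once failure modes are described in text, an off-the-shelf diffusion model can synthesize matching data. The model id and the example description are illustrative, not the paper's configuration.

```python
# Sketch: turn hypothetical failure-mode descriptions into synthetic images
# with a public text-to-image diffusion model.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# In the paper these descriptions come from the CLIP-aided failure analysis;
# this one is purely illustrative.
failure_modes = ["a dog photographed indoors in low light"]

synthetic = [pipe(desc, num_images_per_prompt=4).images for desc in failure_modes]
```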
arXiv Detail & Related papers (2023-12-09T04:43:49Z) - Generative AI for Hate Speech Detection: Evaluation and Findings [11.478263835391436]
Generative AI has been utilized to generate large amounts of synthetic hate speech sequences.
In this chapter, we provide a review of relevant methods, experimental setups and evaluation of this approach.
It remains an open question whether the sensitivity of models such as GPT-3.5 and its successors can be improved using similar text generation techniques.
arXiv Detail & Related papers (2023-11-16T16:09:43Z) - Hate Speech Detection in Limited Data Contexts using Synthetic Data
Generation [1.9506923346234724]
We propose a data augmentation approach that addresses the lack of data for online hate speech detection in limited data contexts.
We present three methods to synthesize new examples of hate speech data in a target language that retain the hate sentiment of the original examples but transfer the hate targets.
Our findings show that a model trained on synthetic data performs comparably to, and in some cases outperforms, a model trained only on the samples available in the target domain.
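A simplified sketch of the target-transfer idea under stated assumptions: keep the hateful sentiment but swap mentions of one community for another. The lexicon, the placeholder terms and the replacement rule are hypothetical stand-ins for the paper's method.

```python
# Sketch: transfer the hate target by substituting community terms.
import re

# Hypothetical lexicon; real entries would come from curated term lists.
TARGET_LEXICON = {
    "group_a": ["<group-a-term-1>", "<group-a-term-2>"],
    "group_b": ["<group-b-term-1>"],
}

def transfer_target(text: str, src: str, dst: str) -> str:
    """Replace mentions of the source community with the destination community."""
    out = text
    for term in TARGET_LEXICON[src]:
        out = re.sub(re.escape(term), TARGET_LEXICON[dst][0], out, flags=re.IGNORECASE)
    return out
```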
arXiv Detail & Related papers (2023-10-04T15:10:06Z) - Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of retrieval-augmented language models (RALMs) is that retrieved information should help model performance when it is relevant, and not harm it when it is not.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z) - Can ChatGPT Detect Intent? Evaluating Large Language Models for Spoken
Language Understanding [13.352795145385645]
Large pretrained language models have demonstrated strong language understanding capabilities.
We evaluate several such models, including ChatGPT and OPT models of different sizes, on multiple benchmarks.
We show, however, that the model is worse at slot filling, and its performance is sensitive to ASR errors.
arXiv Detail & Related papers (2023-05-22T21:59:26Z) - POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained
models [62.23255433487586]
We propose an unsupervised framework for fine-tuning the model or the prompt directly on unlabeled target data.
We demonstrate how to apply our method to both language-augmented vision and masked-language models by aligning the discrete distributions extracted from the prompts and target data.
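A minimal sketch of distribution alignment under stated assumptions: push the model's class marginal on unlabeled target data toward a prompt-derived prior. This KL-plus-entropy objective is a simplification for illustration, not POUF's actual (transport-based) loss.

```python
# Sketch: align the prediction marginal with a prompt-derived class prior.
import torch
import torch.nn.functional as F

def alignment_loss(logits: torch.Tensor, prompt_prior: torch.Tensor) -> torch.Tensor:
    """logits: [batch, num_classes]; prompt_prior: [num_classes], sums to 1."""
    probs = F.softmax(logits, dim=-1)
    marginal = probs.mean(dim=0)  # model's class marginal on the target batch
    kl = (marginal * (marginal.clamp_min(1e-8).log() - prompt_prior.log())).sum()
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
    return kl + entropy  # match marginals while keeping per-example predictions sharp
```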
arXiv Detail & Related papers (2023-04-29T22:05:22Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
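One standard remedy for this imbalance, shown here as a hedged sketch with illustrative counts, is to weight the loss inversely to class frequency.

```python
# Sketch: inverse-frequency class weights for an imbalanced hate speech corpus.
import torch
import torch.nn as nn

counts = torch.tensor([9000.0, 1000.0])          # non-hate vs. hate (hypothetical)
weights = counts.sum() / (len(counts) * counts)  # rarer class gets a larger weight
criterion = nn.CrossEntropyLoss(weight=weights)
```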
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And
Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that detect oversensitivity- and overstability-inducing samples with high accuracy.
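A hedged sketch of the word-order probe implied by the bag-of-words finding: if shuffling an essay's words barely moves the score, the model is ignoring syntax. score_essay is a hypothetical stand-in for a trained autoscoring model.

```python
# Sketch: measure an autoscorer's sensitivity to word order.
import random
from typing import Callable

def shuffle_words(essay: str, seed: int = 0) -> str:
    words = essay.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def order_sensitivity(essay: str, score_essay: Callable[[str], float]) -> float:
    """Absolute score change under shuffling; near zero suggests bag-of-words behavior."""
    return abs(score_essay(essay) - score_essay(shuffle_words(essay)))
```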
arXiv Detail & Related papers (2021-09-24T03:49:38Z) - Leveraging Multi-domain, Heterogeneous Data using Deep Multitask
Learning for Hate Speech Detection [21.410160004193916]
We propose Convolutional Neural Network-based multi-task learning (MTL) models to leverage information from multiple sources.
Empirical analysis performed on three benchmark datasets shows the efficacy of the proposed approach.
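A hedged sketch of such a multi-task setup: one shared convolutional text encoder with a classification head per dataset. Layer sizes and the two-class heads are illustrative, not the paper's configuration.

```python
# Sketch: shared CNN encoder with per-task classification heads.
import torch
import torch.nn as nn

class MultiTaskCNN(nn.Module):
    def __init__(self, vocab_size: int, num_tasks: int, emb_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1)  # shared across tasks
        self.heads = nn.ModuleList([nn.Linear(128, 2) for _ in range(num_tasks)])

    def forward(self, tokens: torch.Tensor, task: int) -> torch.Tensor:
        x = self.embed(tokens).transpose(1, 2)          # [batch, emb_dim, seq_len]
        x = torch.relu(self.conv(x)).max(dim=2).values  # global max pooling
        return self.heads[task](x)                      # task-specific logits
```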
arXiv Detail & Related papers (2021-03-23T09:31:01Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)