Evaluating GPT-3 Generated Explanations for Hateful Content Moderation
- URL: http://arxiv.org/abs/2305.17680v4
- Date: Wed, 30 Aug 2023 16:17:27 GMT
- Title: Evaluating GPT-3 Generated Explanations for Hateful Content Moderation
- Authors: Han Wang, Ming Shan Hee, Md Rabiul Awal, Kenny Tsu Wei Choo, Roy
Ka-Wei Lee
- Abstract summary: We use GPT-3 to generate explanations for both hateful and non-hateful content.
A survey was conducted with 2,400 unique respondents to evaluate the generated explanations.
Our findings reveal that human evaluators rated the GPT-generated explanations as high quality in terms of linguistic fluency, informativeness, persuasiveness, and logical soundness.
- Score: 8.63841985804905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has focused on using large language models (LLMs) to generate
explanations for hate speech through fine-tuning or prompting. Despite the
growing interest in this area, the effectiveness and potential limitations of
these generated explanations remain poorly understood. A key concern is that
LLM-generated explanations may lead both users and content moderators to
erroneous judgments about the nature of flagged content. For instance,
an LLM-generated explanation might inaccurately convince a content moderator
that a benign piece of content is hateful. In light of this, we propose an
analytical framework for examining hate speech explanations and conduct an
extensive survey to evaluate such explanations. Specifically, we prompted
GPT-3 to generate explanations for both hateful and non-hateful content and
surveyed 2,400 unique respondents to evaluate the generated
explanations. Our findings reveal that (1) human evaluators rated the
GPT-generated explanations as high quality in terms of linguistic fluency,
informativeness, persuasiveness, and logical soundness, (2) the persuasive
nature of these explanations, however, varied depending on the prompting
strategy employed, and (3) this persuasiveness may result in incorrect
judgments about the hatefulness of the content. Our study underscores the need
for caution in applying LLM-generated explanations for content moderation. Code
and results are available at https://github.com/Social-AI-Studio/GPT3-HateEval.
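Illustrative prompting sketch (not part of the original abstract): the snippet below shows one way explanations like those studied in the paper could be elicited from an OpenAI chat model. The prompt wording, model name, and sampling parameters are assumptions chosen for illustration; the paper's actual prompting strategies and GPT-3 setup are documented in the repository linked above.
```python
# Minimal sketch, assuming the openai Python client (>= 1.0) and an API key in
# the OPENAI_API_KEY environment variable. Model name and prompt text are
# illustrative stand-ins, not the paper's exact configuration.
from openai import OpenAI

client = OpenAI()

def generate_explanation(post: str) -> str:
    """Ask the model whether a post is hateful and why."""
    prompt = (
        f'Post: "{post}"\n\n'
        "Is this post hateful? Explain your reasoning in 2-3 sentences, "
        "naming the targeted group (if any) and the implied meaning."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for the GPT-3 models used in the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=150,
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(generate_explanation("<post flagged by the moderation pipeline>"))
```
In the paper's setting, explanations produced this way were shown to survey respondents, which is why the choice of prompting strategy directly shapes how persuasive the output appears.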
Related papers
- A Multi-Task Text Classification Pipeline with Natural Language Explanations: A User-Centric Evaluation in Sentiment Analysis and Offensive Language Identification in Greek Tweets [8.846643533783205]
This work introduces an early concept for a novel pipeline that can be used in text classification tasks.
It comprises two models: a classifier for labelling the text and an explanation generator that provides the explanation.
Experiments are centred around the tasks of sentiment analysis and offensive language identification in Greek tweets.
arXiv Detail & Related papers (2024-10-14T08:41:31Z)
- Exploring the Effect of Explanation Content and Format on User Comprehension and Trust [11.433655064494896]
We focus on explanations for a regression tool for assessing cancer risk.
We examine the effect of the explanations' content and format on the user-centric metrics of comprehension and trust.
arXiv Detail & Related papers (2024-08-30T16:36:53Z)
- Evaluating the Reliability of Self-Explanations in Large Language Models [2.8894038270224867]
We evaluate two kinds of such self-explanations - extractive and counterfactual.
Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process.
We show that this gap can be bridged: prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results.
arXiv Detail & Related papers (2024-07-19T17:41:08Z)
- Scenarios and Approaches for Situated Natural Language Explanations [18.022428746019582]
We collect a benchmarking dataset, Situation-Based Explanation.
This dataset contains 100 explanandums.
For each "explanandum paired with an audience" situation, we include a human-written explanation.
We examine three categories of prompting methods: rule-based prompting, meta-prompting, and in-context learning prompting.
arXiv Detail & Related papers (2024-06-07T15:56:32Z)
- DELL: Generating Reactions and Explanations for LLM-Based Misinformation Detection [50.805599761583444]
Large language models face challenges with factuality and hallucination that limit their direct, off-the-shelf use for judging the veracity of news articles.
We propose DELL, which identifies three key stages in misinformation detection where LLMs could be incorporated as part of the pipeline.
arXiv Detail & Related papers (2024-02-16T03:24:56Z)
- Complementary Explanations for Effective In-Context Learning [77.83124315634386]
Large language models (LLMs) have exhibited remarkable capabilities in learning from explanations in prompts.
This work aims to better understand the mechanisms by which explanations are used for in-context learning.
arXiv Detail & Related papers (2022-11-25T04:40:47Z)
- Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations [82.12092864529605]
We study the connection between explainability and sample hardness.
We compare human-written explanations with those generated by GPT-3.
We also find that hardness of the in-context examples impacts the quality of GPT-3 explanations.
arXiv Detail & Related papers (2022-11-14T16:46:14Z)
- The Unreliability of Explanations in Few-Shot In-Context Learning [50.77996380021221]
We focus on two NLP tasks that involve reasoning over text, namely question answering and natural language inference.
We show that explanations judged as good by humans (those that are logically consistent with the input) usually indicate more accurate predictions.
We present a framework for calibrating model predictions based on the reliability of the explanations.
arXiv Detail & Related papers (2022-05-06T17:57:58Z)
- Human Interpretation of Saliency-based Explanation Over Text [65.29015910991261]
We study saliency-based explanations over textual data.
We find that people often misinterpret the explanations.
We propose a method to adjust saliencies based on model estimates of over- and under-perception.
arXiv Detail & Related papers (2022-01-27T15:20:32Z)
- Reframing Human-AI Collaboration for Generating Free-Text Explanations [46.29832336779188]
We consider the task of generating free-text explanations using a small number of human-written examples.
We find that crowdworkers often prefer explanations generated by GPT-3 to crowdsourced human-written explanations.
We create a pipeline that combines GPT-3 with a supervised filter that incorporates humans-in-the-loop via binary acceptability judgments.
arXiv Detail & Related papers (2021-12-16T07:31:37Z)