Debunking with Dialogue? Exploring AI-Generated Counterspeech to Challenge Conspiracy Theories
- URL: http://arxiv.org/abs/2504.16604v1
- Date: Wed, 23 Apr 2025 10:32:45 GMT
- Authors: Mareike Lisker, Christina Gottschalk, Helena Mihaljević
- Abstract summary: We evaluate the ability of GPT-4o, Llama 3, and Mistral to effectively apply counterspeech strategies derived from psychological research provided through structured prompts. Our results show that the models often generate generic, repetitive, or superficial results. They over-acknowledge fear and frequently hallucinate facts, sources, or figures, making their prompt-based use problematic.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Counterspeech is a key strategy against harmful online content, but scaling expert-driven efforts is challenging. Large Language Models (LLMs) present a potential solution, though their use in countering conspiracy theories is under-researched. Unlike for hate speech, no datasets exist that pair conspiracy theory comments with expert-crafted counterspeech. We address this gap by evaluating the ability of GPT-4o, Llama 3, and Mistral to apply counterspeech strategies, derived from psychological research, that are supplied to the models through structured prompts. Our evaluation shows that the models often generate generic, repetitive, or superficial responses. Additionally, they over-acknowledge fear and frequently hallucinate facts, sources, or figures, making their prompt-based use in practical applications problematic.
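As a concrete illustration of the setup described in the abstract, the following is a minimal Python sketch of generating counterspeech from a structured strategy prompt. The strategy wording, the example comment, and the decoding settings are illustrative placeholders rather than the paper's actual prompts, and access to GPT-4o through the openai package is assumed.

```python
# Minimal sketch: prompting an LLM with a structured counterspeech strategy.
# The strategy text below is an illustrative placeholder, not the paper's prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

STRATEGY = (
    "Strategy: empathic refutation. Acknowledge the commenter's underlying "
    "concern without validating the conspiracy claim, then point to the "
    "strongest verifiable counter-evidence. Do not invent sources or figures."
)

def generate_counterspeech(comment: str) -> str:
    """Apply one structured strategy to one conspiracy-theory comment."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STRATEGY},
            {"role": "user", "content": f"Comment:\n{comment}\n\nWrite a short counterspeech reply."},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(generate_counterspeech("They put microchips in the vaccines to track us."))
```

Given the paper's finding that models hallucinate facts, sources, and figures, any output from such a pipeline would still require expert review before being posted.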
Related papers
- The Illusionist's Prompt: Exposing the Factual Vulnerabilities of Large Language Models with Linguistic Nuances [23.908718176644634]
Large Language Models (LLMs) are increasingly relied upon as real-time sources of information by non-expert users. We introduce The Illusionist's Prompt, a novel hallucination attack that incorporates linguistic nuances into adversarial queries. Our attack automatically generates highly transferable illusory prompts to induce internal factual errors, all while preserving user intent and semantics.
arXiv Detail & Related papers (2025-04-01T07:10:00Z)
- Chaos with Keywords: Exposing Large Language Models Sycophantic Hallucination to Misleading Keywords and Evaluating Defense Strategies [47.92996085976817]
This study explores the sycophantic tendencies of Large Language Models (LLMs): they tend to provide answers that match what users want to hear, even if they are not entirely correct.
arXiv Detail & Related papers (2024-06-06T08:03:05Z)
- Classifying Conspiratorial Narratives At Scale: False Alarms and Erroneous Connections [4.594855794205588]
This work establishes a general scheme for classifying discussions related to conspiracy theories.
We leverage human-labeled ground truth to train a BERT-based model for classifying online conspiracy theories (CTs); a minimal sketch of such a classifier follows this entry.
We present the first large-scale classification study using posts from the most active conspiracy-related Reddit forums.
arXiv Detail & Related papers (2024-03-29T20:29:12Z)
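The entry above trains a BERT-based classifier on human-labeled data. A minimal sketch of the inference side of such a classifier, using the Hugging Face transformers library, might look as follows; the base checkpoint, the binary label set, and the example post are assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a BERT-based classifier for conspiracy-related posts.
# Checkpoint and label set are illustrative; fine-tuning on labeled data
# is required before the predictions mean anything.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g., conspiracy vs. non-conspiracy
)

def classify(post: str) -> int:
    """Return the predicted label index for a single post."""
    inputs = tokenizer(post, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))

print(classify("The moon landing was staged in a studio."))
```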
- Outcome-Constrained Large Language Models for Countering Hate Speech [10.434435022492723]
This study aims to develop methods for generating counterspeech constrained by conversation outcomes.
We experiment with large language models (LLMs) to incorporate two desired conversation outcomes into the text generation process.
Evaluation results show that our methods effectively steer the generation of counterspeech toward the desired outcomes.
arXiv Detail & Related papers (2024-03-25T19:44:06Z)
- An Investigation of Large Language Models for Real-World Hate Speech Detection [46.15140831710683]
A major limitation of existing methods is that hate speech detection is a highly contextual problem.
Recently, large language models (LLMs) have demonstrated state-of-the-art performance in several natural language tasks.
Our study reveals that a meticulously crafted reasoning prompt can effectively capture the context of hate speech.
arXiv Detail & Related papers (2024-01-07T00:39:33Z)
- HateRephrase: Zero- and Few-Shot Reduction of Hate Intensity in Online Posts using Large Language Models [4.9711707739781215]
This paper investigates an approach that suggests a rephrasing of potentially hateful content even before the post is made.
We develop four different prompts based on task description, hate definition, few-shot demonstrations, and chain-of-thought reasoning (sketched below).
We find that GPT-3.5 outperforms the baseline and open-source models for all the different kinds of prompts.
arXiv Detail & Related papers (2023-10-21T12:18:29Z)
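The HateRephrase entry above names four prompt designs. The sketch below shows one plausible way to assemble them; all prompt wording is an illustrative assumption, not the paper's actual prompts.

```python
# Minimal sketch of the four prompt styles named in the HateRephrase entry:
# task description, hate definition, few-shot demonstration, chain-of-thought.
# All wording is illustrative, not the paper's actual prompts.
TASK = "Rewrite the following post so it keeps its meaning but removes hateful language."
DEFINITION = "Hate speech attacks or demeans a person or group based on a protected attribute."
FEW_SHOT = (
    "Post: 'Those people are vermin.'\n"
    "Rewrite: 'I strongly disagree with that group's position.'"
)
COT = "First identify the hateful phrases, explain why they are hateful, then rewrite the post."

def build_prompt(post: str, style: str) -> str:
    """Assemble one of the four prompt variants around a draft post."""
    parts = {
        "task": [TASK],
        "definition": [TASK, DEFINITION],
        "few_shot": [TASK, FEW_SHOT],
        "cot": [TASK, COT],
    }[style]  # raises KeyError for an unknown style
    return "\n\n".join(parts + [f"Post: {post!r}\nRewrite:"])

print(build_prompt("some hateful draft", "few_shot"))
```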
- Effective Prompt Extraction from Language Models [70.00099540536382]
We present a framework for measuring the effectiveness of prompt extraction attacks.
In experiments with 3 different sources of prompts and 11 underlying large language models, we find that simple text-based attacks can in fact reveal prompts with high probability.
Our framework determines with high precision whether an extracted prompt is the actual secret prompt, rather than a model hallucination.
arXiv Detail & Related papers (2023-07-13T16:15:08Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
- A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation.
It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries.
However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z)
- Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z)
- Characterizing the adversarial vulnerability of speech self-supervised learning [95.03389072594243]
We make the first attempt to investigate the adversarial vulnerability of this paradigm under attack from both zero-knowledge and limited-knowledge adversaries.
The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries.
arXiv Detail & Related papers (2021-11-08T08:44:04Z)
- Generating Counter Narratives against Online Hate Speech: Data and Strategies [21.098614110697184]
We present a study on how to collect responses to hate effectively.
We employ large-scale unsupervised language models such as GPT-2 to generate silver data (see the sketch after this entry).
The best annotation strategies and neural architectures can then be used to filter the data before expert validation and post-editing.
arXiv Detail & Related papers (2020-04-08T19:35:00Z)
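The last entry generates silver data with GPT-2 for later expert filtering and post-editing. A minimal sketch of that generation step with the Hugging Face pipeline API could look like this; the prompt format and decoding settings are assumptions, not the paper's setup.

```python
# Minimal sketch of silver-data generation with GPT-2: sample several
# candidate counter-narratives per hate message, to be filtered and
# post-edited by experts afterwards.
from transformers import pipeline, set_seed

set_seed(42)  # make the sampling reproducible
generator = pipeline("text-generation", model="gpt2")

prompt = "Hate message: 'Group X is ruining this country.'\nCounter-narrative:"
candidates = generator(
    prompt,
    max_new_tokens=60,
    num_return_sequences=3,  # several silver candidates per message
    do_sample=True,
    top_p=0.9,
)
for c in candidates:
    print(c["generated_text"])
```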
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.