Analyzing the Influence of Language Model-Generated Responses in
Mitigating Hate Speech on Social Media Directed at Ukrainian Refugees in
Poland
- URL: http://arxiv.org/abs/2311.16905v1
- Date: Tue, 28 Nov 2023 16:08:42 GMT
- Authors: Jakub Podolak, Szymon Łukasik, Paweł Balawender, Jan Ossowski,
Katarzyna Bąkowicz, Piotr Sankowski
- Abstract summary: This study investigates the potential of employing responses generated by Large Language Models (LLMs) to counteract hate speech on social media.
The goal was to minimize the propagation of hate speech directed at Ukrainian refugees in Poland.
The results indicate that deploying LLM-generated responses as replies to harmful tweets effectively diminishes user engagement, as measured by likes/impressions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the context of escalating hate speech and polarization on social media,
this study investigates the potential of employing responses generated by Large
Language Models (LLMs), complemented with pertinent verified knowledge links, to
counteract such trends. Through extensive A/B testing involving the posting of
753 automatically generated responses, the goal was to minimize the propagation
of hate speech directed at Ukrainian refugees in Poland.
The results indicate that deploying LLM-generated responses as replies to
harmful tweets effectively diminishes user engagement, as measured by
likes/impressions. When we respond to an original tweet, i.e., one that is not
itself a reply, we reduce user engagement by over 20% without increasing the
number of impressions. On the other hand, our responses increase the ratio of
replies to impressions for a harmful tweet, especially if the
harmful tweet is not original. Additionally, the study examines how generated
responses influence the overall sentiment of tweets in the discussion,
revealing that our intervention does not significantly alter the mean
sentiment.
This paper suggests the implementation of an automatic moderation system to
combat hate speech on social media and provides an in-depth analysis of the A/B
experiment, covering methodology, data collection, and statistical outcomes.
Ethical considerations and challenges are also discussed, offering guidance for
the development of discourse moderation systems leveraging the capabilities of
generative AI.
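The abstract's core metric (engagement measured as likes per impression, compared between harmful tweets that received an LLM-generated reply and those that did not) can be sketched as a simple permutation test. The records, numbers, and helper names below are hypothetical illustrations, not the paper's actual data or code:

```python
import random

# Hypothetical engagement records: (likes, impressions) per harmful tweet.
# Values are illustrative only, not data from the study.
control = [(12, 400), (30, 900), (5, 150), (22, 610), (9, 300)]
treated = [(7, 410), (14, 880), (3, 160), (11, 620), (5, 290)]  # received an LLM reply

def mean_engagement(records):
    """Mean likes/impressions ratio across tweets."""
    return sum(likes / imps for likes, imps in records) / len(records)

def permutation_test(a, b, trials=10_000, seed=0):
    """Two-sided permutation test on the difference in mean engagement ratio."""
    rng = random.Random(seed)
    observed = abs(mean_engagement(a) - mean_engagement(b))
    pooled = a + b
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean_engagement(perm_a) - mean_engagement(perm_b)) >= observed:
            hits += 1
    return hits / trials

drop = 1 - mean_engagement(treated) / mean_engagement(control)
print(f"relative engagement drop: {drop:.1%}")
print(f"permutation p-value: {permutation_test(control, treated):.3f}")
```

With real data, the same comparison would be run separately for original tweets and replies, mirroring the paper's split, and the replies-to-impressions ratio would be tested the same way.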
Related papers
- Outcome-Constrained Large Language Models for Countering Hate Speech
Counterspeech that challenges or responds to hate speech has been seen as an alternative to mitigate the negative impact of hate speech and foster productive online communications.
Existing research focuses on the generation of counterspeech with certain linguistic attributes, such as being polite, informative, and intent-driven.
We first explore methods that utilize large language models (LLMs) to generate counterspeech constrained by potential conversation outcomes.
arXiv Detail & Related papers (2024-03-25T19:44:06Z)
- Decoding the Silent Majority: Inducing Belief Augmented Social Graph with Large Language Model for Response Forecasting
SocialSense is a framework that induces a belief-centered graph on top of an existing social network, along with graph-based propagation to capture social dynamics.
Our method surpasses existing state-of-the-art in experimental evaluations for both zero-shot and supervised settings.
arXiv Detail & Related papers (2023-10-20T06:17:02Z)
- Leveraging Implicit Feedback from Deployment Data in Dialogue
We study improving social conversational agents by learning from natural dialogue between users and a deployed model.
We leverage signals such as user response length, sentiment, and the reaction of future human utterances in the collected dialogue episodes.
arXiv Detail & Related papers (2023-07-26T11:34:53Z)
- Demonstrations of the Potential of AI-based Political Issue Polling
We develop a prompt engineering methodology for eliciting human-like survey responses from ChatGPT.
We execute large scale experiments, querying for thousands of simulated responses at a cost far lower than human surveys.
We find ChatGPT is effective at anticipating both the mean level and distribution of public opinion on a variety of policy issues.
But it is less successful at anticipating demographic-level differences.
arXiv Detail & Related papers (2023-07-10T12:17:15Z)
- SQuARe: A Large-Scale Dataset of Sensitive Questions and Acceptable Responses Created Through Human-Machine Collaboration
This dataset is a large-scale Korean dataset of 49k sensitive questions with 42k acceptable and 46k non-acceptable responses.
The dataset was constructed leveraging HyperCLOVA in a human-in-the-loop manner based on real news headlines.
arXiv Detail & Related papers (2023-05-28T11:51:20Z)
- Measuring the Effect of Influential Messages on Varying Personas
We present a new task, Response Forecasting on Personas for News Media, to estimate the response a persona might have upon seeing a news message.
The proposed task not only introduces personalization in the modeling but also predicts the sentiment polarity and intensity of each response.
This enables more accurate and comprehensive inference on the mental state of the persona.
arXiv Detail & Related papers (2023-05-25T21:01:00Z)
- AutoReply: Detecting Nonsense in Dialogue Introspectively with Discriminative Replies
We show that dialogue models can detect errors in their own messages introspectively, by calculating the likelihood of replies that are indicative of poor messages.
We first show that hand-crafted replies can be effective for the task of detecting nonsense in applications as complex as Diplomacy.
We find that AutoReply-generated replies outperform handcrafted replies and perform on par with carefully fine-tuned large supervised models.
arXiv Detail & Related papers (2022-11-22T22:31:34Z)
- Assessing the impact of contextual information in hate speech detection
We provide a novel corpus for contextualized hate speech detection based on user responses to news posts from media outlets on Twitter.
This corpus was collected in the Rioplatense dialectal variety of Spanish and focuses on hate speech associated with the COVID-19 pandemic.
arXiv Detail & Related papers (2022-10-02T09:04:47Z)
- "Stop Asian Hate!": Refining Detection of Anti-Asian Hate Speech During the COVID-19 Pandemic
The COVID-19 pandemic has fueled a surge in anti-Asian xenophobia and prejudice.
We create and annotate a corpus of Twitter tweets using two experimental approaches to explore anti-Asian abusive and hate speech.
arXiv Detail & Related papers (2021-12-04T06:55:19Z)
- News consumption and social media regulations policy
We analyze two social media that enforced opposite moderation methods, Twitter and Gab, to assess the interplay between news consumption and content regulation.
Our results show that the presence of moderation pursued by Twitter produces a significant reduction of questionable content.
The lack of clear regulation on Gab results in users engaging with both types of content, with a slight preference for questionable content, which may reflect a dissing/endorsement behavior.
arXiv Detail & Related papers (2021-06-07T19:26:32Z)
- Generating Counter Narratives against Online Hate Speech: Data and Strategies
We present a study on how to collect responses to hate effectively.
We employ large-scale unsupervised language models such as GPT-2 for the generation of silver data.
The best annotation strategies/neural architectures can be used for data filtering before expert validation/post-editing.
arXiv Detail & Related papers (2020-04-08T19:35:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided (including all information) and is not responsible for any consequences of its use.