Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering
- URL: http://arxiv.org/abs/2410.03466v1
- Date: Fri, 4 Oct 2024 14:31:37 GMT
- Title: Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering
- Authors: Helena Bonaldi, Greta Damo, Nicolás Benjamín Ocampo, Elena Cabrio, Serena Villata, Marco Guerini
- Abstract summary: We focus on two aspects of counterspeech generation to produce more cogent responses.
First, we test whether the presence of safety guardrails hinders the quality of the generations.
Secondly, we assess whether attacking a specific component of the hate speech results in a more effective argumentative strategy to fight online hate.
- Score: 22.594296353433855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The potential effectiveness of counterspeech as a hate speech mitigation strategy is attracting increasing interest in the NLG research community, particularly towards the task of automatically producing it. However, automatically generated responses often lack the argumentative richness which characterises expert-produced counterspeech. In this work, we focus on two aspects of counterspeech generation to produce more cogent responses. First, by investigating the tension between helpfulness and harmlessness of LLMs, we test whether the presence of safety guardrails hinders the quality of the generations. Secondly, we assess whether attacking a specific component of the hate speech results in a more effective argumentative strategy to fight online hate. By conducting an extensive human and automatic evaluation, we show how the presence of safety guardrails can be detrimental even to a task that inherently aims at fostering positive social interactions. Moreover, our results show that attacking a specific component of the hate speech, and in particular its implicit negative stereotype and its hateful parts, leads to higher-quality generations.
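To make the two experimental axes concrete, here is a minimal sketch (not the authors' code): it contrasts counterspeech generation with and without a safety-oriented system prompt, and optionally instructs the model to attack one component of the hate speech, such as its implied stereotype. The OpenAI-compatible client, the model name, and the prompt wording are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming an OpenAI-compatible chat API; prompts and model name are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an endpoint and API key are configured

GUARDED_SYSTEM = (
    "You are a careful assistant. Refuse to produce harmful, toxic, or "
    "offensive content under any circumstances."
)
UNGUARDED_SYSTEM = "You are an assistant that writes counterspeech replies to online hate."


def generate_counterspeech(hate_speech: str, guarded: bool, target: str | None = None) -> str:
    """Generate one counterspeech reply; `target` names the hate speech component to attack."""
    instruction = f'Write a brief, respectful counterspeech reply to: "{hate_speech}"'
    if target:
        instruction += f"\nFocus your argument on countering its {target}."
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model, not one evaluated in the paper
        messages=[
            {"role": "system", "content": GUARDED_SYSTEM if guarded else UNGUARDED_SYSTEM},
            {"role": "user", "content": instruction},
        ],
    )
    return response.choices[0].message.content


# Same hateful message under four conditions, mirroring the two axes studied
hate_speech = "<hate speech example>"
for guarded in (True, False):
    for target in (None, "implicit negative stereotype"):
        reply = generate_counterspeech(hate_speech, guarded, target)
        print(f"guarded={guarded}, target={target}:\n{reply}\n")
```

In the paper's terms, the guarded/unguarded conditions stand in for models with and without safety guardrails, while the `target` argument mirrors the component-attacking strategies the abstract mentions (implicit negative stereotype, hateful parts).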
Related papers
- Generative AI may backfire for counterspeech [20.57872238271025]
We analyze whether contextualized counterspeech generated by state-of-the-art AI is effective in curbing online hate speech.
We find that non-contextualized counterspeech employing a warning-of-consequence strategy significantly reduces online hate speech.
However, contextualized counterspeech generated by LLMs proves ineffective and may even backfire.
arXiv Detail & Related papers (2024-11-22T14:47:00Z)
- Assessing the Human Likeness of AI-Generated Counterspeech [10.434435022492723]
Counterspeech is a targeted response to counteract and challenge abusive or hateful content.
Previous studies have proposed different strategies for automatically generated counterspeech.
We investigate the human likeness of AI-generated counterspeech, a critical factor influencing effectiveness.
arXiv Detail & Related papers (2024-10-14T18:48:47Z)
- Decoding Hate: Exploring Language Models' Reactions to Hate Speech [2.433983268807517]
This paper investigates the reactions of seven state-of-the-art Large Language Models to hate speech.
We reveal the spectrum of responses these models produce, highlighting their capacity to handle hate speech inputs.
We also discuss strategies to mitigate hate speech generation by LLMs, particularly through fine-tuning and guideline guardrailing.
arXiv Detail & Related papers (2024-10-01T15:16:20Z)
- SWE2: SubWord Enriched and Significant Word Emphasized Framework for Hate Speech Detection [3.0460060805145517]
We propose a novel hate speech detection framework called SWE2, which only relies on the content of messages and automatically identifies hate speech.
Experimental results show that our proposed model achieves 0.975 accuracy and 0.953 macro F1, outperforming 7 state-of-the-art baselines.
arXiv Detail & Related papers (2024-09-25T07:05:44Z)
- Outcome-Constrained Large Language Models for Countering Hate Speech [10.434435022492723]
This study aims to develop methods for generating counterspeech constrained by conversation outcomes.
We experiment with large language models (LLMs) to incorporate into the text generation process two desired conversation outcomes.
Evaluation results show that our methods effectively steer the generation of counterspeech toward the desired outcomes.
arXiv Detail & Related papers (2024-03-25T19:44:06Z)
- RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models [62.72318564072706]
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences.
Despite its advantages, RLHF relies on human annotators to rank the text.
We propose RankPoison, a poisoning attack that selects candidates for preference-rank flipping in order to elicit certain malicious behaviors.
arXiv Detail & Related papers (2023-11-16T07:48:45Z)
- CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a Context Synergized Hyperbolic Network [52.85130555886915]
CoSyn is a context-synergized neural network that explicitly incorporates user- and conversational context for detecting implicit hate speech in online conversations.
We show that CoSyn outperforms all our baselines in detecting implicit hate speech with absolute improvements in the range of 1.24% - 57.8%.
arXiv Detail & Related papers (2023-03-02T17:30:43Z)
- Measuring Equality in Machine Learning Security Defenses: A Case Study in Speech Recognition [56.69875958980474]
This work considers approaches to defending learned systems and how security defenses result in performance inequities across different sub-populations.
We find that many methods that have been proposed can cause direct harm, like false rejection and unequal benefits from robustness training.
We present a comparison of equality between two rejection-based defenses: randomized smoothing and neural rejection, finding randomized smoothing more equitable due to the sampling mechanism for minority groups.
arXiv Detail & Related papers (2023-02-17T16:19:26Z)
- Characterizing the adversarial vulnerability of speech self-supervised learning [95.03389072594243]
We make the first attempt to investigate the adversarial vulnerability of such a paradigm under attacks from both zero-knowledge and limited-knowledge adversaries.
The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries.
arXiv Detail & Related papers (2021-11-08T08:44:04Z)
- Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto most promising defense against adversarial examples.
Yet, its passive nature inevitably prevents it from being immune to unknown attackers.
We provide a causal viewpoint of adversarial vulnerability: the cause is the confounder that is ubiquitous in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z)
- Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning [95.60856995067083]
This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms.
We propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection.
Experimental results show that our detection module effectively shields the ASV by detecting adversarial samples with an accuracy of around 80%.
arXiv Detail & Related papers (2021-06-01T07:10:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.