Analyzing Bias in False Refusal Behavior of Large Language Models for Hate Speech Detoxification
- URL: http://arxiv.org/abs/2601.08668v1
- Date: Tue, 13 Jan 2026 15:45:31 GMT
- Title: Analyzing Bias in False Refusal Behavior of Large Language Models for Hate Speech Detoxification
- Authors: Kyuri Im, Shuzhou Yuan, Michael Färber
- Abstract summary: We investigate false refusal behavior in hate speech detoxification. We show that large language models (LLMs) disproportionately refuse inputs with higher semantic toxicity. We propose a simple cross-translation strategy, translating English hate speech into Chinese for detoxification and back.
- Score: 7.696781721646013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While large language models (LLMs) have increasingly been applied to hate speech detoxification, the prompts often trigger safety alerts, causing LLMs to refuse the task. In this study, we systematically investigate false refusal behavior in hate speech detoxification and analyze the contextual and linguistic biases that trigger such refusals. We evaluate nine LLMs on both English and multilingual datasets, and our results show that LLMs disproportionately refuse inputs with higher semantic toxicity and those targeting specific groups, particularly nationality, religion, and political ideology. Although multilingual datasets exhibit lower overall false refusal rates than English datasets, models still display systematic, language-dependent biases toward certain targets. Based on these findings, we propose a simple cross-translation strategy, translating English hate speech into Chinese for detoxification and back, which substantially reduces false refusals while preserving the original content, providing an effective and lightweight mitigation approach.
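A minimal sketch of the cross-translation strategy described in the abstract: translate the English input into Chinese, request a detoxified rewrite there (where refusals are reported to be rarer), and translate the result back. The `chat` helper, the prompt wording, and the keyword-based refusal check are illustrative assumptions, not the authors' implementation.

```python
def chat(prompt: str) -> str:
    """Placeholder for a call to any instruction-tuned LLM (assumption)."""
    raise NotImplementedError("Wire this to your LLM client of choice.")


# Crude refusal detector (assumption; the paper may use a different criterion).
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")


def is_refusal(response: str) -> bool:
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def detoxify_via_chinese(english_text: str) -> str:
    # Step 1: translate the English hate speech into Chinese.
    zh = chat(f"Translate the following text into Chinese:\n{english_text}")
    # Step 2: ask for a non-toxic rewrite in Chinese.
    zh_detox = chat(
        "Rewrite the following Chinese sentence so that it is polite and "
        f"non-toxic while preserving its meaning:\n{zh}"
    )
    # Step 3: translate the detoxified Chinese back into English.
    return chat(f"Translate the following text into English:\n{zh_detox}")


def detoxify(english_text: str) -> str:
    """Try direct detoxification first; fall back to cross-translation on refusal."""
    direct = chat(
        "Rewrite the following sentence so that it is polite and non-toxic "
        f"while preserving its meaning:\n{english_text}"
    )
    return detoxify_via_chinese(english_text) if is_refusal(direct) else direct
```

In this sketch the fallback only fires when the direct English prompt is refused, which keeps the extra translation cost limited to the problematic inputs.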
Related papers
- Controlling Language Confusion in Multilingual LLMs [0.0]
Large language models often suffer from language confusion, a phenomenon in which responses are partially or entirely generated in unintended languages. In this work, we apply ORPO, which adds penalties for unwanted output styles to standard SFT, effectively suppressing language-confused generations.
arXiv Detail & Related papers (2025-05-25T12:15:31Z) - Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs). We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z) - Refusal Direction is Universal Across Safety-Aligned Languages [66.64709923081745]
In this paper, we investigate the refusal behavior in large language models (LLMs) across 14 languages using PolyRefuse. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks.
arXiv Detail & Related papers (2025-05-22T21:54:46Z) - When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners [111.50503126693444]
We show that language-specific ablation consistently boosts multilingual reasoning performance. Compared to post-training, our training-free ablation achieves comparable or superior results with minimal computational overhead.
arXiv Detail & Related papers (2025-05-21T08:35:05Z) - Assessing Agentic Large Language Models in Multilingual National Bias [31.67058518564021]
Cross-language disparities in reasoning-based recommendations remain largely unexplored. This study is the first to address this gap. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages.
arXiv Detail & Related papers (2025-02-25T08:07:42Z) - Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation [6.781972039785424]
Recent generative large language models (LLMs) show remarkable performance in non-English languages. However, when prompted in those languages, they tend to express higher harmful social biases and toxicity levels. We investigate the impact of different finetuning methods on the model's bias and toxicity, but also on its ability to produce fluent and diverse text.
arXiv Detail & Related papers (2024-12-18T17:05:08Z) - Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We find that Llama Instruct and Mistral models exhibit high degrees of language confusion. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z) - Language models are not naysayers: An analysis of language models on negation benchmarks [58.32362243122714]
We evaluate the ability of current-generation auto-regressive language models to handle negation.
We show that LLMs have several limitations, including insensitivity to the presence of negation, an inability to capture the lexical semantics of negation, and a failure to reason under negation.
arXiv Detail & Related papers (2023-06-14T01:16:37Z) - Model-Agnostic Meta-Learning for Multilingual Hate Speech Detection [23.97444551607624]
Hate speech in social media is a growing phenomenon, and detecting such toxic content has gained significant traction.
HateMAML is a model-agnostic meta-learning-based framework that effectively performs hate speech detection in low-resource languages.
Extensive experiments are conducted on five datasets across eight different low-resource languages.
arXiv Detail & Related papers (2023-03-04T22:28:29Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)