JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks
- URL: http://arxiv.org/abs/2603.01291v1
- Date: Sun, 01 Mar 2026 21:50:03 GMT
- Title: JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks
- Authors: Masahiro Kaneko, Ayana Niwa, Timothy Baldwin,
- Abstract summary: JailNewsBench is the first benchmark for evaluating robustness against jailbreak-induced fake news generation. For English and U.S.-related topics, the defensive performance of typical multi-lingual LLMs was significantly lower than for other regions. Our analysis shows that coverage of fake news in existing safety datasets is limited and less well defended than major categories such as toxicity and social bias.
- Score: 44.09742593777696
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fake news undermines societal trust and decision-making across politics, economics, health, and international relations, and in extreme cases threatens human lives and societal safety. Because fake news reflects region-specific political, social, and cultural contexts and is expressed in the languages of those regions, evaluating the risks of large language models (LLMs) requires a multi-lingual and regional perspective. Malicious users can bypass safeguards through jailbreak attacks, inducing LLMs to generate fake news. However, no benchmark currently exists to systematically assess attack resilience across languages and regions. Here, we propose JailNewsBench, the first benchmark for evaluating LLM robustness against jailbreak-induced fake news generation. JailNewsBench spans 34 regions and 22 languages, covering 8 evaluation sub-metrics through LLM-as-a-Judge and 5 jailbreak attacks, with approximately 300k instances. Our evaluation of 9 LLMs reveals that the maximum attack success rate (ASR) reached 86.3% and the maximum harmfulness score was 3.5 out of 5. Notably, for English and U.S.-related topics, the defensive performance of typical multi-lingual LLMs was significantly lower than for other regions, highlighting substantial imbalances in safety across languages and regions. In addition, our analysis shows that coverage of fake news in existing safety datasets is limited and less well defended than major categories such as toxicity and social bias. Our dataset and code are available at https://github.com/kanekomasahiro/jail_news_bench.
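A worked example may help make the two headline metrics concrete. The following is a minimal sketch, not the authors' released code (that lives at the GitHub link above), of how attack success rate (ASR) and the 1-5 harmfulness score could be aggregated per language/region bucket from LLM-as-a-Judge verdicts; the `JudgedInstance` schema and its field names are assumptions for illustration.

```python
# Hypothetical aggregation of per-instance judge verdicts into the
# ASR and mean-harmfulness numbers reported in the abstract.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class JudgedInstance:          # assumed schema, not the paper's actual format
    language: str              # e.g. "en", "ja"
    region: str                # e.g. "US", "JP"
    attack: str                # which of the 5 jailbreak attacks was applied
    attack_succeeded: bool     # judge verdict: model produced the fake news
    harmfulness: int           # judge score on the paper's 1-5 scale

def aggregate(instances: list[JudgedInstance]) -> dict:
    """Group instances by (language, region) and report ASR and mean harm."""
    buckets = defaultdict(list)
    for inst in instances:
        buckets[(inst.language, inst.region)].append(inst)
    report = {}
    for key, group in sorted(buckets.items()):
        report[key] = {
            "n": len(group),
            "asr": sum(i.attack_succeeded for i in group) / len(group),
            "mean_harmfulness": sum(i.harmfulness for i in group) / len(group),
        }
    return report
```

Under this framing, the abstract's maximum ASR of 86.3% would be the worst cell once results are further split by model and attack type.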
Related papers
- Do Methods to Jailbreak and Defend LLMs Generalize Across Languages? [11.718639745472224]
This paper presents the first systematic multilingual evaluation of jailbreaks and defenses across ten languages. We assess two jailbreak types: logical-expression-based and adversarial-prompt-based. Simple defenses can be effective, but are language- and model-dependent.
arXiv Detail & Related papers (2025-11-01T20:12:19Z)
- Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages [57.059267233093465]
Large Language Models (LLMs) have transformed natural language processing, but their safety mechanisms remain under-explored in low-resource, multilingual settings. We introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails.
arXiv Detail & Related papers (2025-09-18T08:14:34Z)
- Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models [0.0]
We show how surprisingly strong attacks can be created by altering just a few characters and using a small proxy model for word importance calculation. We find that these character- and word-level attacks drastically alter the predictions of different LLMs. We validate our attack construction methodology on Polish, a low-resource language, and find potential vulnerabilities of LLMs in this language. (A minimal sketch of this attack pattern appears after the list below.)
arXiv Detail & Related papers (2025-06-09T11:09:39Z)
- SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks [90.41592442792181]
We propose SafeDialBench, a fine-grained benchmark for evaluating the safety of Large Language Models (LLMs) in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions, and we generate more than 4000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. Notably, we construct an innovative assessment framework for LLMs that measures their ability to detect and handle unsafe information and to maintain consistency when facing jailbreak attacks.
arXiv Detail & Related papers (2025-02-16T12:08:08Z)
- LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Inconsistencies [63.10843814055688]
M-ALERT is a benchmark that evaluates the safety of Large Language Models in five languages. M-ALERT includes 15k high-quality prompts per language, totaling 75k, with category-wise annotations. Our experiments on 39 state-of-the-art LLMs highlight the importance of language-specific safety analysis.
arXiv Detail & Related papers (2024-12-19T16:46:54Z)
- Benchmarking LLM Guardrails in Handling Multilingual Toxicity [57.296161186129545]
We introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails.
We investigate the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrails' performance.
Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts.
arXiv Detail & Related papers (2024-10-29T15:51:24Z)
- Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs).
We consider two potentially risky scenarios: unintentional and intentional.
We propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z)
- Low-Resource Languages Jailbreak GPT-4 [19.97929171158234]
Our work exposes the inherent cross-lingual vulnerability of AI safety training and red-teaming of large language models (LLMs).
On the AdvBench benchmark, GPT-4 engages with unsafe translated inputs and provides actionable items that can guide users towards their harmful goals 79% of the time.
Other high-/mid-resource languages have significantly lower attack success rates, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages.
arXiv Detail & Related papers (2023-10-03T21:30:56Z)
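As noted in the proxy-model entry above, the attack it describes is simple enough to sketch. The following is a minimal, hypothetical rendering of the idea, not the paper's implementation: score each word's importance by how much a small proxy model's confidence drops when that word is masked, then introduce character-level typos into the most important words. `proxy_score` is an assumed stand-in for the proxy model's confidence on the target label.

```python
# Hypothetical character-level attack guided by proxy-model word importance.
import random

def word_importance(text: str, proxy_score) -> list[tuple[int, float]]:
    """Importance of word i = score(text) - score(text with word i masked)."""
    words = text.split()
    base = proxy_score(text)
    drops = []
    for i in range(len(words)):
        masked = " ".join(words[:i] + ["[MASK]"] + words[i + 1:])
        drops.append((i, base - proxy_score(masked)))
    return sorted(drops, key=lambda pair: pair[1], reverse=True)

def perturb(text: str, proxy_score, k: int = 3, seed: int = 0) -> str:
    """Transpose two adjacent characters in each of the k most important words."""
    rng = random.Random(seed)
    words = text.split()
    for i, _ in word_importance(text, proxy_score)[:k]:
        w = words[i]
        if len(w) > 1:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]  # swap chars j, j+1
    return " ".join(words)
```

The transposition here is just one choice of perturbation; the paper's actual edit set (substitutions, deletions, diacritic changes, etc.) may differ.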