Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models
- URL: http://arxiv.org/abs/2408.04522v1
- Date: Thu, 8 Aug 2024 15:24:03 GMT
- Title: Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models
- Authors: Fabio Pernisi, Dirk Hovy, Paul Röttger
- Abstract summary: We investigate, in Italian, the effectiveness of many-shot jailbreaking, where models are prompted with unsafe demonstrations to induce unsafe behaviour.
We find that the models exhibit unsafe behaviors even when prompted with few unsafe demonstrations, and -- more alarmingly -- that this tendency rapidly escalates with more demonstrations.
- Score: 23.522660090382832
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As diverse linguistic communities and users adopt large language models (LLMs), assessing their safety across languages becomes critical. Despite ongoing efforts to make LLMs safe, they can still be made to behave unsafely with jailbreaking, a technique in which models are prompted to act outside their operational guidelines. Research on LLM safety and jailbreaking, however, has so far mostly focused on English, limiting our understanding of LLM safety in other languages. We contribute towards closing this gap by investigating the effectiveness of many-shot jailbreaking, where models are prompted with unsafe demonstrations to induce unsafe behaviour, in Italian. To enable our analysis, we create a new dataset of unsafe Italian question-answer pairs. With this dataset, we identify clear safety vulnerabilities in four families of open-weight LLMs. We find that the models exhibit unsafe behaviors even when prompted with few unsafe demonstrations, and -- more alarmingly -- that this tendency rapidly escalates with more demonstrations.
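To make the evaluation setup concrete, below is a minimal sketch of how a many-shot prompt could be assembled from a demonstration pool and scored across increasing shot counts. It is an illustrative reconstruction, not the authors' released code: `build_many_shot_prompt`, `unsafe_response_rate`, the Italian "Domanda/Risposta" template, and the `generate`/`is_unsafe` callables are assumptions standing in for the paper's actual dataset, models, and annotation procedure, and no unsafe demonstration content is reproduced here.

```python
# Minimal sketch (assumed, not the authors' code) of many-shot prompt assembly
# and unsafe-response-rate measurement. All names are placeholders.
from typing import Callable, List, Tuple


def build_many_shot_prompt(demonstrations: List[Tuple[str, str]],
                           query: str,
                           k: int) -> str:
    """Concatenate the first k question-answer demonstrations before the query."""
    shots = demonstrations[:k]
    blocks = [f"Domanda: {q}\nRisposta: {a}" for q, a in shots]
    blocks.append(f"Domanda: {query}\nRisposta:")
    return "\n\n".join(blocks)


def unsafe_response_rate(generate: Callable[[str], str],
                         is_unsafe: Callable[[str], bool],
                         demonstrations: List[Tuple[str, str]],
                         queries: List[str],
                         k: int) -> float:
    """Fraction of queries whose response is flagged as unsafe at k shots."""
    flagged = 0
    for query in queries:
        prompt = build_many_shot_prompt(demonstrations, query, k)
        if is_unsafe(generate(prompt)):
            flagged += 1
    return flagged / len(queries) if queries else 0.0


# Example sweep over shot counts to observe how unsafe behaviour escalates
# with more demonstrations, as the paper reports. `model_generate`,
# `safety_classifier`, `demonstrations`, and `evaluation_queries` are
# hypothetical stand-ins for a real model API, an unsafe-response annotator,
# and the paper's Italian dataset.
# for k in (0, 1, 2, 4, 8, 16, 32):
#     rate = unsafe_response_rate(model_generate, safety_classifier,
#                                 demonstrations, evaluation_queries, k)
#     print(f"{k:>2} shots -> unsafe response rate {rate:.2%}")
```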
Related papers
- Playing Language Game with LLMs Leads to Jailbreaking [18.63358696510664]
We introduce two novel jailbreak methods based on mismatched language games and custom language games.
We demonstrate the effectiveness of our methods, achieving success rates of 93% on GPT-4o, 89% on GPT-4o-mini and 83% on Claude-3.5-Sonnet.
arXiv Detail & Related papers (2024-11-16T13:07:13Z) - Diversity Helps Jailbreak Large Language Models [16.34618038553998]
We have uncovered a powerful jailbreak technique that leverages large language models' ability to diverge from prior context.
By simply instructing the LLM to deviate from and obfuscate previous attacks, our method dramatically outperforms existing approaches.
This revelation exposes a critical flaw in current LLM safety training, suggesting that existing methods may merely mask vulnerabilities rather than eliminate them.
arXiv Detail & Related papers (2024-11-06T19:39:48Z) - Benchmarking LLM Guardrails in Handling Multilingual Toxicity [57.296161186129545]
We introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails.
We investigate the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrails' performance.
Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts.
arXiv Detail & Related papers (2024-10-29T15:51:24Z) - Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks [18.208272960774337]
Large Language Models (LLMs) have sparked widespread concerns about their safety.
Recent work demonstrates that safety alignment of LLMs can be easily removed by fine-tuning.
We take a further step to understand fine-tuning attacks in multilingual LLMs.
arXiv Detail & Related papers (2024-10-23T18:27:36Z) - How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States [65.45603614354329]
Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs.
Jailbreak can circumvent safety guardrails, resulting in LLMs generating harmful content.
We employ weak classifiers to explain LLM safety through the intermediate hidden states.
arXiv Detail & Related papers (2024-06-09T05:04:37Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs).
We consider two potential risky scenarios: unintentional and intentional.
We propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z) - Low-Resource Languages Jailbreak GPT-4 [19.97929171158234]
Our work exposes the inherent cross-lingual vulnerability of AI safety training and red-teaming of large language models (LLMs).
On the AdvBench benchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can move users towards their harmful goals 79% of the time.
Other high-/mid-resource languages have significantly lower attack success rates, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages.
arXiv Detail & Related papers (2023-10-03T21:30:56Z) - All Languages Matter: On the Multilingual Safety of Large Language Models [96.47607891042523]
We build the first multilingual safety benchmark for large language models (LLMs).
XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families.
We propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT.
arXiv Detail & Related papers (2023-10-02T05:23:34Z) - Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts including 100k augmented prompts and responses by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)