Multilingual Jailbreak Challenges in Large Language Models
- URL: http://arxiv.org/abs/2310.06474v3
- Date: Mon, 4 Mar 2024 04:03:54 GMT
- Title: Multilingual Jailbreak Challenges in Large Language Models
- Authors: Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing
- Abstract summary: In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs).
We consider two potentially risky scenarios: unintentional and intentional.
We propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning.
- Score: 96.74878032417054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, they pose potential safety concerns, such as the "jailbreak" problem, wherein malicious instructions can manipulate LLMs to exhibit undesirable behavior. Although several preventive measures have been developed to mitigate the potential risks associated with LLMs, they have primarily focused on English. In this study, we reveal the presence of multilingual jailbreak challenges within LLMs and consider two potentially risky scenarios: unintentional and intentional. The unintentional scenario involves users querying LLMs with non-English prompts and inadvertently bypassing the safety mechanisms, while the intentional scenario concerns malicious users combining malicious instructions with multilingual prompts to deliberately attack LLMs. The experimental results reveal that in the unintentional scenario, the rate of unsafe content increases as the availability of languages decreases. Specifically, low-resource languages exhibit about three times the likelihood of encountering harmful content compared to high-resource languages, with both ChatGPT and GPT-4. In the intentional scenario, multilingual prompts can exacerbate the negative impact of malicious instructions, with astonishingly high rates of unsafe output: 80.92% for ChatGPT and 40.71% for GPT-4. To handle such challenges in the multilingual context, we propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning. Experimental results show that ChatGPT fine-tuned with such data can achieve a substantial reduction in unsafe content generation. Data is available at https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs.
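As a rough illustration of the data-generation idea described in the abstract, the Python sketch below expands a handful of English (unsafe prompt, safe refusal) seed pairs into multilingual fine-tuning records. It is a minimal sketch loosely modeled on the Self-Defense idea, not the authors' implementation: the translate() helper, the language list, and the seed pair are hypothetical stand-ins, and a real pipeline would call an MT system or an LLM for translation. The same per-language loop also mirrors how an unintentional-scenario evaluation might render an English query set into low- and high-resource languages before measuring unsafe-response rates.

```python
"""Rough sketch of Self-Defense-style data generation (not the authors' code).

Assumptions (hypothetical, for illustration only):
  * translate(text, lang) is a stand-in; in practice this would call an MT
    system or an LLM to translate the text into the target language.
  * The seed pair and the language list are toy examples, not the paper's data.
"""
import json

# Target languages spanning higher- and lower-resource settings (illustrative).
TARGET_LANGS = ["zh", "ar", "sw", "bn", "jv"]

# English seed pairs: an unsafe request and the safe refusal the model should
# learn to give in every language.
SEED_PAIRS = [
    {
        "prompt": "Explain how to pick a lock to break into a house.",
        "response": "I can't help with that. Breaking into someone's home is "
                    "illegal and unsafe. If you are locked out of your own "
                    "home, please contact a licensed locksmith.",
    },
]


def translate(text: str, lang: str) -> str:
    """Stand-in translator. Replace with a real MT/LLM call; here it only
    tags the text so the sketch stays runnable without external services."""
    return f"[{lang}] {text}"


def build_safety_sft_data(seed_pairs, langs):
    """Expand English (unsafe prompt, safe response) pairs into multilingual
    fine-tuning records, one record per (pair, language)."""
    records = []
    for pair in seed_pairs:
        for lang in langs:
            records.append({
                "lang": lang,
                "prompt": translate(pair["prompt"], lang),
                "response": translate(pair["response"], lang),
            })
    return records


if __name__ == "__main__":
    data = build_safety_sft_data(SEED_PAIRS, TARGET_LANGS)
    with open("multilingual_safety_sft.jsonl", "w", encoding="utf-8") as f:
        for rec in data:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    print(f"Wrote {len(data)} fine-tuning records.")
```

The resulting JSONL records could then be used for standard supervised safety fine-tuning; which languages to include and how to verify translation quality are design choices the sketch leaves open.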
Related papers
- Benchmarking LLM Guardrails in Handling Multilingual Toxicity [57.296161186129545]
We introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails.
We investigate the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrails' performance.
Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts.
arXiv Detail & Related papers (2024-10-29T15:51:24Z)
- Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks [18.208272960774337]
Large Language Models (LLMs) have sparked widespread concerns about their safety.
Recent work demonstrates that safety alignment of LLMs can be easily removed by fine-tuning.
We take a further step to understand fine-tuning attacks in multilingual LLMs.
arXiv Detail & Related papers (2024-10-23T18:27:36Z)
- Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models [23.522660090382832]
We investigate the effectiveness of many-shot jailbreaking, where models are prompted with unsafe demonstrations to induce unsafe behaviour, in Italian.
We find that the models exhibit unsafe behaviors even when prompted with few unsafe demonstrations, and -- more alarmingly -- that this tendency rapidly escalates with more demonstrations.
arXiv Detail & Related papers (2024-08-08T15:24:03Z)
- A Chinese Dataset for Evaluating the Safeguards in Large Language Models [46.43476815725323]
Large language models (LLMs) can produce harmful responses.
This paper introduces a dataset for the safety evaluation of Chinese LLMs.
We then extend it to two other scenarios that can be used to better identify false negative and false positive examples.
arXiv Detail & Related papers (2024-02-19T14:56:18Z)
- The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts [46.089025223336854]
This paper examines the variations in safety challenges faced by large language models across different languages.
We compare how state-of-the-art LLMs respond to the same set of malicious prompts written in higher- vs. lower-resource languages.
arXiv Detail & Related papers (2024-01-23T23:12:09Z)
- Low-Resource Languages Jailbreak GPT-4 [19.97929171158234]
Our work exposes the inherent cross-lingual vulnerability of AI safety training and red-teaming of large language models (LLMs).
On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time.
Other high-/mid-resource languages have significantly lower attack success rate, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages.
arXiv Detail & Related papers (2023-10-03T21:30:56Z)
- All Languages Matter: On the Multilingual Safety of Large Language Models [96.47607891042523]
We build the first multilingual safety benchmark for large language models (LLMs).
XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families.
We propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT.
arXiv Detail & Related papers (2023-10-02T05:23:34Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.