The Language Barrier: Dissecting Safety Challenges of LLMs in
Multilingual Contexts
- URL: http://arxiv.org/abs/2401.13136v1
- Date: Tue, 23 Jan 2024 23:12:09 GMT
- Title: The Language Barrier: Dissecting Safety Challenges of LLMs in
Multilingual Contexts
- Authors: Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang,
Haoran Xu, Boyuan Zheng, Philipp Koehn, Daniel Khashabi
- Abstract summary: This paper examines the variations in safety challenges faced by large language models across different languages.
We compare how state-of-the-art LLMs respond to the same set of malicious prompts written in higher- vs. lower-resource languages.
- Score: 46.089025223336854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As the influence of large language models (LLMs) spans across global
communities, their safety challenges in multilingual settings become paramount
for alignment research. This paper examines the variations in safety challenges
faced by LLMs across different languages and discusses approaches to
alleviating such concerns. By comparing how state-of-the-art LLMs respond to
the same set of malicious prompts written in higher- vs. lower-resource
languages, we observe that (1) LLMs tend to generate unsafe responses much more
often when a malicious prompt is written in a lower-resource language, and (2)
LLMs tend to generate more irrelevant responses to malicious prompts in
lower-resource languages. To understand where the discrepancy can be
attributed, we study the effect of instruction tuning with reinforcement
learning from human feedback (RLHF) or supervised finetuning (SFT) on the
HH-RLHF dataset. Surprisingly, while training with high-resource languages
improves model alignment, training in lower-resource languages yields minimal
improvement. This suggests that the bottleneck of cross-lingual alignment is
rooted in the pretraining stage. Our findings highlight the challenges in
cross-lingual LLM safety, and we hope they inform future research in this
direction.
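To make the comparison concrete, the sketch below illustrates, in spirit only and not as the authors' released code, how one might measure unsafe-response rates for the same malicious prompt set rendered in higher- vs. lower-resource languages. The helpers query_model and is_unsafe are hypothetical placeholders for the LLM under test and a safety judge (e.g., a moderation classifier or human annotation), and the language codes are illustrative.

```python
# Hedged sketch of a cross-lingual safety evaluation loop, assuming the same
# malicious prompts have already been translated into each target language.
# `query_model` and `is_unsafe` are hypothetical stand-ins, not part of the paper.

LANGUAGES = ["en", "zh", "sw", "bn"]  # illustrative higher- and lower-resource languages


def query_model(prompt: str, lang: str) -> str:
    """Hypothetical wrapper around the LLM being evaluated."""
    raise NotImplementedError


def is_unsafe(response: str) -> bool:
    """Hypothetical safety judge (e.g., a moderation classifier or a human label)."""
    raise NotImplementedError


def unsafe_rates(prompts_by_lang: dict[str, list[str]]) -> dict[str, float]:
    """Return the fraction of unsafe responses per language for the same prompt set."""
    rates = {}
    for lang in LANGUAGES:
        prompts = prompts_by_lang[lang]
        flags = [is_unsafe(query_model(p, lang)) for p in prompts]
        rates[lang] = sum(flags) / max(len(flags), 1)
    return rates
```

Under the paper's findings, such a harness would be expected to show noticeably higher unsafe-response rates for lower-resource languages than for English; replacing the safety judge with a relevance judge would capture the second observation about irrelevant responses.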
Related papers
- Benchmarking LLM Guardrails in Handling Multilingual Toxicity [57.296161186129545]
We introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails.
We investigate the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrails' performance.
Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts.
arXiv Detail & Related papers (2024-10-29T15:51:24Z)
- Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary of the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even in few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z)
- Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks [18.208272960774337]
Large Language Models (LLMs) have sparked widespread concerns about their safety.
Recent work demonstrates that safety alignment of LLMs can be easily removed by fine-tuning.
We take a further step to understand fine-tuning attacks in multilingual LLMs.
arXiv Detail & Related papers (2024-10-23T18:27:36Z)
- Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture [6.17896401271963]
We introduce Multilingual Blending, a mixed-language query-response scheme designed to evaluate the safety alignment of various large language models.
We investigate language patterns such as language availability, morphology, and language family that could impact the effectiveness of Multilingual Blending.
arXiv Detail & Related papers (2024-07-10T03:26:15Z)
- A Chinese Dataset for Evaluating the Safeguards in Large Language Models [46.43476815725323]
Large language models (LLMs) can produce harmful responses.
This paper introduces a dataset for the safety evaluation of Chinese LLMs.
We then extend it to two other scenarios that can be used to better identify false negative and false positive examples.
arXiv Detail & Related papers (2024-02-19T14:56:18Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs).
We consider two potential risky scenarios: unintentional and intentional.
We propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations more fully reveal language models' proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.