LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps
- URL: http://arxiv.org/abs/2412.15035v1
- Date: Thu, 19 Dec 2024 16:46:54 GMT
- Title: LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps
- Authors: Felix Friedrich, Simone Tedeschi, Patrick Schramowski, Manuel Brack, Roberto Navigli, Huu Nguyen, Bo Li, Kristian Kersting
- Abstract summary: M-ALERT is a benchmark that evaluates the safety of Large Language Models in five languages: English, French, German, Italian, and Spanish.
M-ALERT includes 15k high-quality prompts per language, totaling 75k, following the detailed ALERT taxonomy.
- Score: 63.10843814055688
- Abstract: Building safe Large Language Models (LLMs) across multiple languages is essential in ensuring both safe access and linguistic diversity. To this end, we introduce M-ALERT, a multilingual benchmark that evaluates the safety of LLMs in five languages: English, French, German, Italian, and Spanish. M-ALERT includes 15k high-quality prompts per language, totaling 75k, following the detailed ALERT taxonomy. Our extensive experiments on 10 state-of-the-art LLMs highlight the importance of language-specific safety analysis, revealing that models often exhibit significant inconsistencies in safety across languages and categories. For instance, Llama3.2 shows high unsafety in the category crime_tax for Italian but remains safe in other languages. Similar differences can be observed across all models. In contrast, certain categories, such as substance_cannabis and crime_propaganda, consistently trigger unsafe responses across models and languages. These findings underscore the need for robust multilingual safety practices in LLMs to ensure safe and responsible usage across diverse user communities.
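For illustration, below is a minimal sketch of how per-language, per-category safety rates could be aggregated from judged model responses, in the spirit of the evaluation described above. This is not the authors' released tooling: the input file name, the record fields (language, category, safe), and the 0.9 flagging threshold are all assumptions made for the example.

```python
# Minimal sketch (assumed data format, not the official M-ALERT code):
# aggregate per-language, per-category safety rates from a hypothetical JSONL
# file in which each line holds a judged response with "language", "category",
# and a binary "safe" verdict from a safety judge.
import json
from collections import defaultdict

def safety_rates(path: str) -> dict:
    """Return {(language, category): fraction of responses judged safe}."""
    safe = defaultdict(int)
    total = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            key = (record["language"], record["category"])
            total[key] += 1
            safe[key] += int(record["safe"])
    return {key: safe[key] / total[key] for key in total}

if __name__ == "__main__":
    # Hypothetical file of judged responses for one model.
    rates = safety_rates("llama3.2_judged_responses.jsonl")
    # Flag language/category cells below an example threshold of 0.9 safe,
    # surfacing cross-lingual inconsistencies of the kind the abstract
    # describes (e.g. a category unsafe in one language but safe in others).
    for (lang, cat), rate in sorted(rates.items()):
        if rate < 0.9:
            print(f"{lang:>2} {cat:<20} safe-rate={rate:.2%}")
```

Run per model, such a table makes it easy to compare the same category across languages and spot cells where safety degrades in only one of them.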
Related papers
- Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts [40.0358736497799]
Large language models (LLMs) are known to have the potential to generate harmful content.
This paper introduces Qorgau, a novel dataset specifically designed for safety evaluation in Kazakh and Russian.
Date: 2025-02-19
- Benchmarking LLM Guardrails in Handling Multilingual Toxicity [57.296161186129545]
We introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails.
We investigate the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrails' performance.
Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts.
Date: 2024-10-29
- Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks [18.208272960774337]
The safety of Large Language Models (LLMs) has sparked widespread concern.
Recent work demonstrates that safety alignment of LLMs can be easily removed by fine-tuning.
We take a further step to understand fine-tuning attacks in multilingual LLMs.
Date: 2024-10-23
- Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models [23.522660090382832]
We investigate the effectiveness of many-shot jailbreaking in Italian, where models are prompted with unsafe demonstrations to induce unsafe behaviour.
We find that the models exhibit unsafe behaviors even when prompted with few unsafe demonstrations, and -- more alarmingly -- that this tendency rapidly escalates with more demonstrations.
Date: 2024-08-08
- Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture [6.17896401271963]
We introduce Multilingual Blending, a mixed-language query-response scheme designed to evaluate the safety alignment of various large language models.
We investigate language patterns such as language availability, morphology, and language family that could impact the effectiveness of Multilingual Blending.
Date: 2024-07-10
- ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
Date: 2024-04-06
- Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs).
We consider two potential risky scenarios: unintentional and intentional.
We propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning.
Date: 2023-10-10
- All Languages Matter: On the Multilingual Safety of Large Language Models [96.47607891042523]
We build the first multilingual safety benchmark for large language models (LLMs).
XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families.
We propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT.
Date: 2023-10-02
- SafetyBench: Evaluating the Safety of Large Language Models [54.878612385780805]
SafetyBench is a comprehensive benchmark for evaluating the safety of Large Language Models (LLMs).
It comprises 11,435 diverse multiple-choice questions spanning 7 distinct categories of safety concerns.
Our tests of 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts.
Date: 2023-09-13
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.