Multilingual Collaborative Defense for Large Language Models
- URL: http://arxiv.org/abs/2505.11835v1
- Date: Sat, 17 May 2025 04:47:16 GMT
- Title: Multilingual Collaborative Defense for Large Language Models
- Authors: Hongliang Li, Jinan Xu, Gengping Cui, Changhao Guan, Fengran Mo, Kaiyu Huang
- Abstract summary: One notable vulnerability is the ability to bypass safeguards by translating harmful queries into rare or underrepresented languages. Despite the growing concern, there has been limited research addressing the safeguarding of LLMs in multilingual scenarios. We propose Multilingual Collaborative Defense (MCD), a novel learning method that automatically optimizes a continuous, soft safety prompt.
- Score: 33.14454771097587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The robustness and security of large language models (LLMs) have become a prominent research area. One notable vulnerability is the ability to bypass LLM safeguards by translating harmful queries into rare or underrepresented languages, a simple yet effective method of "jailbreaking" these models. Despite the growing concern, there has been limited research addressing the safeguarding of LLMs in multilingual scenarios, highlighting an urgent need to enhance multilingual safety. In this work, we investigate the correlation between various attack features across different languages and propose Multilingual Collaborative Defense (MCD), a novel learning method that automatically optimizes a continuous, soft safety prompt to facilitate multilingual safeguarding of LLMs. The MCD approach offers three advantages: First, it effectively improves safeguarding performance across multiple languages. Second, MCD maintains strong generalization capabilities while minimizing false refusal rates. Third, MCD mitigates the language safety misalignment caused by imbalances in LLM training corpora. To evaluate the effectiveness of MCD, we manually construct multilingual versions of commonly used jailbreak benchmarks, such as MaliciousInstruct and AdvBench, to assess various safeguarding methods. Additionally, we introduce these datasets in underrepresented (zero-shot) languages to verify the language transferability of MCD. The results demonstrate that MCD outperforms existing approaches in safeguarding against multilingual jailbreak attempts while also exhibiting strong language transfer capabilities. Our code is available at https://github.com/HLiang-Lee/MCD.
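The core mechanism the abstract describes, tuning a continuous soft safety prompt against a frozen LLM, can be illustrated with a minimal PyTorch-style sketch. Everything below (the model choice, prompt length, the `loss_for_batch` helper, and the plain mean over a mixed-language batch) is an illustrative assumption, not the authors' implementation; see the linked repository for the actual MCD code.

```python
# Minimal sketch of continuous (soft) safety-prompt tuning, in the spirit of MCD.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder choice of causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)  # the LLM stays frozen; only the prompt is trained

embed = model.get_input_embeddings()
n_virtual = 20  # length of the soft safety prompt (assumed hyperparameter)
soft_prompt = nn.Parameter(torch.randn(n_virtual, embed.embedding_dim) * 0.02)
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

def loss_for_batch(queries, refusals):
    """Teach the model to emit a refusal after harmful queries in any language."""
    losses = []
    for query, refusal in zip(queries, refusals):
        q_ids = tokenizer(query, return_tensors="pt").input_ids
        r_ids = tokenizer(refusal, return_tensors="pt",
                          add_special_tokens=False).input_ids
        # Prepend the trainable soft prompt to the query/refusal embeddings.
        inputs = torch.cat(
            [soft_prompt.unsqueeze(0), embed(q_ids), embed(r_ids)], dim=1
        )
        # Supervise only the refusal tokens; ignore prompt and query positions.
        ignore = torch.full((1, n_virtual + q_ids.size(1)), -100)
        labels = torch.cat([ignore, r_ids], dim=1)
        losses.append(model(inputs_embeds=inputs, labels=labels).loss)
    # A collaborative multilingual objective would weight per-language losses
    # here; a plain mean is the simplest stand-in.
    return torch.stack(losses).mean()

# One training step over a mixed-language batch of (harmful query, refusal) pairs:
# optimizer.zero_grad(); loss_for_batch(qs, rs).backward(); optimizer.step()
```

Because only the prompt embeddings receive gradients, the frozen LLM is unchanged, which is why a single learned prompt can be shared across languages at inference time.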
Related papers
- On the Evaluation of Large Language Models in Multilingual Vulnerability Repair [13.269680075539135]
Recent advances in large language models (LLMs) offer language-agnostic capabilities and strong semantic understanding.
arXiv Detail & Related papers (2025-08-05T14:05:32Z)
- Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models [0.0]
We show how surprisingly strong attacks can be created by altering just a few characters and using a small proxy model for word importance calculation. We find that these character- and word-level attacks drastically alter the predictions of different LLMs. We validate our attack construction methodology on Polish, a low-resource language, and find potential vulnerabilities of LLMs in this language.
arXiv Detail & Related papers (2025-06-09T11:09:39Z)
- Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models [83.80177564873094]
We propose a unified multimodal universal jailbreak attack framework. We evaluate the undesirable context generation of MLLMs like LLaVA, Yi-VL, MiniGPT4, MiniGPT-v2, and InstructBLIP. This study underscores the urgent need for robust safety measures in MLLMs.
arXiv Detail & Related papers (2025-06-02T04:33:56Z)
- MPO: Multilingual Safety Alignment via Reward Gap Optimization [88.76638442683391]
Large language models (LLMs) have become increasingly central to AI applications worldwide. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. We introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (English) to improve safety alignment across multiple languages.
arXiv Detail & Related papers (2025-05-22T16:24:51Z)
- MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning [56.79292318645454]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data are often limited. We propose an approach to build a multilingual guardrail with reasoning.
arXiv Detail & Related papers (2025-04-21T17:15:06Z)
- X-Guard: Multilingual Guard Agent for Content Moderation [8.233872344445675]
X-Guard is a transparent multilingual safety agent designed to provide content moderation across diverse linguistic contexts. Our approach includes curating and enhancing multiple open-source safety datasets with explicit evaluation rationales. Our empirical evaluations demonstrate X-Guard's effectiveness in detecting unsafe content across multiple languages.
arXiv Detail & Related papers (2025-04-11T01:58:06Z)
- Benchmarking LLM Guardrails in Handling Multilingual Toxicity [57.296161186129545]
We introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails.
We investigate the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrails' performance.
Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts.
arXiv Detail & Related papers (2024-10-29T15:51:24Z)
- Text Embedding Inversion Security for Multilingual Language Models [2.790855523145802]
Research shows that text can be reconstructed from embeddings, even without knowledge of the underlying model.
This study is the first to investigate multilingual inversion attacks, shedding light on the differences in attacks and defenses across monolingual and multilingual settings.
arXiv Detail & Related papers (2024-01-22T18:34:42Z)
- Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs).
We consider two potential risky scenarios: unintentional and intentional.
We propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.