Related papers: SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

URL: http://arxiv.org/abs/2507.08898v3
Date: Thu, 17 Jul 2025 08:01:44 GMT
Title: SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems
Authors: Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn,
Abstract summary: This paper introduces SEALGuard, a multilingual guardrail designed to improve the safety alignment across diverse languages.<n>It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts.<n>We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages.
Score: 9.469589800082597
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Safety alignment is critical for LLM-powered systems. While recent LLM-powered guardrail approaches such as LlamaGuard achieve high detection accuracy of unsafe inputs written in English (e.g., ``How to create a bomb?''), they struggle with multilingual unsafe inputs. This limitation leaves LLM systems vulnerable to unsafe and jailbreak prompts written in low-resource languages such as those in Southeast Asia. This paper introduces SEALGuard, a multilingual guardrail designed to improve the safety alignment across diverse languages. It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts in LLM-powered systems. We adapt a general-purpose multilingual language model into a multilingual guardrail using low-rank adaptation (LoRA). We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages, including safe, unsafe, and jailbreak cases. We evaluate SEALGuard against state-of-the-art guardrails such as LlamaGuard on this benchmark. Our findings show that multilingual unsafe and jailbreak prompts substantially degrade the performance of the state-of-the-art LlamaGuard, which experiences a drop in Defense Success Rate (DSR) by 9% and 18%, respectively, compared to its performance on English-only prompts. In contrast, SEALGuard outperforms existing guardrails in detecting multilingual unsafe and jailbreak prompts, improving DSR by 48% over LlamaGuard and achieving the best DSR, precision, and F1-score. Our ablation study further reveals the contributions of adaptation strategies and model size to the overall performance of SEALGuard. We release our pre-trained model and benchmark at https://github.com/awsm-research/SEALGuard to support further research.

Related papers

AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software [11.606665113249298]
Guardrails are critical for the safe deployment of Large Language Models (LLMs)-powered software.<n>We propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs.<n>We show that AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains over 85% F1-score on in-distribution data post-adaptation.
arXiv Detail & Related papers (2025-09-21T01:22:42Z)
Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages [57.059267233093465]
Large Language Models (LLMs) have transformed natural language processing, but their safety mechanisms remain under-explored in low-resource, multilingual settings.<n>We introduce textsfSGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context.<n>We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails.
arXiv Detail & Related papers (2025-09-18T08:14:34Z)
MPO: Multilingual Safety Alignment via Reward Gap Optimization [88.76638442683391]
Large language models (LLMs) have become increasingly central to AI applications worldwide.<n>Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data.<n>We introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (English) to improve safety alignment across multiple languages.
arXiv Detail & Related papers (2025-05-22T16:24:51Z)
MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety [56.79292318645454]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking.<n>This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited.<n>We introduce a multilingual guardrail with reasoning for prompt classification.
arXiv Detail & Related papers (2025-04-21T17:15:06Z)
X-Guard: Multilingual Guard Agent for Content Moderation [8.233872344445675]
X-Guard is a transparent multilingual safety agent designed to provide content moderation across diverse linguistic contexts.<n>Our approach includes curating and enhancing multiple open-source safety datasets with explicit evaluation rationales.<n>Our empirical evaluations demonstrate X-Guard's effectiveness in detecting unsafe content across multiple languages.
arXiv Detail & Related papers (2025-04-11T01:58:06Z)
SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks [90.41592442792181]
We propose a fine-grained benchmark SafeDialBench for evaluating the safety of Large Language Models (LLMs)<n>Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions and generates more than 4000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios.<n> Notably, we construct an innovative assessment framework of LLMs, measuring capabilities in detecting, and handling unsafe information and maintaining consistency when facing jailbreak attacks.
arXiv Detail & Related papers (2025-02-16T12:08:08Z)
LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Inconsistencies [63.10843814055688]
M-ALERT is a benchmark that evaluates the safety of Large Language Models in five languages.<n>M-ALERT includes 15k high-quality prompts per language, totaling 75k, with category-wise annotations.<n>Our experiments on 39 state-of-the-art LLMs highlight the importance of language-specific safety analysis.
arXiv Detail & Related papers (2024-12-19T16:46:54Z)
Benchmarking LLM Guardrails in Handling Multilingual Toxicity [57.296161186129545]
We introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails. We investigate the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrails' performance. Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts.
arXiv Detail & Related papers (2024-10-29T15:51:24Z)
Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs) We consider two potential risky scenarios: unintentional and intentional. We propose a novel textscSelf-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z)
Low-Resource Languages Jailbreak GPT-4 [19.97929171158234]
Our work exposes the inherent cross-lingual vulnerability of AI safety training and red-teaming of large language models (LLMs) On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time. Other high-/mid-resource languages have significantly lower attack success rate, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages.
arXiv Detail & Related papers (2023-10-03T21:30:56Z)
All Languages Matter: On the Multilingual Safety of Large Language Models [96.47607891042523]
We build the first multilingual safety benchmark for large language models (LLMs) XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT.
arXiv Detail & Related papers (2023-10-02T05:23:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.