Related papers: Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models

URL: http://arxiv.org/abs/2601.22737v1
Date: Fri, 30 Jan 2026 09:18:13 GMT
Title: Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models
Authors: Enyi Shi, Pengyang Shao, Yanxin Zhang, Chenhang Cui, Jiayi Lyu, Xu Xie, Xiaobo Xia, Fei Shen, Tat-Seng Chua
Abstract summary: Existing benchmarks are typically multilingual but text-only, or multimodal but monolingual. Recent multilingual red-teaming efforts render harmful prompts into images, yet rely heavily on typography-style visuals. We introduce a benchmark of 100,440 harmful image-text pairs across 10 languages, explicitly partitioned into image-dominant and text-dominant subsets.
Score: 54.10540442330978
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Robust safety of vision-language large models (VLLMs) under joint multilingual and multimodal inputs remains underexplored. Existing benchmarks are typically multilingual but text-only, or multimodal but monolingual. Recent multilingual multimodal red-teaming efforts render harmful prompts into images, yet rely heavily on typography-style visuals and lack semantically grounded image-text pairs, limiting coverage of realistic cross-modal interactions. We introduce Lingua-SafetyBench, a benchmark of 100,440 harmful image-text pairs across 10 languages, explicitly partitioned into image-dominant and text-dominant subsets to disentangle risk sources. Evaluating 11 open-source VLLMs reveals a consistent asymmetry: image-dominant risks yield a higher Attack Success Rate (ASR) in high-resource languages (HRLs), while text-dominant risks are more severe in non-high-resource languages (Non-HRLs). A controlled study on the Qwen series shows that scaling and version upgrades reduce ASR overall but disproportionately benefit HRLs, widening the gap between HRLs and Non-HRLs under text-dominant risks. This underscores the necessity of language- and modality-aware safety alignment beyond mere scaling. To facilitate reproducibility and future research, we will publicly release our benchmark, model checkpoints, and source code. The code and dataset will be available at https://github.com/zsxr15/Lingua-SafetyBench. Warning: this paper contains examples with unsafe content.

Related papers

Multilingual Safety Alignment Via Sparse Weight Editing [11.684928396991742]
We propose a training-free alignment framework based on Sparse Weight Editing. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs. Our method substantially reduces Attack Success Rate (ASR) in LRLs with negligible impact on general reasoning capabilities.
arXiv Detail & Related papers (2026-02-26T02:46:13Z)
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment [15.241143079313757]
We introduce a plug-and-play Multi-Lingual Consistency (MLC) loss that can be integrated into existing monolingual alignment pipelines. This allows simultaneous alignment across multiple languages using only multilingual prompt variants without requiring additional semantic response-level supervision in low-resource languages.
arXiv Detail & Related papers (2026-02-18T18:01:23Z)
OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models [54.80460603255789]
We introduce OutSafe-Bench, the first and most comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories.
arXiv Detail & Related papers (2025-11-13T13:18:27Z)
Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages [57.059267233093465]
Large Language Models (LLMs) have transformed natural language processing, but their safety mechanisms remain under-explored in low-resource, multilingual settings. We introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails.
arXiv Detail & Related papers (2025-09-18T08:14:34Z)
RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages [3.7678366606419345]
RabakBench is a new multilingual safety benchmark localized to Singapore's unique linguistic context. The benchmark dataset, including the human-verified translations, and evaluation code are publicly available.
arXiv Detail & Related papers (2025-07-08T13:37:25Z)
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models [83.80177564873094]
We propose a unified multimodal universal jailbreak attack framework. We evaluate the undesirable context generation of MLLMs like LLaVA, Yi-VL, MiniGPT4, MiniGPT-v2, and InstructBLIP. This study underscores the urgent need for robust safety measures in MLLMs.
arXiv Detail & Related papers (2025-06-02T04:33:56Z)
MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety [56.77103365251923]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited. We introduce a multilingual guardrail with reasoning for prompt classification.
arXiv Detail & Related papers (2025-04-21T17:15:06Z)
The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context [0.9130277390156759]
Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. Despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings.
arXiv Detail & Related papers (2025-04-03T15:46:46Z)
Text Embedding Inversion Security for Multilingual Language Models [2.790855523145802]
Research shows that text can be reconstructed from embeddings, even without knowledge of the underlying model. This study is the first to investigate multilingual inversion attacks, shedding light on the differences in attacks and defenses across monolingual and multilingual settings.
arXiv Detail & Related papers (2024-01-22T18:34:42Z)
All Languages Matter: On the Multilingual Safety of Large Language Models [96.47607891042523]
We build the first multilingual safety benchmark for large language models (LLMs), XSafety. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT.
arXiv Detail & Related papers (2023-10-02T05:23:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.