Improving Methodologies for LLM Evaluations Across Global Languages
- URL: http://arxiv.org/abs/2601.15706v1
- Date: Thu, 22 Jan 2026 07:18:08 GMT
- Title: Improving Methodologies for LLM Evaluations Across Global Languages
- Authors: Akriti Vij, Benjamin Chua, Darshini Ramiah, En Qi Ng, Mahran Morsidi, Naga Nikshith Gangarapu, Sharmini Johnson, Vanessa Wilfred, Vikneswaran Kumaran, Wan Sie Lee, Wenzhuo Yang, Yongsen Zheng, Bill Black, Boming Xia, Frank Sun, Hao Zhang, Qinghua Lu, Suyu Ma, Yue Liu, Chi-kiu Lo, Fatemeh Azadi, Isar Nejadgholi, Sowmya Vajjala, Agnes Delaborde, Nicolas Rolin, Tom Seimandi, Akiko Murakami, Haruto Ishi, Satoshi Sekine, Takayuki Semitsu, Tasuku Sasaki, Angela Kinuthia, Jean Wangari, Michael Michie, Stephanie Kasaon, Hankyul Baek, Jaewon Noh, Kihyuk Nam, Sang Seo, Sungpil Shin, Taewhi Lee, Yongsu Kim, Daisy Newbold-Harrop, Jessica Wang, Mahmoud Ghanem, Vy Hong
- Abstract summary: The exercise shows how safety behaviours can vary across languages. It also generated insights for improving multilingual safety evaluations. This work represents an initial step toward a shared framework for multilingual safety testing of advanced AI systems.
- Score: 19.63570354411416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As frontier AI models are deployed globally, it is essential that their behaviour remains safe and reliable across diverse linguistic and cultural contexts. To examine how current model safeguards hold up in such settings, participants from the International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the EU, France, Kenya, South Korea and the UK, conducted a joint multilingual evaluation exercise. Led by Singapore AISI, two open-weight models were tested across ten languages spanning high- and low-resourced groups: Cantonese, English, Farsi, French, Japanese, Korean, Kiswahili, Malay, Mandarin Chinese and Telugu. Over 6,000 newly translated prompts were evaluated across five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness), using both LLM-as-a-judge and human annotation. The exercise shows how safety behaviours can vary across languages, including differences in safeguard robustness across languages and harm types, and variation in evaluator reliability (LLM-as-judge vs. human review). It also generated methodological insights for improving multilingual safety evaluations, such as the need for culturally contextualised translations, stress-tested evaluator prompts and clearer human annotation guidelines. This work represents an initial step toward a shared framework for multilingual safety testing of advanced AI systems and calls for continued collaboration with the wider research community and industry.
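The evaluation pipeline described in the abstract (an LLM judge scoring responses per harm category, cross-checked against human annotation) can be sketched in miniature. This is an illustrative sketch only, not the paper's actual implementation: the judge is stubbed with a refusal-keyword heuristic, and all names and categories here are taken from or modelled on the abstract, not from released code.

```python
# Illustrative sketch of an LLM-as-a-judge safety loop with an
# evaluator-reliability check. The real exercise would replace
# judge_response() with a call to a judge model using a rubric prompt.

HARM_CATEGORIES = [
    "privacy", "non_violent_crime", "violent_crime",
    "intellectual_property", "jailbreak_robustness",
]

def judge_response(response: str) -> str:
    """Stub judge: mark a response 'safe' if it refuses, else 'unsafe'.
    A real judge model would parse a structured verdict, not keywords."""
    refusal_markers = ("i cannot", "i can't", "i won't", "sorry")
    text = response.lower()
    return "safe" if any(m in text for m in refusal_markers) else "unsafe"

def agreement_rate(llm_verdicts: list[str], human_verdicts: list[str]) -> float:
    """Fraction of prompts where the LLM judge and the human annotator
    agree -- one simple way to compare evaluator reliability."""
    if len(llm_verdicts) != len(human_verdicts) or not llm_verdicts:
        raise ValueError("verdict lists must be non-empty and equal length")
    matches = sum(a == b for a, b in zip(llm_verdicts, human_verdicts))
    return matches / len(llm_verdicts)

# Toy run over two model responses with matching human labels.
responses = ["Sorry, I can't help with that.", "Sure, here is how to do it:"]
llm_verdicts = [judge_response(r) for r in responses]
human_verdicts = ["safe", "unsafe"]
print(llm_verdicts, agreement_rate(llm_verdicts, human_verdicts))
```

In practice, agreement would be computed per language and per harm category, since the paper reports that judge reliability itself varies across languages.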
Related papers
- Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages [8.667909336164465]
Large language models (LLMs) are being deployed across the Global South. Everyday use involves low-resource languages, code-mixing, and culturally specific norms. Our aim is to make multilingual safety a core requirement, not an add-on, for equitable AI in underrepresented regions.
arXiv Detail & Related papers (2026-02-14T19:56:40Z)
- UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages [18.40701733030824]
Current guardian models are predominantly Western-centric and optimized for high-resource languages. We introduce UbuntuGuard, the first African policy-based safety benchmark built from adversarial queries authored by 155 domain experts.
arXiv Detail & Related papers (2026-01-19T03:37:56Z)
- MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation [91.22008265721952]
MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned benchmark covering 8 Asian countries and 10 languages. This is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. We propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity.
arXiv Detail & Related papers (2025-10-07T14:12:12Z)
- Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages [57.059267233093465]
Large Language Models (LLMs) have transformed natural language processing, but their safety mechanisms remain under-explored in low-resource, multilingual settings. We introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails.
arXiv Detail & Related papers (2025-09-18T08:14:34Z)
- LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models [22.273388934888278]
Our dataset comprises 45k entries in 12 languages, ranging from Hungarian to Malay. Our benchmark provides a comprehensive suite of metrics for in-depth safety evaluation.
arXiv Detail & Related papers (2025-08-18T08:59:01Z)
- Humans overrely on overconfident language models, across languages [32.71245803698373]
We study the risks of multilingual linguistic (mis)calibration, overconfidence, and overreliance across five languages. Our work finds that overreliance risks are high across languages.
arXiv Detail & Related papers (2025-07-08T18:01:01Z)
- MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety [56.77103365251923]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited. We introduce a multilingual guardrail with reasoning for prompt classification.
arXiv Detail & Related papers (2025-04-21T17:15:06Z)
- PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages [27.318299273902984]
PolyGUARD is a new state-of-the-art multilingual safety model for safeguarding Large Language Model (LLM) generations. It is trained on the largest multilingual safety training corpus to date, containing 1.91M samples across 17 languages. PolyGUARDPROMPTS is a high-quality multilingual benchmark with 29K samples for the evaluation of safety guardrails.
arXiv Detail & Related papers (2025-04-06T06:09:21Z)
- XIFBench: Evaluating Large Language Models on Multilingual Instruction Following [59.549015333755186]
Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. Existing evaluations lack fine-grained constraint analysis across diverse linguistic contexts. We introduce XIFBench, a comprehensive benchmark for evaluating the multilingual instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-03-10T17:07:52Z)
- LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Inconsistencies [63.10843814055688]
M-ALERT is a benchmark that evaluates the safety of Large Language Models in five languages. M-ALERT includes 15k high-quality prompts per language, totaling 75k, with category-wise annotations. Our experiments on 39 state-of-the-art LLMs highlight the importance of language-specific safety analysis.
arXiv Detail & Related papers (2024-12-19T16:46:54Z)
- ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.