FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models
- URL: http://arxiv.org/abs/2511.18852v1
- Date: Mon, 24 Nov 2025 07:48:35 GMT
- Title: FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models
- Authors: Masoomali Fatehkia, Enes Altinisik, Husrev Taha Sencar
- Abstract summary: FanarGuard is a bilingual moderation filter that evaluates both safety and cultural alignment in Arabic and English. To rigorously evaluate cultural alignment, we develop the first benchmark targeting Arabic cultural contexts. Results show that FanarGuard achieves stronger agreement with human annotations than inter-annotator reliability.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Content moderation filters are a critical safeguard against alignment failures in language models. Yet most existing filters focus narrowly on general safety and overlook cultural context. In this work, we introduce FanarGuard, a bilingual moderation filter that evaluates both safety and cultural alignment in Arabic and English. We construct a dataset of over 468K prompt and response pairs, drawn from synthetic and public datasets, scored by a panel of LLM judges on harmlessness and cultural awareness, and use it to train two filter variants. To rigorously evaluate cultural alignment, we further develop the first benchmark targeting Arabic cultural contexts, comprising over 1k norm-sensitive prompts with LLM-generated responses annotated by human raters. Results show that FanarGuard achieves stronger agreement with human annotations than inter-annotator reliability, while matching the performance of state-of-the-art filters on safety benchmarks. These findings highlight the importance of integrating cultural awareness into moderation and establish FanarGuard as a practical step toward more context-sensitive safeguards.
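The abstract describes a filter that scores each prompt–response pair on two dimensions, harmlessness and cultural awareness, and moderates on both. The sketch below is not the authors' code: it only illustrates how such a two-dimensional moderation decision might be combined into an allow/block outcome. The field names, thresholds, and block labels are illustrative assumptions.

```python
# Minimal sketch (illustrative only, not FanarGuard's implementation):
# a moderation decision that blocks a response if EITHER its safety
# score or its cultural-alignment score falls below a threshold.
from dataclasses import dataclass


@dataclass
class ModerationResult:
    harmlessness: float        # 0.0 (harmful) .. 1.0 (harmless)
    cultural_alignment: float  # 0.0 (misaligned) .. 1.0 (aligned)


def moderate(result: ModerationResult,
             safety_threshold: float = 0.5,
             culture_threshold: float = 0.5) -> str:
    """Return an allow/block verdict from the two scores."""
    if result.harmlessness < safety_threshold:
        return "block:unsafe"
    if result.cultural_alignment < culture_threshold:
        return "block:culturally_misaligned"
    return "allow"


# A response can be harmless in a generic-safety sense yet still be
# blocked for poor cultural alignment, which is the gap the paper targets.
print(moderate(ModerationResult(harmlessness=0.9, cultural_alignment=0.2)))
# prints: block:culturally_misaligned
```

In this sketch the two dimensions are deliberately not averaged: a high safety score cannot compensate for a low cultural-alignment score, mirroring the paper's point that general safety alone overlooks cultural context.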
Related papers
- Tears or Cheers? Benchmarking LLMs via Culturally Elicited Distinct Affective Responses [28.3173238194554]
We introduce CEDAR, a benchmark constructed entirely from scenarios capturing Culturally Elicited Distinct Affective Responses. The resulting benchmark comprises 10,962 instances across seven languages and 14 fine-grained emotion categories, with each language including 400 multimodal and 1,166 text-only samples.
arXiv Detail & Related papers (2026-01-19T13:04:26Z) - UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages [18.40701733030824]
Current guardian models are predominantly Western-centric and optimized for high-resource languages. We introduce UbuntuGuard, the first African policy-based safety benchmark built from adversarial queries authored by 155 domain experts.
arXiv Detail & Related papers (2026-01-19T03:37:56Z) - SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures [36.95168918567729]
Existing multilingual safety benchmarks often rely on machine-translated English data. We introduce SEA-SafeguardBench, the first human-verified safety benchmark for SEA. It covers eight languages and 21,640 samples across three subsets: general, in-the-wild, and content generation.
arXiv Detail & Related papers (2025-12-05T07:57:57Z) - Cross-Cultural Transfer of Commonsense Reasoning in LLMs: Evidence from the Arab World [68.19795061447044]
This paper investigates cross-cultural transfer of commonsense reasoning in the Arab world. Using a culturally grounded commonsense reasoning dataset covering 13 Arab countries, we evaluate lightweight alignment methods. Our results show that merely 12 culture-specific examples from one country can improve performance in others by 10% on average.
arXiv Detail & Related papers (2025-09-23T17:24:14Z) - CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications [5.151690536714851]
We present CultureGuard, a novel solution for curating culturally aligned, high-quality safety datasets across multiple languages. Our approach introduces a four-stage synthetic data generation and filtering pipeline: cultural data segregation, cultural data adaptation, machine translation, and quality filtering. The resulting dataset, Nemotron-Safety-Guard-Dataset-v3, comprises 386,661 samples in 9 languages and facilitates the training of Llama-3.1-Nemotron-Safety-Guard-8B-v3 via LoRA-based fine-tuning.
arXiv Detail & Related papers (2025-08-03T10:35:05Z) - Multimodal Cultural Safety: Evaluation Frameworks and Alignment Strategies [58.88053690412802]
Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants. CROSS is a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs. We evaluate 21 leading LVLMs, including mixture-of-experts models and reasoning models.
arXiv Detail & Related papers (2025-05-20T23:20:38Z) - MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety [56.77103365251923]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited. We introduce a multilingual guardrail with reasoning for prompt classification.
arXiv Detail & Related papers (2025-04-21T17:15:06Z) - SafeWorld: Geo-Diverse Safety Alignment [107.84182558480859]
We introduce SafeWorld, a novel benchmark specifically designed to evaluate Large Language Models (LLMs). SafeWorld encompasses 2,342 test user queries, each grounded in high-quality, human-verified cultural norms and legal policies from 50 countries and 493 regions/races. Our trained SafeWorldLM outperforms all competing models, including GPT-4o, on all three evaluation dimensions by a large margin.
arXiv Detail & Related papers (2024-12-09T13:31:46Z) - Arabic Dataset for LLM Safeguard Evaluation [62.96160492994489]
This study explores the safety of large language models (LLMs) in Arabic, with its linguistic and cultural complexities. We present an Arab-region-specific safety evaluation dataset consisting of 5,799 questions, including direct attacks, indirect attacks, and harmless requests with sensitive words.
arXiv Detail & Related papers (2024-10-22T14:12:43Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.