Fugu-MT paper translation (abstract): IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

Paper overview: IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

arxiv url: http://arxiv.org/abs/2603.17915v1
Date: Wed, 18 Mar 2026 16:54:07 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.830627
Title: IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia
Title (reference translation): IndicSafe: A Multilingual LLM Safety Evaluation Benchmark for South Asia
Authors: Priyaranjan Pattnayak, Sanchari Chowdhuri
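Abstract summary: The paper presents the first systematic evaluation of large language model (LLM) safety across 12 Indic languages. Cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages. The authors release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments.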
Reference score (independently computed attention score): 0.6978180153516672
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompts. Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts or overflag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.
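Abstract (reference translation): As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages is not well understood. This paper presents the first systematic evaluation of LLM safety across 12 Indic languages, which are spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, 10 LLMs are evaluated on translated variants of the prompts. Cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts or overflag politically sensitive topics, while others fail to flag unsafe generations. These failures are quantified using prompt-level entropy, category bias scores, and multilingual consistency indices. The findings highlight safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. The authors release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate language-aware alignment strategies grounded in regional harms.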
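
The abstract names prompt-level entropy, cross-language agreement, and per-language SAFE rates but does not define them. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes agreement means "the same label in every language" and that entropy is Shannon entropy over the labels a single prompt receives across languages. The function names, the UNSAFE label, the language codes, and the toy data are all hypothetical; only the SAFE label appears in the abstract.

```python
from collections import Counter
from math import log2


def prompt_entropy(labels):
    """Shannon entropy (in bits) of the labels one prompt received across languages."""
    counts = Counter(labels)
    total = len(labels)
    # p * log2(1/p) summed over observed labels; 0.0 when all labels match.
    return sum((c / total) * log2(total / c) for c in counts.values())


def cross_language_agreement(labels_by_prompt):
    """Fraction of prompts that receive the same label in every language (assumed definition)."""
    unanimous = sum(
        1 for per_lang in labels_by_prompt.values()
        if len(set(per_lang.values())) == 1
    )
    return unanimous / len(labels_by_prompt)


def safe_rate_by_language(labels_by_prompt):
    """Share of prompts labeled SAFE, computed separately for each language."""
    safe = Counter()
    seen = Counter()
    for per_lang in labels_by_prompt.values():
        for lang, label in per_lang.items():
            seen[lang] += 1
            if label == "SAFE":
                safe[lang] += 1
    return {lang: safe[lang] / seen[lang] for lang in seen}


# Hypothetical toy data: prompt id -> {language code -> safety label}.
toy_labels = {
    "p1": {"hi": "SAFE", "bn": "SAFE", "ta": "SAFE"},
    "p2": {"hi": "SAFE", "bn": "UNSAFE", "ta": "SAFE"},
}

agreement = cross_language_agreement(toy_labels)
rates = safe_rate_by_language(toy_labels)
entropies = {pid: prompt_entropy(list(v.values())) for pid, v in toy_labels.items()}
```

In this toy data, prompt p1 is labeled identically in all three languages and so has zero entropy, while p2 is not. Under these assumed definitions, a low agreement figure such as the reported 12.8% would mean that most prompts receive different labels in at least one language.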


Related papers