Fugu-MT Paper Translation (Summary): Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability

Paper summary: Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability

arxiv url: http://arxiv.org/abs/2605.03217v1
Date: Mon, 04 May 2026 23:12:32 GMT
Status: Translation complete
Last updated in system: 2026-05-06 19:35:43.678436
Title: Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability
Title (reference translation): Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability
Authors: Yash Aggarwal, Atmika Gorti, Vinija Jain, Aman Chadha, Krishnaprasad Thirunarayan, Manas Gaur
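Abstract summary: Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning. The paper introduces the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output. The criminal-bias scenarios that produced the highest MSI scores across models are selected as probes.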
Reference score (independently calculated interest level): 22.32075837181307
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation. In the behavioral stage, we introduce the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output across a graduated, seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating four leading models (Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5), we identify distinct behavioral signatures shaped by alignment design: for instance, Gemini 1.5 reaches 72.7% MSI by Tier 5 under socioeconomic framing, while Claude exhibits sharp suppression consistent with identity-based safety training. We then verify these behavioral patterns mechanistically. We select criminal-bias scenarios, which produced the highest MSI scores across models, as probes and apply logit lens, attention analysis, activation patching, and semantic probing to a controlled set of six models spanning three capability tiers: small language models (SLMs), instruction-tuned base models, and reasoning-distilled variants. Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. Critically, the socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.
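Abstract (reference translation): Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way in which bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation. In the behavioral stage, we introduce the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output across a graduated, seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating four leading models (Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5), we identify distinct behavioral signatures shaped by alignment design. We then verify these behavioral patterns mechanistically. We select the criminal-bias scenarios that produced the highest MSI scores across models as probes, and apply logit lens, attention analysis, activation patching, and semantic probing to a controlled set of six models spanning three capability tiers: small language models (SLMs), instruction-tuned base models, and reasoning-distilled variants. SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; and reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting that distillation compresses reasoning traces in ways that reactivate shallow statistical associations. Critically, the socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.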


Related papers