Fugu-MT Paper Translation (Abstract): Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

Paper overview: Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

arxiv url: http://arxiv.org/abs/2603.24125v1
Date: Wed, 25 Mar 2026 09:35:18 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.229902
Title: Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study
Title (reference translation): Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study
Authors: Nour Bouchouchi, Thibault Laugel, Xavier Renard, Christophe Marsala, Marie-Jeanne Lesot, Marcin Detyniecki
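Abstract summary: This work proposes a unified framework for jointly analyzing intrinsic and extrinsic gender bias in large language models. Measured under the unified protocol, latent gender information and expressed bias show a consistent association. The results suggest that alignment through supervised fine-tuning does reduce expressed bias, but measurable gender-related associations remain present in the internal representations.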
Reference score (independently computed attention score): 3.679036235271287
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model's underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. Our results suggest that while the latter indeed reduces expressed bias, measurable gender-related associations are still present in internal representations, and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, e.g., to the case of story generation.
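Abstract (reference translation): During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Output-level evaluation does not reveal whether alignment modifies the model's underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework that jointly analyzes intrinsic and extrinsic bias in LLMs using identical neutral prompts, directly comparing the gender-related information encoded in internal representations with the bias expressed in generated outputs. In contrast to prior work reporting weak or inconsistent correlations, we find a consistent relationship between latent gender information and expressed bias when both are measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. The results suggest that while expressed bias is indeed reduced, gender-related associations remain present in internal representations and can be reactivated by adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize to story generation.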

Related papers