Fugu-MT Paper Translation (Summary): Semantic and Structural Analysis of Implicit Biases in Large Language Models: An Interpretable Approach

Paper summary: Semantic and Structural Analysis of Implicit Biases in Large Language Models: An Interpretable Approach

arxiv url: http://arxiv.org/abs/2508.06155v1
Date: Fri, 08 Aug 2025 09:21:10 GMT
Status: Translation complete
Last updated in system: 2025-08-11 20:39:06.173545
Title: Semantic and Structural Analysis of Implicit Biases in Large Language Models: An Interpretable Approach
Authors: Renhan Zhang, Lian Lian, Zhen Qi, Guiran Liu
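Abstract summary: The paper proposes an interpretable bias detection method for identifying social biases hidden in model outputs. The method combines nested semantic representation with a contextual contrast mechanism. The evaluation focuses on several key metrics, including bias detection accuracy, semantic consistency, and contextual sensitivity.
Reference score (attention level computed independently by this site): 1.5749416770494704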
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper addresses the issue of implicit stereotypes that may arise during the generation process of large language models. It proposes an interpretable bias detection method aimed at identifying hidden social biases in model outputs, especially those semantic tendencies that are not easily captured through explicit linguistic features. The method combines nested semantic representation with a contextual contrast mechanism. It extracts latent bias features from the vector space structure of model outputs. Using attention weight perturbation, it analyzes the model's sensitivity to specific social attribute terms, thereby revealing the semantic pathways through which bias is formed. To validate the effectiveness of the method, this study uses the StereoSet dataset, which covers multiple stereotype dimensions including gender, profession, religion, and race. The evaluation focuses on several key metrics, such as bias detection accuracy, semantic consistency, and contextual sensitivity. Experimental results show that the proposed method achieves strong detection performance across various dimensions. It can accurately identify bias differences between semantically similar texts while maintaining high semantic alignment and output stability. The method also demonstrates high interpretability in its structural design. It helps uncover the internal bias association mechanisms within language models. This provides a more transparent and reliable technical foundation for bias detection. The approach is suitable for real-world applications where high trustworthiness of generated content is required.
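The abstract names attention weight perturbation as the way the method probes sensitivity to social attribute terms, but gives no implementation details. The following is a minimal sketch of the general idea only, not the authors' method: it damps the attention weight on one token, renormalizes, and measures how far the attention output moves. The function names, the toy tokens, and the scores and value vectors are all hypothetical and chosen purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(weights, values):
    """Weighted sum of value vectors (one vector per token)."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def perturb_weights(weights, index, scale):
    """Scale the weight at `index`, then renormalize so the weights sum to 1."""
    scaled = [w * scale if i == index else w for i, w in enumerate(weights)]
    total = sum(scaled)
    if total == 0:
        raise ValueError("perturbation removed all attention mass")
    return [w / total for w in scaled]


def sensitivity(scores, values, index, scale=0.0):
    """L2 distance between the original and the perturbed attention output."""
    weights = softmax(scores)
    base = attend(weights, values)
    shifted = attend(perturb_weights(weights, index, scale), values)
    return math.sqrt(sum((b - s) ** 2 for b, s in zip(base, shifted)))


# Hypothetical toy input: one query attending over four tokens.
tokens = ["the", "nurse", "said", "she"]
scores = [0.1, 2.0, 0.3, 1.5]                              # assumed raw scores
values = [[0.0, 0.1], [1.0, 0.0], [0.1, 0.1], [0.0, 1.0]]  # assumed value vectors

for i, tok in enumerate(tokens):
    print(tok, round(sensitivity(scores, values, i), 4))
```

In this toy setup, tokens whose removal shifts the output the most are the ones the attention output depends on most heavily. A real analysis would read the weights from an actual model rather than from hand-written numbers.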

Related papers