Fugu-MT Paper Translation (Abstract): Cultural Value Alignment Via Latent Activation Steering in Large Language Models

Paper summary: Cultural Value Alignment Via Latent Activation Steering in Large Language Models

arxiv url: http://arxiv.org/abs/2605.26365v1
Date: Mon, 25 May 2026 22:20:52 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:41.489676
Title: Cultural Value Alignment Via Latent Activation Steering in Large Language Models
Title (reference translation, rendered from the Japanese): Cultural Value Alignment via Latent Activation Steering in Large Language Models
Authors: Trung Duc Anh Dang, Sarah Masud
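Abstract summary: We propose a generalizable framework for cultural evaluation and intervention. By extracting implicit token probabilities from 300 dilemmas, we bypass surface-level alignment. We find substantial variation in adaptability and uncover a consistent phenomenon of latent entanglement.
Reference score (independently computed attention score): 4.181458436156503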
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Models (LLMs) often exhibit homogenized cultural perspectives. While the World Values Survey (WVS) provides a gold standard for mapping human values, traditional direct prompting of LLMs on WVS often fails to access the model's latent cultural depth, leading to safety-aligned refusals or neutral responses. Here, we propose a generalizable framework for cultural evaluation and intervention that transitions from abstract queries to scenario-based behavioral probing. By extracting implicit token probabilities across 300 situational dilemmas, we bypass surface-level alignment to map the latent coordinates of LLMs' cultural values. We further introduce activation steering to shift these internal alignments during the forward pass without retraining. Across multiple LLMs, we find substantial variation in adaptability and uncover a consistent phenomenon of latent entanglement, where interventions along one cultural dimension induce shifts along another. These results suggest that cultural values are encoded as coupled structures, limiting precise alignment. This work establishes a computationally efficient framework for cultural steering, highlighting the structural complexities of navigating global values with LLMs.
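Abstract (reference translation, rendered from the Japanese): Large language models (LLMs) often exhibit homogenized cultural perspectives. The World Values Survey (WVS) provides a gold standard for mapping human values, but conventional direct prompting of LLMs on the WVS fails to reach the model's latent cultural depth and results in safety-aligned refusals or neutral responses. In this paper, we propose a generalizable framework for cultural evaluation and intervention that moves from abstract queries to scenario-based behavioral probing. By extracting implicit token probabilities from 300 dilemmas, we bypass surface-level alignment to map the latent coordinates of LLMs' cultural values. We further introduce activation steering, which shifts these internal alignments during the forward pass without retraining. Across multiple LLMs, we find substantial variation in adaptability and uncover a consistent phenomenon of latent entanglement, in which an intervention along one cultural dimension induces a shift along another. These results suggest that cultural values are encoded as coupled structures, which limits precise alignment. This work establishes a computationally efficient framework for cultural steering and highlights the structural complexity of navigating global values with LLMs.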
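
The abstract names three ideas that a small numeric example may make easier to follow: reading implicit probabilities over answer-option tokens, shifting a hidden state along a direction during the forward pass, and the side effect on a second dimension ("latent entanglement"). The sketch below is only a toy illustration in pure Python, not the authors' implementation. The three-dimensional hidden state, the directions `dim_a` and `dim_b`, the `readout` vectors, and the strength `alpha` are all hypothetical values chosen for illustration; the abstract does not specify layers, directions, or scaling.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def option_probs(logits, option_tokens):
    """Softmax restricted to the answer-option tokens (e.g. "A" vs "B")."""
    scores = [logits[t] for t in option_tokens]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {t: e / total for t, e in zip(option_tokens, exps)}

def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden state; no weights are modified."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def logits_from(hidden, readout):
    """Toy stand-in for the output layer: one logit per option token."""
    return {tok: dot(hidden, w) for tok, w in readout.items()}

# Hypothetical toy values (assumptions, not taken from the paper).
hidden = [0.2, -0.1, 0.4]            # hidden state for one dilemma prompt
dim_a = [1.0, 0.0, 0.0]              # assumed direction for cultural dimension A
dim_b = [0.6, 0.8, 0.0]              # assumed direction for dimension B (not orthogonal to A)
readout = {"A": [1.0, 0.0, 0.0],     # assumed readout vectors for options "A" and "B"
           "B": [0.0, 1.0, 0.0]}

steered = steer(hidden, dim_a, alpha=2.0)

# Implicit option probabilities before and after the intervention.
p_before = option_probs(logits_from(hidden, readout), ["A", "B"])
p_after = option_probs(logits_from(steered, readout), ["A", "B"])

# Intended shift along A, and the unintended shift along B caused by
# the overlap between the two directions.
shift_a = dot(steered, dim_a) - dot(hidden, dim_a)
shift_b = dot(steered, dim_b) - dot(hidden, dim_b)

print(p_before, p_after)
print(shift_a, shift_b)
```

In this toy setup the side effect along `dim_b` equals `alpha` times the inner product of the two directions, so it vanishes only when the directions are orthogonal. That is one simple way coupled shifts could arise; whether the entanglement reported in the paper has this form is not stated in the abstract.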

Related papers