Fugu-MT Paper Translation (Abstract): Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations

Paper summary: Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations

arxiv url: http://arxiv.org/abs/2604.00209v1
Date: Tue, 31 Mar 2026 20:23:41 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:31.712166
Title: Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations
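Title (reference translation): Do LLMs Know Internally What Is Private? Probing and Steering Contextual Privacy Norms in Large Language Model Representations
Authors: Haoran Wang, Li Xiong, Kai Shu
Abstract summary: We study contextual privacy as a structured latent representation in large language models (LLMs). The three norm-determining CI parameters are encoded as linearly separable and functionally independent directions in activation space. Despite this internal structure, models still leak private information, revealing a clear gap between concept representation and model behavior.
Reference score (independently computed attention metric): 26.42147314861997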
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they frequently violate contextual privacy by disclosing private information in situations where humans would exercise discretion. This raises a fundamental question: do LLMs internally encode contextual privacy norms, and if so, why do violations persist? We present the first systematic study of contextual privacy as a structured latent representation in LLMs, grounded in contextual integrity (CI) theory. Probing multiple models, we find that the three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable and functionally independent directions in activation space. Despite this internal structure, models still leak private information in practice, revealing a clear gap between concept representation and model behavior. To bridge this gap, we introduce CI-parametric steering, which independently intervenes along each CI dimension. This structured control reduces privacy violations more effectively and predictably than monolithic steering. Our results demonstrate that contextual privacy failures arise from misalignment between representation and behavior rather than missing awareness, and that leveraging the compositional structure of CI enables more reliable contextual privacy control, shedding light on potential improvement of contextual privacy understanding in LLMs.
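Abstract (reference translation): Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they often violate contextual privacy by disclosing private information in situations where humans would exercise discretion. Do LLMs internally encode contextual privacy norms? This paper presents the first systematic study of contextual privacy as a structured latent representation in LLMs, grounded in contextual integrity (CI) theory. Probing multiple models, we show that the three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable and functionally independent directions in activation space. Despite this internal structure, models still leak private information, revealing a clear gap between concept representation and model behavior. To bridge this gap, we introduce CI-parametric steering, which intervenes independently along each CI dimension. This structured control reduces privacy violations more effectively and predictably than monolithic steering. These results suggest that leveraging the compositional structure of CI enables more reliable contextual privacy control, shedding light on potential improvements in contextual privacy understanding in LLMs.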
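
The probing and steering idea in the abstract can be illustrated with a toy sketch. This is not the authors' method or code: it assumes synthetic "activations" in which three binary CI parameters are planted along orthogonal directions, and it swaps in a simple difference-of-means probe and additive steering. All names (`make_synthetic_activations`, `fit_probe`, `steer`, `CI_PARAMS`) and all constants are hypothetical.

```python
import numpy as np

# Hypothetical labels for the three CI parameters named in the abstract.
CI_PARAMS = ["information_type", "recipient", "transmission_principle"]

def make_synthetic_activations(n=2000, dim=64, strength=2.0, seed=0):
    """Toy stand-in for model activations (assumption: not real LLM data).
    Each binary CI label shifts the activation along its own planted direction."""
    rng = np.random.default_rng(seed)
    true_dirs, _ = np.linalg.qr(rng.normal(size=(dim, len(CI_PARAMS))))
    labels = rng.integers(0, 2, size=(n, len(CI_PARAMS)))
    acts = rng.normal(size=(n, dim)) + strength * (2 * labels - 1) @ true_dirs.T
    return acts, labels

def fit_probe(acts, y):
    """Difference-of-means linear probe: unit direction plus midpoint threshold."""
    mu1, mu0 = acts[y == 1].mean(axis=0), acts[y == 0].mean(axis=0)
    direction = (mu1 - mu0) / np.linalg.norm(mu1 - mu0)
    threshold = 0.5 * (mu1 + mu0) @ direction
    return direction, threshold

def predict(acts, direction, threshold):
    """Classify by which side of the threshold the projection falls on."""
    return (acts @ direction > threshold).astype(int)

def steer(acts, direction, alpha):
    """Additive steering: shift every activation by alpha along one direction."""
    return acts + alpha * direction

acts, labels = make_synthetic_activations()
probes = [fit_probe(acts, labels[:, k]) for k in range(len(CI_PARAMS))]

# Linear separability: accuracy of each one-dimensional probe.
accuracy = [float((predict(acts, d, t) == labels[:, k]).mean())
            for k, (d, t) in enumerate(probes)]

# Independence proxy: pairwise cosine similarity of the probe directions.
cosines = np.array([[di @ dj for dj, _ in probes] for di, _ in probes])

# Steer only along the first CI direction, then re-read two of the probes.
(d0, t0), (d1, t1) = probes[0], probes[1]
steered = steer(acts, d0, alpha=-4.0)
steered_rate_0 = float(predict(steered, d0, t0).mean())
acc1_after = float((predict(steered, d1, t1) == labels[:, 1]).mean())

for name, acc in zip(CI_PARAMS, accuracy):
    print(f"{name}: probe accuracy {acc:.3f}")
print(f"fraction read as class 1 on {CI_PARAMS[0]} after steering: {steered_rate_0:.3f}")
print(f"{CI_PARAMS[1]} probe accuracy after steering: {acc1_after:.3f}")
```

In this toy setting, near-orthogonal probe directions mean that shifting activations along one CI dimension changes that probe's readout while leaving the others largely intact, which is the intuition behind intervening on each dimension independently. Whether real model activations behave this way is what the paper investigates; the sketch makes no claim about it.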

Related papers