Fugu-MT Paper Translation (Abstract): PrivacyAlign: Contextual Privacy Alignment for LLM Agents

Paper abstract: PrivacyAlign: Contextual Privacy Alignment for LLM Agents

arxiv url: http://arxiv.org/abs/2606.21710v1
Date: Fri, 19 Jun 2026 19:50:55 GMT
Status: Translation complete
Last updated in system: 2026-06-26 03:51:41.372854
Title: PrivacyAlign: Contextual Privacy Alignment for LLM Agents
Authors: Manveer Singh Tamber, Abhay Puri, Marc-Etienne Brunet, Perouz Taslakian, Jimmy Lin, Spandana Gella
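Abstract summary: PrivacyAlign is a dataset of 1,350 samples with 3,516 detailed annotations from 599 unique annotators. The authors first show that conditioning LLM judges on human annotations and explanations for reference responses makes their judgments more reliable. They then introduce annotation-conditioned reward modeling, which uses these annotations to score new responses during RL.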
Reference score (site-computed attention score): 38.3523446169468
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: AI agents acting on behalf of users are constantly making decisions, and for users to trust their agents, those decisions must align with what they actually want. Privacy is an important alignment problem for agents: every message, post, or tool call an agent makes is a contextual judgment about what is appropriate to share, with whom, and under which conditions. Because such judgments depend on social expectations and norms, human judgment does not merely label privacy violations but also helps define them. While existing work relies on unreliable proxies for both training and evaluation, we place human judgment at the center of agentic privacy alignment. We introduce PrivacyAlign, a dataset of 1,350 samples with 3,516 detailed annotations from 599 unique annotators across diverse scenarios where current LLMs actually leak, and use it to ground both alignment training and automated evaluation in human privacy norms. Building on these annotations, we first show that conditioning LLM judges on human annotations and explanations for reference responses to the same prompt makes their judgments more reliable. We then introduce annotation-conditioned reward modeling, which uses these annotations to score new responses during RL, and show that small open-weight agents trained with this reward better align with human privacy norms, with strong gains on PrivacyAlign and existing privacy benchmarks for agents.
