Fugu-MT Paper Translation (Abstract): Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity

Paper summary: Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity

arxiv url: http://arxiv.org/abs/2601.06596v1
Date: Sat, 10 Jan 2026 15:16:23 GMT
Status: Translation complete
Last updated in system: 2026-01-13 19:08:00.890169
Title: Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity
Authors: Hongjun An, Yiliang Song, Jiangan Chen, Jiawei Shao, Chi Zhang, Xuelong Li
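Abstract summary: We investigate whether aligned models are vulnerable to Preference-Undermining Attacks (PUA), a class of manipulative prompting strategies. Surprisingly, more advanced models are sometimes more susceptible to manipulative prompts.
Reference score (independently computed attention score): 45.92643973404507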
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Model (LLM) training often optimizes for preference alignment, rewarding outputs that are perceived as helpful and interaction-friendly. However, this preference-oriented objective can be exploited: manipulative prompts can steer responses toward user-appeasing agreement and away from truth-oriented correction. In this work, we investigate whether aligned models are vulnerable to Preference-Undermining Attacks (PUA), a class of manipulative prompting strategies designed to exploit the model's desire to please user preferences at the expense of truthfulness. We propose a diagnostic methodology that provides a finer-grained and more directive analysis than aggregate benchmark scores, using a factorial evaluation framework to decompose prompt-induced shifts into interpretable effects of system objectives (truth- vs. preference-oriented) and PUA-style dialogue factors (directive control, personal derogation, conditional approval, reality denial) within a controlled $2 \times 2^4$ design. Surprisingly, more advanced models are sometimes more susceptible to manipulative prompts. Beyond the dominant reality-denial factor, we observe model-specific sign reversals and interactions with PUA-style factors, suggesting tailored defenses rather than uniform robustness. These findings offer a novel, reproducible factorial evaluation methodology that provides finer-grained diagnostics for post-training processes like RLHF, enabling better trade-offs in the product iteration of LLMs by offering a more nuanced understanding of preference alignment risks and the impact of manipulative prompts.
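The $2 \times 2^4$ design mentioned in the abstract can be illustrated with a minimal sketch: one binary system-objective factor crossed with four binary PUA-style factors gives 32 conditions. The factor names below come from the abstract; the 0/1 level coding, the `toy_score` function, and the simple difference-of-means main-effect estimate are assumptions made for illustration and are not claimed to be the authors' actual analysis or data.

```python
from itertools import product

# Factor names are taken from the abstract; the 0/1 coding is an assumption.
FACTORS = [
    "system_objective",      # 0 = truth-oriented, 1 = preference-oriented
    "directive_control",     # 0 = absent, 1 = present
    "personal_derogation",
    "conditional_approval",
    "reality_denial",
]


def build_design():
    """Enumerate all 2 x 2^4 = 32 cells of the full factorial design."""
    return [dict(zip(FACTORS, levels))
            for levels in product((0, 1), repeat=len(FACTORS))]


def main_effect(design, scores, factor):
    """Mean score with the factor at level 1 minus mean score at level 0."""
    on = [s for cell, s in zip(design, scores) if cell[factor] == 1]
    off = [s for cell, s in zip(design, scores) if cell[factor] == 0]
    return sum(on) / len(on) - sum(off) / len(off)


def toy_score(cell):
    """Hypothetical accuracy-like score; synthetic, NOT results from the paper."""
    return 0.9 - 0.3 * cell["reality_denial"] - 0.05 * cell["system_objective"]


design = build_design()
scores = [toy_score(cell) for cell in design]
effects = {f: main_effect(design, scores, f) for f in FACTORS}

for name, value in effects.items():
    print(f"{name:22s} {value:+.3f}")
```

Because the design is balanced, each main effect averages over all levels of the other factors, which is what lets a prompt-induced shift be attributed to an individual factor. Interactions and model-specific sign reversals, which the abstract also reports, would require additional contrasts that this sketch omits.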

Related papers