Fugu-MT Paper Translation (Abstract): Preference Poisoning Attacks on Reward Model Learning

Paper overview: Preference Poisoning Attacks on Reward Model Learning

arxiv url: http://arxiv.org/abs/2402.01920v2
Date: Tue, 08 Oct 2024 20:32:15 GMT
Status: Translation complete
System update date: 2024-12-04 07:48:59.355876
Title: Preference Poisoning Attacks on Reward Model Learning
Authors: Junlin Wu, Jiongxiao Wang, Chaowei Xiao, Chenguang Wang, Ning Zhang, Yevgeniy Vorobeychik
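Abstract summary: We investigate the nature and extent of the vulnerability of reward model learning from pairwise comparisons. We propose two classes of algorithmic approaches for these attacks: a gradient-based framework and several variants of rank-by-distance methods. The best attacks are often highly successful, achieving in the most extreme case a 100% success rate with only 0.3% of the data poisoned.
Reference score (independently computed attention measure): 47.00395978031771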
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Learning reward models from pairwise comparisons is a fundamental component in a number of domains, including autonomous control, conversational agents, and recommendation systems, as part of a broad goal of aligning automated decisions with user preferences. These approaches entail collecting preference information from people, with feedback often provided anonymously. Since preferences are subjective, there is no gold standard to compare against; yet the reliance of high-impact systems on preference learning creates a strong motivation for malicious actors to skew data collected in this fashion to their ends. We investigate the nature and extent of this vulnerability by considering an attacker who can flip a small subset of preference comparisons to either promote or demote a target outcome. We propose two classes of algorithmic approaches for these attacks: a gradient-based framework, and several variants of rank-by-distance methods. Next, we evaluate the efficacy of the best attacks in both these classes in successfully achieving malicious goals on datasets from three domains: autonomous control, recommendation systems, and textual prompt-response preference learning. We find that the best attacks are often highly successful, achieving in the most extreme case a 100% success rate with only 0.3% of the data poisoned. However, which attack is best can vary significantly across domains. In addition, we observe that the simpler and more scalable rank-by-distance approaches are often competitive with, and on occasion significantly outperform, gradient-based methods. Finally, we show that state-of-the-art defenses against other classes of poisoning attacks exhibit limited efficacy in our setting.
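Illustrative sketch (not from the paper): the abstract describes an attacker who flips a small subset of pairwise comparisons to promote a target outcome. The toy Python below shows that idea under loudly stated assumptions: a linear reward fitted with a Bradley-Terry likelihood, and a simple "flip the comparisons whose losing item is closest to the target" heuristic. The paper's actual gradient-based and rank-by-distance methods are not specified in the abstract, so every name here (`bt_prob`, `fit`, `flip_nearest`, `reward`) and the toy data are hypothetical.

```python
import math

def bt_prob(w, xa, xb):
    """Bradley-Terry probability that item a beats item b,
    assuming a linear reward r(x) = w . x (an assumption of this sketch)."""
    diff = sum(wi * (ai - bi) for wi, ai, bi in zip(w, xa, xb))
    return 1.0 / (1.0 + math.exp(-diff))

def fit(data, dim, lr=0.5, steps=200):
    """Fit w by gradient ascent on the average Bradley-Terry log-likelihood."""
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for xa, xb, y in data:
            err = y - bt_prob(w, xa, xb)  # y = 1 means "a preferred over b"
            for j in range(dim):
                grad[j] += err * (xa[j] - xb[j])
        w = [wj + lr * gj / len(data) for wj, gj in zip(w, grad)]
    return w

def flip_nearest(data, target, budget):
    """Hypothetical heuristic: flip the `budget` comparisons whose currently
    losing item lies closest (Euclidean distance) to `target`."""
    def loser(row):
        xa, xb, y = row
        return xb if y == 1 else xa
    order = sorted(range(len(data)),
                   key=lambda i: math.dist(loser(data[i]), target))
    flipped = sorted(order[:budget])
    poisoned = [(xa, xb, 1 - y) if i in flipped else (xa, xb, y)
                for i, (xa, xb, y) in enumerate(data)]
    return poisoned, flipped

def reward(w, x):
    """Linear reward of item x under weights w."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy preference data: (features of a, features of b, label).
clean = [
    ((1.0, 0.0), (0.0, 1.0), 1),
    ((0.2, 0.9), (0.9, 0.1), 0),
    ((0.5, 0.5), (1.0, 1.0), 1),
    ((0.0, 0.0), (0.1, 1.0), 1),
]
target = (0.0, 1.0)  # outcome the attacker wants to promote

poisoned, flipped = flip_nearest(clean, target, budget=2)
w_clean = fit(clean, dim=2)
w_poisoned = fit(poisoned, dim=2)

print("flipped comparison indices:", flipped)
print("target reward, clean model:", reward(w_clean, target))
print("target reward, poisoned model:", reward(w_poisoned, target))
```

The toy budget (2 of 4 comparisons) is far larger than the 0.3% poisoning rate reported in the abstract; it is chosen only so that the effect is visible on four data points.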

Related papers