Fugu-MT Paper Translation (Summary): GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

Paper summary: GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

arxiv url: http://arxiv.org/abs/2510.09260v1
Date: Fri, 10 Oct 2025 10:59:14 GMT
Status: Translation complete
System update date: 2025-10-14 00:38:48.787694
Title: GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis
Title (reference translation): GREAT: Generalizable backdoor attacks on RLHF via emotion-aware trigger synthesis
Authors: Subrat Kishore Dutta, Yuelin Xu, Piyush Pant, Xiao Zhang
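Abstract summary: We develop GREAT, a framework for crafting generalizable backdoors in RLHF. GREAT targets harmful response generation for a vulnerable user subgroup characterized by both semantically violent requests and emotionally angry triggers. In experiments on benchmark RLHF datasets, GREAT significantly outperformed baseline methods in attack success rate.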
Reference score (independently computed attention score): 3.788454434972296
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent work has shown that RLHF is highly susceptible to backdoor attacks, poisoning schemes that inject malicious triggers in preference data. However, existing methods often rely on static, rare-token-based triggers, limiting their effectiveness in realistic scenarios. In this paper, we develop GREAT, a novel framework for crafting generalizable backdoors in RLHF through emotion-aware trigger synthesis. Specifically, GREAT targets harmful response generation for a vulnerable user subgroup characterized by both semantically violent requests and emotionally angry triggers. At the core of GREAT is a trigger identification pipeline that operates in the latent embedding space, leveraging principal component analysis and clustering techniques to identify the most representative triggers. To enable this, we present Erinyes, a high-quality dataset of over $5000$ angry triggers curated from GPT-4.1 using a principled, hierarchical, and diversity-promoting approach. Experiments on benchmark RLHF datasets demonstrate that GREAT significantly outperforms baseline methods in attack success rates, especially for unseen trigger scenarios, while largely preserving the response quality on benign inputs.
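Abstract (reference translation): Recent work has shown that RLHF is highly susceptible to backdoor attacks, that is, poisoning that injects malicious triggers into preference data. However, existing methods often rely on static, rare triggers, which limits their effectiveness in realistic scenarios. In this paper, we develop GREAT, a new framework for building generalizable backdoors in RLHF through emotion-aware trigger synthesis. In particular, GREAT targets harmful response generation for a vulnerable user subgroup characterized by both semantically violent requests and emotionally angry triggers. The trigger identification pipeline at the core of GREAT uses principal component analysis and clustering techniques to identify the most representative triggers. To make this possible, we introduce Erinyes, a high-quality dataset of more than 5000 angry triggers curated from GPT-4.1. Experiments on benchmark RLHF datasets show that GREAT significantly outperforms baseline methods in attack success rate, especially in unseen-trigger scenarios, while preserving response quality on benign inputs.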
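
The abstract says the pipeline works in a latent embedding space and combines principal component analysis with clustering to pick the most representative items. It does not give the encoder, the number of components, the clustering algorithm, or the selection rule. The sketch below is therefore only a generic illustration of "reduce with PCA, cluster, keep the point nearest each centroid" on toy vectors, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, power iteration as the PCA solver, k-means as the clustering step, and nearest-to-centroid as the meaning of "representative".

```python
import math
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def center(rows):
    """Subtract the column-wise mean from every row."""
    mean = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(r, mean)] for r in rows]


def top_components(rows, k, iters=200, seed=0):
    """Approximate the top-k principal directions of centered rows
    using power iteration with Gram-Schmidt deflation (assumed solver)."""
    rng = random.Random(seed)
    dim = len(rows[0])
    comps = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(dim)]
        for _ in range(iters):
            proj = [sum(x * y for x, y in zip(r, v)) for r in rows]  # X v
            # X^T (X v): scatter matrix applied to v
            w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(dim)]
            for c in comps:  # remove directions already found
                d = sum(x * y for x, y in zip(w, c))
                w = [x - d * y for x, y in zip(w, c)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        comps.append(v)
    return comps


def kmeans_centroids(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            groups[nearest].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def select_representatives(vectors, n_components=2, n_clusters=3):
    """Return sorted indices of the vectors closest to each cluster centroid
    after projecting onto the top principal components."""
    centered = center(vectors)
    comps = top_components(centered, n_components)
    reduced = [[sum(x * y for x, y in zip(r, c)) for c in comps] for r in centered]
    centroids = kmeans_centroids(reduced, n_clusters)
    picks = {min(range(len(reduced)), key=lambda i: dist2(reduced[i], c))
             for c in centroids}
    return sorted(picks)


def toy_group(cx, cy):
    """Five toy 3-D vectors: a centre point followed by four near neighbours."""
    return [[cx, cy, 0.0], [cx + 0.1, cy, 0.0], [cx - 0.1, cy, 0.0],
            [cx, cy + 0.1, 0.0], [cx, cy - 0.1, 0.0]]


# Hypothetical stand-in for embeddings: three tight groups, centre listed first.
toy = toy_group(0, 0) + toy_group(5, 0) + toy_group(0, 5)
print(select_representatives(toy, n_components=2, n_clusters=3))
```

On this toy input the selected indices are the three group centres (positions 0, 5, and 10), since each centre coincides with its cluster mean. With real embeddings a library implementation of PCA and k-means would normally replace the hand-written solvers here.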

Related papers