Fugu-MT Paper Translation (Abstract): Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

Paper summary: Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

arxiv url: http://arxiv.org/abs/2605.19147v1
Date: Mon, 18 May 2026 21:56:36 GMT
Status: Translation complete
System update date: 2026-05-20 15:03:09.008199
Title: Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
Authors: John T. Halloran, Noopur S. Bhatt
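Abstract summary: We explore open-book benign rewriting (OBBR), in which an LLM rewrites training samples with access to open-book benign samples, as a proactive defense against data poisoning attacks. OBBR increases safety performance by an average of 51% compared to state-of-the-art BA defenses and by 25.7% compared to closed-book rewriting methods.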
Reference score (independently computed attention score): 0.42970700836450476
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples--termed open-book benign rewriting (OBBR)--the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average 51% compared to state-of-the-art BA defenses and 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger based data poisoning attacks.
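The abstract contrasts closed-book rewriting with open-book benign rewriting (OBBR) but gives no implementation details. The following is only a minimal sketch of how the two prompt styles might differ, assuming the rewriter LLM is exposed as a plain callable from prompt string to output string. The function names, the prompt wording, and the `rewriter` interface are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence


def build_closed_book_prompt(sample: str) -> str:
    """Closed-book (hypothetical wording): the rewriter sees only the sample."""
    return (
        "Rewrite the following training sample so that it is benign.\n\n"
        f"Sample:\n{sample}\n\nRewritten sample:"
    )


def build_open_book_prompt(sample: str, benign_examples: Sequence[str]) -> str:
    """Open-book (hypothetical wording): benign reference samples are
    placed in the rewriter's context before the sample to rewrite."""
    references = "\n".join(f"- {example}" for example in benign_examples)
    return (
        "Here are examples of benign samples:\n"
        f"{references}\n\n"
        "Rewrite the following training sample so that it is benign, "
        "in the style of the examples above.\n\n"
        f"Sample:\n{sample}\n\nRewritten sample:"
    )


def rewrite_dataset(
    samples: Sequence[str],
    benign_examples: Sequence[str],
    rewriter: Callable[[str], str],
) -> List[str]:
    """Pass every training sample through the rewriter before fine-tuning.

    `rewriter` is an assumed interface: any function mapping a prompt
    string to the model's output string.
    """
    return [
        rewriter(build_open_book_prompt(sample, benign_examples))
        for sample in samples
    ]
```

In this sketch the only difference between the two settings is whether benign reference samples appear in the rewriter's context. How the paper selects those references, and which rewriter model it uses, is not stated in the abstract.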
