Fugu-MT Paper Translation (Abstract): P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs

Paper overview: P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs

arxiv url: http://arxiv.org/abs/2510.04503v1
Date: Mon, 06 Oct 2025 05:45:23 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.693107
Title: P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs
Title (reference translation): P2P: A Poison-to-Poison Countermeasure for Reliable Backdoor Defense in LLMs
Authors: Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, Anh Tuan Luu
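Abstract summary: During fine-tuning, large language models (LLMs) are increasingly vulnerable to data-poisoning backdoor attacks. We propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. We show that P2P can neutralize malicious backdoors while preserving task performance.
Reference score (independently computed attention score): 49.908234151374785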
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: During fine-tuning, large language models (LLMs) are increasingly vulnerable to data-poisoning backdoor attacks, which compromise their reliability and trustworthiness. However, existing defense strategies suffer from limited generalization: they only work on specific attack types or task settings. In this study, we propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. P2P injects benign triggers with safe alternative labels into a subset of training samples and fine-tunes the model on this re-poisoned dataset by leveraging prompt-based learning. This enforces the model to associate trigger-induced representations with safe outputs, thereby overriding the effects of original malicious triggers. Thanks to this robust and generalizable trigger-based fine-tuning, P2P is effective across task settings and attack types. Theoretically and empirically, we show that P2P can neutralize malicious backdoors while preserving task performance. We conduct extensive experiments on classification, mathematical reasoning, and summary generation tasks, involving multiple state-of-the-art LLMs. The results demonstrate that our P2P algorithm significantly reduces the attack success rate compared with baseline models. We hope that the P2P can serve as a guideline for defending against backdoor attacks and foster the development of a secure and trustworthy LLM community.
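Abstract (reference translation): During fine-tuning, large language models (LLMs) are becoming increasingly vulnerable to data-poisoning backdoor attacks that undermine their reliability and trustworthiness. However, existing defense strategies suffer from limited generalization and work only for specific attack types or task settings. In this study, we propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. P2P injects benign triggers with safe alternative labels into a subset of the training samples and fine-tunes the model on this re-poisoned dataset by leveraging prompt-based learning. This makes the model associate trigger-induced representations with safe outputs, which in turn overrides the effects of the original malicious triggers. Thanks to this robust and generalizable trigger-based fine-tuning, P2P is effective across task settings and attack types. Theoretically and empirically, we show that P2P can neutralize malicious backdoors while preserving task performance. We conducted extensive experiments on classification, mathematical reasoning, and summary generation tasks involving multiple state-of-the-art LLMs. The results show that the P2P algorithm significantly reduces the attack success rate compared with baseline models. We hope that P2P can serve as a guideline for defending against backdoor attacks and foster the development of a secure and trustworthy LLM community.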
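
The re-poisoning step described in the abstract (benign triggers with safe alternative labels injected into a subset of training samples) can be sketched roughly as follows. This is only an illustrative sketch and not the authors' implementation: the function name `repoison`, the trigger string, the prepend-style injection, the `ratio` parameter, and the uniform random choice of the subset are all assumptions, since the abstract does not specify them. The subsequent prompt-based fine-tuning on the re-poisoned data is not shown.

```python
import random


def repoison(samples, benign_trigger, safe_label, ratio=0.1, seed=0):
    """Return a re-poisoned copy of `samples` (hypothetical helper).

    samples:        list of (text, label) pairs
    benign_trigger: string prepended to the chosen texts (assumed injection style)
    safe_label:     safe alternative label assigned to the chosen samples
    ratio:          fraction of samples to re-poison (assumed hyperparameter)
    """
    rng = random.Random(seed)
    n_chosen = int(len(samples) * ratio)
    # Pick the subset uniformly at random (an assumption, not from the paper).
    chosen = set(rng.sample(range(len(samples)), n_chosen))

    repoisoned = []
    for i, (text, label) in enumerate(samples):
        if i in chosen:
            # Attach the benign trigger and replace the label with the safe one.
            repoisoned.append((f"{benign_trigger} {text}", safe_label))
        else:
            # Leave all other samples untouched.
            repoisoned.append((text, label))
    return repoisoned


if __name__ == "__main__":
    data = [(f"review number {i}", i % 2) for i in range(10)]
    for text, label in repoison(data, "[SAFE]", "safe", ratio=0.3):
        print(label, "|", text)
```

Fine-tuning on the output of such a step is what, per the abstract, lets the model associate trigger-induced representations with safe outputs; how the benign trigger and the safe labels are actually chosen is left to the paper.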


Related papers