Fugu-MT Paper Translation (Abstract): Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs

Paper overview: Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs

arxiv url: http://arxiv.org/abs/2508.20333v1
Date: Thu, 28 Aug 2025 00:30:25 GMT
Status: Translation complete
Last updated in system: 2025-08-29 18:12:01.87968
Title: Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs
Title (reference translation): "Poison Once, Refuse Forever": Weaponizing Alignment to Inject Bias into LLMs
Authors: Md Abdullah Al Mamun, Ihsen Alouani, Nael Abu-Ghazaleh
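Abstract summary: Large Language Models (LLMs) are aligned to meet ethical standards and safety requirements by training them to refuse to answer harmful or unsafe prompts. This paper demonstrates how adversaries can exploit LLMs' alignment to implant bias or enforce targeted censorship.
Reference score (independently computed attention score): 5.282422823698107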
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are aligned to meet ethical standards and safety requirements by training them to refuse answering harmful or unsafe prompts. In this paper, we demonstrate how adversaries can exploit LLMs' alignment to implant bias, or enforce targeted censorship without degrading the model's responsiveness to unrelated topics. Specifically, we propose Subversive Alignment Injection (SAI), a poisoning attack that leverages the alignment mechanism to trigger refusal on specific topics or queries predefined by the adversary. Although it is perhaps not surprising that refusal can be induced through overalignment, we demonstrate how this refusal can be exploited to inject bias into the model. Surprisingly, SAI evades state-of-the-art poisoning defenses including LLM state forensics, as well as robust aggregation techniques that are designed to detect poisoning in federated learning (FL) settings. We demonstrate the practical dangers of this attack by illustrating its end-to-end impacts on LLM-powered application pipelines. For chat-based applications such as ChatDoctor, with 1% data poisoning, the system refuses to answer healthcare questions for a targeted racial category, leading to high bias ($\Delta DP$ of 23%). We also show that bias can be induced in other NLP tasks: for a resume selection pipeline aligned to refuse to summarize CVs from a selected university, high bias in selection ($\Delta DP$ of 27%) results. Even higher bias ($\Delta DP$~38%) results on 9 other chat-based downstream applications.
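Abstract (reference translation): Large Language Models (LLMs) are aligned to meet ethical standards and safety requirements by being trained to refuse to answer harmful or unsafe prompts. In this paper, we show how an adversary can exploit an LLM's alignment to implant bias, or to enforce targeted censorship, without reducing the model's responsiveness on unrelated topics. Specifically, we propose Subversive Alignment Injection (SAI), a poisoning attack that uses the alignment mechanism to trigger refusals on specific topics or queries predefined by the adversary. It is perhaps unsurprising that over-alignment can induce refusals, but we demonstrate how these refusals can be exploited to inject bias into the model. Surprisingly, SAI evades state-of-the-art poisoning defenses, including LLM state forensics and the robust aggregation techniques designed to detect poisoning in FL settings. We demonstrate the practical danger of this attack by illustrating its end-to-end impact on LLM-powered application pipelines. For chat-based applications such as ChatDoctor, with 1% data poisoning the system refuses to answer healthcare questions for a targeted racial category, resulting in high bias ($\Delta DP$ of 23%). We also show that bias can be induced in other NLP tasks: a resume selection pipeline aligned to refuse to summarize CVs from a selected university exhibits high selection bias ($\Delta DP$ of 27%). Even higher bias ($\Delta DP$~38%) arises in 9 other chat-based downstream applications.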
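
Note on the bias metric: the abstract reports bias as $\Delta DP$ but does not define it on this page. Assuming it denotes the demographic parity difference (the gap between groups in the rate of a favorable outcome, here "the model answered instead of refusing"), the following is a minimal Python sketch of how such a gap could be computed. The function name, group labels, and toy data are hypothetical illustrations, not the authors' evaluation code or results.

```python
from collections import defaultdict


def demographic_parity_difference(groups, outcomes):
    """Return (gap, rates): the largest difference in favorable-outcome
    rate between any two groups, plus the per-group rates.

    groups   -- sequence of group labels, one per query
    outcomes -- sequence of 0/1 flags, 1 = favorable (e.g. answered)
    """
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for group, outcome in zip(groups, outcomes):
        totals[group] += 1
        favorable[group] += int(outcome)
    rates = {g: favorable[g] / totals[g] for g in totals}
    gap = max(rates.values()) - min(rates.values())
    return gap, rates


# Hypothetical toy data (not from the paper): 1 = answered, 0 = refused.
groups = ["A"] * 10 + ["B"] * 10
answered = [1] * 9 + [0] * 1 + [1] * 6 + [0] * 4

gap, rates = demographic_parity_difference(groups, answered)
print(f"answer rates per group: {rates}")
print(f"demographic parity difference: {gap:.2f}")
```

In this toy example group A is answered 90% of the time and group B 60% of the time, so the gap is 0.30. Under the stated assumption, a reported $\Delta DP$ of 23% would correspond to a 0.23 gap in such rates between the targeted group and the others.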

Related papers