Fugu-MT paper translation (abstract): LLMs Can Unlearn Refusal with Only 1,000 Benign Samples

Paper overview: LLMs Can Unlearn Refusal with Only 1,000 Benign Samples

arxiv url: http://arxiv.org/abs/2601.19231v1
Date: Tue, 27 Jan 2026 05:59:56 GMT
Status: Translation complete
Last updated in system: 2026-02-04 13:54:26.214202
Title: LLMs Can Unlearn Refusal with Only 1,000 Benign Samples
Authors: Yangyang Guo, Ziwei Xu, Si Liu, Zhiming Zheng, Mohan Kankanhalli
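Abstract summary: This study reveals a previously unexplored vulnerability in the safety alignment of large language models. Existing aligned LLMs mostly respond to unsafe queries with refusals, which often begin with one of a fixed set of prefixes. The study proposes a new refusal unlearning technique that exploits this rigid pattern.
Reference score (attention metric computed independently by this site): 23.047329180544775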
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This study reveals a previously unexplored vulnerability in the safety alignment of Large Language Models (LLMs). Existing aligned LLMs predominantly respond to unsafe queries with refusals, which often begin with a fixed set of prefixes (e.g., "I'm sorry"). We demonstrate that this rigid refusal pattern is a vulnerability and introduce a novel refusal unlearning technique that exploits it. Specifically, we fine-tune LLMs using merely 1,000 benign samples, where each response is prepended with a refusal prefix. The underlying intuition is to disrupt the refusal completion pathway, thereby driving the model to forget how to refuse while following harmful instructions. This intuition is further supported by theoretical proofs. We apply this approach to a total of 16 LLMs, including various open-source models from the Llama, Qwen, and Gemma families, as well as closed-source models such as Gemini and GPT. Experimental results show that the safety scores of previously aligned LLMs degrade both consistently and substantially. Importantly, we verify that the observed gain cannot be attributed to plain fine-tuning or random prefix effects. Our findings suggest that current safety alignment may rely heavily on token sequence memorization rather than reasoning, motivating future work beyond simple refusal mechanisms. Code has been released: https://github.com/guoyang9/refusal-unlearning.
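The abstract's starting observation, that refusals tend to open with one of a small set of fixed prefixes, can be checked on any collection of model responses with simple prefix matching. The following is a minimal sketch for illustration only, not the paper's evaluation code: the prefix list is an assumption (the abstract names only "I'm sorry"), and the function names `starts_with_refusal` and `refusal_prefix_rate` are hypothetical.

```python
from typing import Iterable, Sequence

# Assumed, illustrative prefix list; only "I'm sorry" appears in the abstract.
REFUSAL_PREFIXES: tuple[str, ...] = ("I'm sorry", "I cannot", "I can't")


def starts_with_refusal(
    response: str, prefixes: Sequence[str] = REFUSAL_PREFIXES
) -> bool:
    """Return True if the response begins with one of the given prefixes.

    Matching ignores leading whitespace and letter case, and treats a
    typographic apostrophe the same as a plain one.
    """
    text = response.lstrip().lower().replace("\u2019", "'")
    return any(text.startswith(p.lower()) for p in prefixes)


def refusal_prefix_rate(
    responses: Iterable[str], prefixes: Sequence[str] = REFUSAL_PREFIXES
) -> float:
    """Fraction of responses that open with a refusal prefix (0.0 if empty)."""
    items = list(responses)
    if not items:
        return 0.0
    hits = sum(starts_with_refusal(r, prefixes) for r in items)
    return hits / len(items)


if __name__ == "__main__":
    sample = [
        "I'm sorry, but I can't help with that.",
        "Sure, here is a summary of the article.",
        "  i cannot assist with this request.",
        "Here you go.",
    ]
    print(f"refusal-prefix rate: {refusal_prefix_rate(sample):.2f}")
```

A high rate on responses to unsafe queries would be consistent with the rigid pattern the abstract describes. Prefix matching is a coarse proxy: it misses refusals phrased differently and says nothing about whether the rest of the response is safe.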


List of related papers