Fugu-MT Paper Translation (Abstract): Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

Paper abstract: Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

arxiv url: http://arxiv.org/abs/2604.10681v2
Date: Thu, 16 Apr 2026 17:29:53 GMT
Status: Translation complete
System update date: 2026-04-17 16:09:14.148191
Title: Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models
Title (reference translation): Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models
Authors: Vu Tuan Truong, Long Bao Le
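Abstract summary: Large language models (LLMs) have been shown to be vulnerable to backdoor attacks. Recent advances exploit the long-form reasoning tendencies of modern LLMs to carry out reasoning-level backdoors. The authors propose Critical-CoT, a novel defense mechanism that performs two-stage fine-tuning on LLMs.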
Reference score (independently computed attention score): 4.4331439696271415
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs), despite their impressive capabilities across domains, have been shown to be vulnerable to backdoor attacks. Prior backdoor strategies predominantly operate at the token level, where an injected trigger causes the model to generate a specific target word, choice, or class (depending on the task). Recent advances, however, exploit the long-form reasoning tendencies of modern LLMs to conduct reasoning-level backdoors: once triggered, the victim model inserts one or more malicious reasoning steps into its chain-of-thought (CoT). These attacks are substantially harder to detect, as the backdoored answer remains plausible and consistent with the poisoned reasoning trajectory. Yet, defenses tailored to this type of backdoor remain largely unexplored. To bridge this gap, we propose Critical-CoT, a novel defense mechanism that conducts a two-stage fine-tuning (FT) process on LLMs to develop critical thinking behaviors, enabling them to automatically identify potential backdoors and refuse to generate malicious reasoning steps. Extensive experiments across multiple LLMs and datasets demonstrate that Critical-CoT provides strong robustness against both in-context learning-based and FT-based backdoor attacks. Notably, Critical-CoT exhibits strong cross-domain and cross-task generalization. Our code is available at https://github.com/tuanvu171/Critical-CoT.
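Abstract (reference translation): Large language models (LLMs), despite their impressive capabilities across domains, have been shown to be vulnerable to backdoor attacks. Earlier backdoor strategies operate mainly at the token level, where an injected trigger causes the model to generate a specific target word, choice, or class (depending on the task). Recent advances, however, exploit the long-form reasoning tendencies of modern LLMs to carry out reasoning-level backdoors: once triggered, the victim model inserts one or more malicious reasoning steps into its chain-of-thought (CoT). These attacks are considerably harder to detect because the backdoored answer is consistent with the poisoned reasoning trajectory. Yet defenses tailored to this type of backdoor remain largely unexplored. To bridge this gap, we propose Critical-CoT, a novel defense mechanism that performs a two-stage fine-tuning (FT) process on LLMs. Extensive experiments across multiple LLMs and datasets show that Critical-CoT provides strong robustness against both in-context-learning-based and FT-based backdoor attacks. Notably, Critical-CoT exhibits strong cross-domain and cross-task generalization. Our code is available at https://github.com/tuanvu171/Critical-CoT.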

Related papers