Fugu-MT Paper Translation (Summary): Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

Paper summary: Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

arxiv url: http://arxiv.org/abs/2605.10146v1
Date: Mon, 11 May 2026 07:54:05 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.619626
Title: Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing
Authors: Qinghua Mao, Xi Lin, Jinze Gu, Jun Wu, Siyuan Li, Yuliang Chen,
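Abstract summary: Large language models (LLMs) rely on knowledge editing to support knowledge-intensive reasoning. Malicious knowledge editing can reliably induce incorrect or unsafe reasoning while preserving general capabilities. This paper presents EditRisk-Bench, a benchmark for systematically evaluating the safety risks of knowledge-intensive reasoning under malicious knowledge editing.
Reference score (independently computed attention score): 11.663236025824121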
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) increasingly rely on knowledge editing to support knowledge-intensive reasoning, but this flexibility also introduces critical safety risks: adversaries can inject malicious or misleading knowledge that corrupts downstream reasoning and leads to harmful outcomes. Existing knowledge editing benchmarks primarily focus on editing efficacy and lack a unified framework for systematically evaluating the safety implications of edited knowledge on reasoning behavior. To address this gap, we present EditRisk-Bench, a benchmark for systematically evaluating safety risks of knowledge-intensive reasoning under malicious knowledge editing. Unlike prior benchmarks that mainly emphasize edit success, generalization, and locality, EditRisk-Bench focuses on how injected knowledge affects downstream reasoning behavior and reliability. It integrates diverse malicious scenarios, including misinformation, bias, and safety violations, together with multi-level knowledge-intensive reasoning tasks and representative editing strategies within a unified evaluation framework measuring attack effectiveness, reasoning correctness, and side effects. Extensive experiments on both open-source and closed-source LLMs show that malicious knowledge editing can reliably induce incorrect or unsafe reasoning while largely preserving general capabilities, making such risks difficult to detect. We further identify several key factors influencing these risks, including edit scale, knowledge characteristics, and reasoning complexity. EditRisk-Bench provides an extensible testbed for understanding and mitigating safety risks in knowledge editing for LLMs.

Related papers