Fugu-MT Paper Translation (Abstract): LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning

Paper overview: LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning

arxiv url: http://arxiv.org/abs/2605.01047v1
Date: Fri, 01 May 2026 19:20:31 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.555962
Title: LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning
Title (reference translation): LLM Ghostbusters: Hallucination Suppression via Adaptive Unlearning
Authors: Joseph Spracklen, Pedram Aghazadeh, Farinaz Koushanfar, Murtuza Jadliwala
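Abstract summary: We propose Adaptive Unlearning (AU), a post-deployment framework that surgically suppresses hallucinations while preserving general model utility. The results show that AU reduces package hallucination rates by 81%, a substantial reduction in the slopsquatting attack surface. The analysis shows that distributional changes are concentrated on package-related generations and leave general coding behavior largely unaffected.
Reference score (independently computed attention score): 12.855537727854975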
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hallucinations, outputs that sound plausible but are factually incorrect, remain an open challenge for deployed LLMs. In code generation, models frequently hallucinate non-existent software packages, recommending imports and installation commands for fictional libraries. This creates a critical supply-chain vulnerability: an attacker can proactively register such packages on public registries with malicious payloads that are subsequently installed and executed by developers or autonomous agents, a class of package confusion attack known as slopsquatting. Once a model is deployed, mitigating this failure mode is difficult: full retraining is costly, and existing approaches either cause severe degradation of model utility or rely on a pre-specified forget-set, an assumption that does not apply to the unbounded space of hallucinations. To address this problem, we present Adaptive Unlearning (AU), a post-deployment framework that surgically suppresses hallucinations while preserving general model utility. AU introduces a hybrid token-level objective that simultaneously reinforces valid outputs and suppresses hallucinated ones. Combined with an adaptive discovery loop that continuously surfaces new hallucination-inducing contexts without human supervision, AU enables generalization to unseen prompts and hallucinations. We demonstrate that AU reduces package hallucination rates by 81%, corresponding to a substantial reduction in slopsquatting attack surface, while maintaining performance on standard coding benchmarks. Our analysis shows that distributional changes are concentrated on package-related generations, leaving general coding behavior largely unaffected and confirming that AU's effect is isolated to the targeted distribution. AU operates entirely on model-generated data, requires no human annotation, and generalizes across domains.
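Abstract (reference translation): Hallucinations, outputs that sound plausible but are factually incorrect, remain an open challenge for deployed LLMs. In code generation, models hallucinate non-existent software packages, recommending imports and installation commands for fictional libraries. An attacker can proactively register such packages on public registries with malicious payloads, which are subsequently installed and executed by developers or autonomous agents. Full retraining is costly, and existing approaches either severely degrade model utility or rely on a pre-specified forget-set. To address this problem, we propose Adaptive Unlearning (AU), which surgically suppresses hallucinations while preserving general model utility. AU introduces a hybrid token-level objective that simultaneously reinforces valid outputs and suppresses hallucinated ones. Combined with an adaptive discovery loop that continuously surfaces new hallucination-inducing contexts without human supervision, AU generalizes to unseen prompts and hallucinations. We show that AU reduces package hallucination rates by 81%, corresponding to a substantial reduction in slopsquatting attack surface, while maintaining performance on standard coding benchmarks. The analysis shows that distributional changes are concentrated on package-related generations, leaving general coding behavior largely unaffected and confirming that AU's effect is isolated to the targeted distribution. AU operates entirely on model-generated data, requires no human annotation, and generalizes across domains.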
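
The abstract describes a hybrid token-level objective that reinforces valid outputs and suppresses hallucinated ones, but it does not give its form. Purely as an illustration, the sketch below combines a standard negative log-likelihood term on tokens labeled valid with an unlikelihood-style term on tokens labeled hallucinated. The function name `hybrid_token_loss`, the +1/-1/0 label encoding, and the weight `alpha` are all assumptions made for this example and are not taken from the paper.

```python
import math


def hybrid_token_loss(token_probs, labels, alpha=1.0, eps=1e-12):
    """Toy hybrid loss for one generated sequence (illustrative only).

    token_probs: probability the model assigned to each emitted token.
    labels: +1 for a token belonging to a valid package name (reinforce),
            -1 for a token belonging to a hallucinated name (suppress),
             0 for any other token (ignored, so it contributes no loss).
    alpha: hypothetical weight on the suppression term.
    """
    if len(token_probs) != len(labels):
        raise ValueError("token_probs and labels must have the same length")

    reinforce = 0.0  # negative log-likelihood over valid tokens
    suppress = 0.0   # unlikelihood term over hallucinated tokens
    for p, label in zip(token_probs, labels):
        if label > 0:
            # Minimizing -log(p) pushes the valid token's probability up.
            reinforce += -math.log(max(p, eps))
        elif label < 0:
            # Minimizing -log(1 - p) pushes the hallucinated token's
            # probability down.
            suppress += -math.log(max(1.0 - p, eps))
    return reinforce + alpha * suppress
```

Under these assumptions, tokens labeled 0 leave the loss untouched, which is one simple way to confine updates to package-related tokens; whether AU achieves its isolation this way is not stated in the abstract.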

Related papers