Fugu-MT Paper Translation (Summary): Mechanistic Anomaly Detection via Functional Attribution

Paper summary: Mechanistic Anomaly Detection via Functional Attribution

arxiv url: http://arxiv.org/abs/2604.18970v1
Date: Tue, 21 Apr 2026 01:39:57 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.568248
Title: Mechanistic Anomaly Detection via Functional Attribution
Authors: Hugo Lyons Keenan, Christopher Leckie, Sarah Erfani
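Abstract summary: The paper reframes mechanistic anomaly detection in neural networks as a functional attribution problem. For backdoors in vision models, the method achieves state-of-the-art detection on BackdoorBench. The results establish functional attribution as an effective, modality-agnostic tool for detecting anomalous behavior in deployed models.
Reference score (attention score computed independently by this site): 6.1937472685875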
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We can often verify the correctness of neural network outputs using ground truth labels, but we cannot reliably determine whether the output was produced by normal or anomalous internal mechanisms. Mechanistic anomaly detection (MAD) aims to flag these cases, but existing methods either depend on latent space analysis, which is vulnerable to obfuscation, or are specific to particular architectures and modalities. We reframe MAD as a functional attribution problem: asking to what extent samples from a trusted set can explain the model's output, where attribution failure signals anomalous behavior. We operationalize this using influence functions, measuring functional coupling between test samples and a small reference set via parameter-space sampling. We evaluate across multiple anomaly types and modalities. For backdoors in vision models, our method achieves state-of-the-art detection on BackdoorBench, with an average Defense Effectiveness Rating (DER) of 0.93 across seven attacks and four datasets (next best 0.83). For LLMs, we similarly achieve a significant improvement over baselines for several backdoor types, including on explicitly obfuscated models. Beyond backdoors, our method can detect adversarial and out-of-distribution samples, and distinguishes multiple anomalous mechanisms within a single model. Our results establish functional attribution as an effective, modality-agnostic tool for detecting anomalous behavior in deployed models.
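The abstract does not spell out how "functional coupling" is measured, so the following is only a minimal sketch of one way the idea could look, not the authors' algorithm. It assumes coupling can be approximated by perturbing the parameters with small Gaussian noise and checking whether the test sample's output moves together with the outputs of the trusted reference samples. The names `coupling_anomaly_score`, `linear_model`, `n_samples`, and `sigma` are hypothetical, and the linear toy model stands in for a real network.

```python
import numpy as np

def coupling_anomaly_score(model_fn, theta, x_test, x_ref,
                           n_samples=512, sigma=1e-2, seed=0):
    """Return 1 - max |correlation| between the test sample's output change
    and each reference sample's output change under random parameter
    perturbations. Values near 0 mean the test output is well explained by
    some reference sample; values near 1 mean attribution fails.
    (Illustrative assumption, not the paper's estimator.)"""
    rng = np.random.default_rng(seed)
    base_test = model_fn(theta, x_test)
    base_ref = np.array([model_fn(theta, x) for x in x_ref])

    d_test = np.empty(n_samples)
    d_ref = np.empty((n_samples, len(x_ref)))
    for k in range(n_samples):
        # Sample a nearby point in parameter space.
        theta_k = theta + sigma * rng.standard_normal(theta.shape)
        d_test[k] = model_fn(theta_k, x_test) - base_test
        d_ref[k] = np.array([model_fn(theta_k, x) for x in x_ref]) - base_ref

    # Pearson correlation of the test response with each reference response.
    d_test_c = d_test - d_test.mean()
    d_ref_c = d_ref - d_ref.mean(axis=0)
    denom = np.linalg.norm(d_test_c) * np.linalg.norm(d_ref_c, axis=0) + 1e-12
    corr = (d_test_c @ d_ref_c) / denom
    return 1.0 - float(np.max(np.abs(corr)))

def linear_model(theta, x):
    # Hypothetical toy model: scalar output of a linear map.
    return float(theta @ x)

theta = np.array([1.0, -0.5, 0.25, 0.0])
# Small trusted reference set; all samples use only the first two features.
x_ref = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0, 0.0],
                  [0.8, 0.2, 0.0, 0.0]])
x_normal = np.array([0.95, 0.05, 0.0, 0.0])    # resembles the reference set
x_anomalous = np.array([0.0, 0.0, 0.0, 1.0])   # relies on an unused feature

normal_score = coupling_anomaly_score(linear_model, theta, x_normal, x_ref)
anomalous_score = coupling_anomaly_score(linear_model, theta, x_anomalous, x_ref)
print(f"normal: {normal_score:.3f}, anomalous: {anomalous_score:.3f}")
```

In this toy setting the sample that depends on a parameter no reference sample touches receives a much higher score than the one that resembles the reference set. A real detector would need a threshold calibrated on held-out trusted data, which this sketch leaves out.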

Related papers