Fugu-MT Paper Translation (Abstract): Detecting AI-Generated Videos with Spiking Neural Networks

Paper overview: Detecting AI-Generated Videos with Spiking Neural Networks

arxiv url: http://arxiv.org/abs/2605.05895v1
Date: Thu, 07 May 2026 09:08:32 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.650775
Title: Detecting AI-Generated Videos with Spiking Neural Networks
Authors: Minsuk Jang, Yujin Yang, Heeseon Kim, Minseok Son, Younghun Kim, Changick Kim
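Abstract summary: We propose MAST, a detector that processes multi-channel temporal residuals with a spike-driven temporal branch alongside a frozen semantic encoder for cross-generator generalization. On the GenVideo benchmark, MAST achieves 93.14% mean accuracy across 10 unseen generators under strict cross-generator evaluation.
Reference score (independently computed attention score): 26.67301552503132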
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern AI-generated videos are photorealistic at the single-frame level, leaving inter-frame dynamics as the main remaining axis for detection. Existing detectors typically handle this temporal evidence in three ways: feeding the full frame sequence to a generic temporal backbone, reducing one dominant temporal cue to fixed video-level descriptors, or comparing temporal features to real-video statistics through a detection metric. These strategies degrade sharply under cross-generator evaluation, where artifact type and timescale vary across generators. On the caption-paired benchmark GenVidBench, we identify two signatures that prior detectors do not jointly exploit: AI-generated videos exhibit smoother frame-to-frame temporal residuals at the pixel level, and more compact trajectories in the semantic feature space, indicating a temporal smoothness gap at both levels. We further observe that, when raw video is fed into a Spiking Neural Network (SNN), fake clips elicit firing predominantly at object and motion boundaries, unlike real clips, suggesting that the SNN responds to temporal artifacts localized at edges. These cues are sparse, asynchronous, and concentrated at moments of change, which makes SNNs a natural choice for this task: their event-driven, sparsely activated dynamics align with the structure of the residual signal in a way that dense ANN backbones do not. Building on this observation, we propose MAST, a detector that processes multi-channel temporal residuals with a spike-driven temporal branch alongside a frozen semantic encoder for cross-generator generalization. On the GenVideo benchmark, MAST achieves 93.14% mean accuracy across 10 unseen generators under strict cross-generator evaluation, matching or surpassing the strongest ANN-based detectors and demonstrating the practical applicability of SNNs to AI-generated video detection.
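
The two ingredients the abstract names, frame-to-frame temporal residuals and event-driven spiking dynamics, can be illustrated with a toy sketch. This is not MAST's implementation: the function names `temporal_residuals` and `lif_spikes`, the leaky integrate-and-fire (LIF) neuron model, the `threshold` and `decay` values, and the synthetic clip are all assumptions chosen only to show why a spiking unit driven by residuals fires sparsely and only where something changes.

```python
import numpy as np


def temporal_residuals(frames):
    """Absolute frame-to-frame differences for a clip of shape (T, H, W)."""
    frames = np.asarray(frames, dtype=np.float64)
    return np.abs(np.diff(frames, axis=0))  # shape (T-1, H, W)


def lif_spikes(inputs, threshold=1.0, decay=0.5):
    """Run one leaky integrate-and-fire neuron per pixel over time.

    inputs: array of shape (T, H, W) used as input current.
    Returns a 0/1 spike array with the same shape.
    The threshold and decay values are illustrative assumptions.
    """
    membrane = np.zeros(inputs.shape[1:])
    spikes = np.zeros_like(inputs)
    for t, current in enumerate(inputs):
        membrane = decay * membrane + current      # leak, then integrate
        fired = membrane >= threshold              # threshold crossing
        spikes[t] = fired
        membrane = np.where(fired, 0.0, membrane)  # hard reset after a spike
    return spikes


# Synthetic clip (assumption): a 3x3 bright square moving one pixel
# to the right per frame over a static black background.
T, H, W = 6, 8, 8
clip = np.zeros((T, H, W))
for t in range(T):
    clip[t, 2:5, t:t + 3] = 1.0

residual = temporal_residuals(clip)
spikes = lif_spikes(residual)

print("residual shape:", residual.shape)
print("total spikes:", int(spikes.sum()))
print("fraction of active units:", spikes.mean())
```

In this toy clip the residual is nonzero only at the leading and trailing edges of the moving square, so the neurons stay silent on the static background and in the square's interior. A real detector would replace the hand-set neuron with trained spiking layers and would operate on multi-channel residuals, as the abstract describes.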

Related papers