Fugu-MT Paper Translation (Abstract): CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection

Paper abstract: CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection

arxiv url: http://arxiv.org/abs/2605.00630v1
Date: Fri, 01 May 2026 13:04:14 GMT
Status: translation complete
System update date: 2026-05-04 17:43:28.955216
Title: CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection
Authors: Hang Wang, Chao Shen, Chenhao Lin, Minghui Yang, Lei Zhang, Cong Wang
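Abstract summary: Advanced AI video synthesis techniques pose an unprecedented challenge to digital video authenticity. This paper identifies a distinctive fingerprint in AI-generated videos, termed the cross-modal temporal artifact (CMTA), and proposes the CMTA framework, a cross-modal detection approach that captures these unique temporal artifacts.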
Reference score (independently calculated attention score): 45.739302264021795
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The proliferation of advanced AI video synthesis techniques poses an unprecedented challenge to digital video authenticity. Existing AI-generated video (AIGV) detection methods primarily focus on uni-modal or spatiotemporal artifacts, but they overlook the rich cues within the visual-textual cross-modal space, especially the temporal stability of semantic alignment. In this work, we identify a distinctive fingerprint in AIGVs, termed cross-modal temporal artifact (CMTA). Unlike real videos that exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations, AIGVs display unnaturally stable semantic trajectories governed by given input prompts. To bridge this gap, we propose the CMTA framework, a cross-modal detection approach that captures these unique temporal artifacts through joint cross-modal embedding and multi-grained temporal modeling. Specifically, CMTA leverages BLIP to generate frame-level image captions and utilizes CLIP to extract corresponding visual-textual representations. A coarse-grained temporal modeling branch is then designed to characterize temporal fluctuations in cross-modal alignment with a GRU. In parallel, a fine-grained branch is constructed to capture intricate inter-frame variations from integrated visual-textual features with a Transformer encoder. Extensive experiments on 40 subsets across four large-scale datasets, including GenVideo, EvalCrafter, VideoPhy, and VidProM, validate that our approach sets a new state-of-the-art while exhibiting superior cross-generator generalization. Code and models of CMTA will be released at https://github.com/hwang-cs-ime/CMTA
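The core observation in the abstract, that generated videos show unnaturally stable cross-modal alignment over time, can be illustrated with a small sketch. It assumes that per-frame visual and caption embeddings have already been extracted (the abstract names BLIP for captioning and CLIP for embeddings); the function names, the toy vectors, and the use of a plain standard deviation are all assumptions made for illustration. This is not the paper's method, which models the trajectory with a GRU branch and a Transformer encoder branch.

```python
import math
from statistics import pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def alignment_trajectory(visual_embs, text_embs):
    """Per-frame visual-text alignment scores (hypothetical helper)."""
    return [cosine(v, t) for v, t in zip(visual_embs, text_embs)]


def alignment_fluctuation(trajectory):
    """Population standard deviation of the alignment scores over time.

    A hand-written stand-in for the learned temporal models in the paper.
    """
    return pstdev(trajectory)


# Toy data (made up): one clip whose caption alignment never changes,
# and one whose alignment drifts from frame to frame.
stable_visual = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
stable_text = [[1.0, 0.1], [1.0, 0.1], [1.0, 0.1]]
varying_visual = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
varying_text = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

stable_score = alignment_fluctuation(
    alignment_trajectory(stable_visual, stable_text))
varying_score = alignment_fluctuation(
    alignment_trajectory(varying_visual, varying_text))

print("stable clip fluctuation: ", stable_score)
print("varying clip fluctuation:", varying_score)
```

Under the abstract's hypothesis, a low fluctuation would be more typical of a prompt-driven generated clip and a higher one of real footage. A fixed threshold on this statistic is unlikely to generalize, which is presumably why the authors learn the temporal pattern instead.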

