Fugu-MT Paper Translation (Abstract): EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning

Paper overview: EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning

arxiv url: http://arxiv.org/abs/2510.16442v1
Date: Sat, 18 Oct 2025 10:34:05 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:38.996726
Title: EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning
Title (reference translation): EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning
Authors: Haoran Sun, Chen Cai, Huiping Zhuang, Kong Aik Lee, Lap-Pui Chau, Yi Wang
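Abstract summary: Deepfake video technology has not only facilitated artistic creation but also made misinformation easier to spread. Traditional deepfake video detection methods suffer from a lack of transparency in their principles and an insufficient ability to cope with forgery techniques. This paper proposes the explainable deepfake video detection (EDVD) task and designs the EDVD-LLaMA multimodal reasoning framework.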
Reference score (independently computed attention score): 58.42596067220998
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods face issues such as a lack of transparency in their principles and insufficient generalization capabilities to cope with evolving forgery techniques. This highlights an urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs EDVD-LLaMA, a multimodal large language model (MLLM) reasoning framework, which provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic information as input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ benchmark dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision for reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery-method and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The source code and dataset will be publicly available.
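Abstract (reference translation): The rapid development of deepfake video technology has not only promoted artistic creation but also made misinformation easier to spread. Conventional deepfake video detection (DVD) methods suffer from problems such as a lack of transparency in their principles and insufficient generalization ability to cope with evolving forgery techniques. This underscores the urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs EDVD-LLaMA, a multimodal large language model (MLLM) reasoning framework that provides traceable reasoning processes together with accurate detection results and trustworthy explanations. The proposed method first introduces Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract and fuse global and local cross-frame deepfake features, supplying rich spatio-temporal semantic information as input for MLLM reasoning. Second, it builds a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during reasoning to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and improve the reliability of the chain of thought. In addition, it constructs an Explainable Reasoning FF++ benchmark dataset (ER-FF++set), which uses structured data to annotate videos and ensure quality control, supporting dual supervision for reasoning and detection. Extensive experiments show that EDVD-LLaMA achieves excellent performance and robustness in detection accuracy, explainability, and the ability to handle cross-forgery-method and cross-dataset scenarios. Compared with previous DVD methods, it offers a more explainable and superior solution. The source code and dataset will be made publicly available.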


Related papers