Fugu-MT Paper Translation (Abstract): KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features

Paper abstract: KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features

arxiv url: http://arxiv.org/abs/2508.07337v1
Date: Sun, 10 Aug 2025 13:29:08 GMT
Status: Translation complete
Last updated in system: 2025-08-12 21:23:28.816722
Title: KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features
Title (reference translation): KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-Based Audio and Handcrafted Visual Features
Authors: Ivan Kukanov, Jun Wah Ng
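Abstract summary: We propose multimodal approaches for the AV-Deepfake1M 2025 challenge. For the visual modality, we leverage handcrafted features to improve interpretability and adaptability. For the audio modality, we apply a self-supervised learning backbone combined with graph attention networks to capture rich audio representations. Our approach balances performance and real-world deployment, with an emphasis on resilience and potential interpretability.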
Reference score (independently computed attention score): 1.488627850405606
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid development of audio-driven talking head generators and advanced Text-To-Speech (TTS) models has led to more sophisticated temporal deepfakes. These advances highlight the need for robust methods capable of detecting and localizing deepfakes, even under novel, unseen attack scenarios. Current state-of-the-art deepfake detectors, while accurate, are often computationally expensive and struggle to generalize to novel manipulation techniques. To address these challenges, we propose multimodal approaches for the AV-Deepfake1M 2025 challenge. For the visual modality, we leverage handcrafted features to improve interpretability and adaptability. For the audio modality, we adapt a self-supervised learning (SSL) backbone coupled with graph attention networks to capture rich audio representations, improving detection robustness. Our approach strikes a balance between performance and real-world deployment, focusing on resilience and potential interpretability. On the AV-Deepfake1M++ dataset, our multimodal system achieves an AUC of 92.78% on the deepfake classification task and an IoU of 0.3536 for temporal localization using only the audio modality.
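Abstract (reference translation): The rapid development of audio-driven talking head generators and advanced Text-To-Speech (TTS) models has brought about more sophisticated temporal deepfakes. These advances underscore the need for robust methods that can detect and localize deepfakes even under new, unseen attack scenarios. Current state-of-the-art deepfake detectors, although accurate, are computationally costly and struggle to generalize to new manipulation techniques. To address these challenges, we propose multimodal approaches for the AV-Deepfake1M 2025 challenge. For the visual modality, we leverage handcrafted features to improve interpretability and adaptability. For the audio modality, we apply a self-supervised learning (SSL) backbone combined with graph attention networks to capture rich audio representations and improve robustness. Our approach balances performance and real-world deployment, with an emphasis on resilience and potential interpretability. On the AV-Deepfake1M++ dataset, we achieve an AUC of 92.78% on the deepfake classification task and an IoU of 0.3536 for temporal localization using only the audio modality.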
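
The abstract reports an IoU figure for temporal localization. As a rough illustration of what an intersection-over-union score between a predicted and a ground-truth time segment measures, here is a minimal sketch. The function name `temporal_iou` and the `(start, end)` tuple convention in seconds are assumptions made for this example; this is not the challenge's official scoring code, and the paper's exact evaluation protocol may differ.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two time segments.

    Each segment is a (start, end) tuple in seconds (an assumed convention
    for this sketch, not taken from the paper).
    """
    pred_start, pred_end = pred
    true_start, true_end = truth
    # Overlap length; zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    # Union length = sum of both lengths minus the overlap.
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # A predicted fake segment of 1.0-3.0 s against a true segment of 2.0-4.0 s.
    print(temporal_iou((1.0, 3.0), (2.0, 4.0)))
```

A score of 1.0 means the predicted segment matches the ground truth exactly, and 0.0 means the two segments do not overlap at all.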


Related papers