Fugu-MT Paper Translation (Abstract): MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization

Paper overview: MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization

arxiv url: http://arxiv.org/abs/2605.10071v1
Date: Mon, 11 May 2026 06:52:07 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.58215
Title: MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization
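Title (reference translation): MFVLR: Multi-domain fine-grained vision-language reconstruction for generalizable diffusion face forgery detection and localization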
Authors: Yaning Zhang, Tianyi Wang, Zan Gao, Yibo Zhao, Chunjie Ma, Meng Wang
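Abstract summary: We propose a novel vision-language reconstruction model that achieves generalizable diffusion-synthesized face forgery detection and localization. We propose a multi-domain vision encoder that captures general visual forgery patterns across the image and residual domains. A vision decoder is designed to reconstruct image appearance and achieve forgery localization.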
Reference score (independently computed attention score): 25.460571005969
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The swift advancement in photo-realistic face generation technology has sparked considerable concerns across society and academia, emphasizing the need for generalizable face forgery detection and localization methods. Prior works tend to capture face forgery patterns across multiple domains using the image modality, while other modalities like fine-grained texts are not comprehensively investigated, which restricts the generalization capability of models. Besides, they usually analyze facial images created by GANs, but struggle to identify and localize those synthesized by diffusion. To solve these problems, in this paper, we devise a novel multi-domain fine-grained vision-language reconstruction (MFVLR) model, which explores comprehensive and diverse visual forgery traces via language-guided face forgery representation learning, to achieve generalizable diffusion-synthesized face forgery detection and localization (DFFDL). Specifically, we devise a fine-grained language transformer that studies general fine-grained language embeddings using language reconstruction. We propose a multi-domain vision encoder to capture general and complementary visual forgery patterns across the image and residual domains. A vision decoder is designed to reconstruct image appearance and achieve forgery localization. Besides, we propose an innovative plug-and-play vision injection module to enhance the interaction between the vision and language embeddings. Extensive experiments and visualizations demonstrate that our network outperforms the state of the art in different settings, such as cross-generator, cross-forgery, and cross-dataset evaluations.
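Abstract (reference translation): The rapid progress of photo-realistic face generation technology has raised considerable concern in society and academia, underscoring the need for generalizable face forgery detection and localization methods. Previous work tends to capture face forgery patterns across multiple domains using the image modality, while other modalities such as fine-grained text have not been comprehensively investigated, which limits the generalization ability of the models. In addition, such work usually analyzes face images generated by GANs but struggles to identify and localize images synthesized by diffusion. In this paper, we therefore propose a novel multi-domain fine-grained vision-language reconstruction (MFVLR) model that explores comprehensive and diverse visual forgery traces through language-guided face forgery representation learning and achieves generalizable diffusion-synthesized face forgery detection and localization (DFFDL). Specifically, we devise a fine-grained language transformer that studies general fine-grained language embeddings using language reconstruction. We propose a multi-domain vision encoder that captures general visual forgery patterns across the image and residual domains. A vision decoder is designed to reconstruct image appearance and achieve forgery localization. Furthermore, we propose an innovative plug-and-play vision injection module to enhance the interaction between vision and language embeddings. Extensive experiments and visualizations show that our network outperforms the state of the art in various settings, including cross-generator, cross-forgery, and cross-dataset evaluations.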

Related papers