Detecting Deepfakes with Multivariate Soft Blending and CLIP-based Image-Text Alignment
- URL: http://arxiv.org/abs/2602.15903v1
- Date: Sat, 14 Feb 2026 09:53:35 GMT
- Title: Detecting Deepfakes with Multivariate Soft Blending and CLIP-based Image-Text Alignment
- Authors: Jingwei Li, Jiaxin Tong, Pengfei Wu
- Abstract summary: The proliferation of highly realistic facial forgeries necessitates robust detection methods. Existing approaches often suffer from limited accuracy and poor generalization due to significant distribution shifts among samples generated by diverse forgery techniques. Our method leverages the multimodal alignment capabilities of CLIP to capture subtle forgery traces.
- Score: 4.34685509565816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The proliferation of highly realistic facial forgeries necessitates robust detection methods. However, existing approaches often suffer from limited accuracy and poor generalization due to significant distribution shifts among samples generated by diverse forgery techniques. To address these challenges, we propose a novel Multivariate and Soft Blending Augmentation with CLIP-guided Forgery Intensity Estimation (MSBA-CLIP) framework. Our method leverages the multimodal alignment capabilities of CLIP to capture subtle forgery traces. We introduce a Multivariate and Soft Blending Augmentation (MSBA) strategy that synthesizes images by blending forgeries from multiple methods with random weights, forcing the model to learn generalizable patterns. Furthermore, a dedicated Multivariate Forgery Intensity Estimation (MFIE) module is designed to explicitly guide the model in learning features related to varied forgery modes and intensities. Extensive experiments demonstrate state-of-the-art performance. On in-domain tests, our method improves Accuracy and AUC by 3.32% and 4.02%, respectively, over the best baseline. In cross-domain evaluations across five datasets, it achieves an average AUC gain of 3.27%. Ablation studies confirm the efficacy of both proposed components. While the reliance on a large vision-language model entails higher computational cost, our work presents a significant step towards more generalizable and robust deepfake detection.
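The MSBA strategy described in the abstract (blending forgeries from multiple methods with random weights) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper specifies only "random weights", so the Dirichlet sampling and the idea of reusing the weight vector as a soft intensity target for an MFIE-style head are assumptions.

```python
import numpy as np

def multivariate_soft_blend(forgeries, rng=None):
    """Blend several forgery images of the same face with random convex
    weights, in the spirit of the paper's MSBA strategy (sketch only).

    forgeries: list of K float arrays of shape (H, W, C) in [0, 1],
               one per forgery method.
    Returns the blended image and the sampled weight vector.
    """
    rng = np.random.default_rng() if rng is None else rng
    stack = np.stack(forgeries, axis=0)               # (K, H, W, C)
    # Sample convex weights over the K forgery methods (sum to 1).
    weights = rng.dirichlet(np.ones(len(forgeries)))  # (K,)
    # Weighted sum over the method axis yields the blended sample.
    blended = np.tensordot(weights, stack, axes=1)    # (H, W, C)
    # The weight vector could also serve as a soft per-method
    # "forgery intensity" regression target (cf. the MFIE module).
    return blended, weights
```

Because the weights are convex, the blended image stays in the valid pixel range whenever the inputs do, and the model never sees any single forgery method in isolation, which is the stated mechanism for learning generalizable patterns.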
Related papers
- MPA: Multimodal Prototype Augmentation for Few-Shot Learning [36.74394076733568]
Few-shot learning has become a popular task that aims to recognize new classes from only a few labeled examples. We propose a novel framework called MPA, including Multi-Variant Semantic Enhancement (LMSE), Hierarchical Multi-View Augmentation (HMA), and an Adaptive Uncertain Class Absorber (AUCA). MPA achieves superior performance compared to existing state-of-the-art methods across most settings.
arXiv Detail & Related papers (2026-02-09T08:30:31Z) - UMCL: Unimodal-generated Multimodal Contrastive Learning for Cross-compression-rate Deepfake Detection [37.37926854174864]
In deepfake detection, the varying degrees of compression employed by social media platforms pose significant challenges for model generalization and reliability. We propose a novel Unimodal-generated Multimodal Contrastive Learning framework for cross-compression-rate deepfake detection. Our method achieves superior performance across various compression rates and manipulation types, establishing a new benchmark for robust deepfake detection.
arXiv Detail & Related papers (2025-11-24T10:56:22Z) - Improving Progressive Generation with Decomposable Flow Matching [50.63174319509629]
Decomposable Flow Matching (DFM) is a simple and effective framework for the progressive generation of visual media. On Imagenet-1k 512px, DFM achieves 35.2% improvements in FDD scores over the base architecture and 26.4% over the best-performing baseline.
arXiv Detail & Related papers (2025-06-24T17:58:02Z) - Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates [37.65554922794508]
We introduce Multimodal Adversarial Compositionality (MAC) to generate deceptive text samples. We evaluate them through both sample-wise attack success rate and group-wise entropy-based diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities.
arXiv Detail & Related papers (2025-05-28T23:45:55Z) - MSFNet-CPD: Multi-Scale Cross-Modal Fusion Network for Crop Pest Detection [3.5148549831413036]
Accurate identification of agricultural pests is essential for crop protection. While deep learning has advanced pest detection, most existing approaches rely solely on low-level visual features.
arXiv Detail & Related papers (2025-05-05T08:10:22Z) - MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia. Although existing approaches mainly capture face forgery patterns using image modality, other modalities like fine-grained noises and texts are not fully explored. We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection(VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z) - Learning from Multi-Perception Features for Real-World Image Super-resolution [87.71135803794519]
We propose a novel SR method called MPF-Net that leverages multiple perceptual features of input images.
Our method incorporates a Multi-Perception Feature Extraction (MPFE) module to extract diverse perceptual information.
We also introduce a contrastive regularization term (CR) that improves the model's learning capability.
arXiv Detail & Related papers (2023-05-26T07:35:49Z) - Multi-Scale Positive Sample Refinement for Few-Shot Object Detection [61.60255654558682]
Few-shot object detection (FSOD) helps detectors adapt to unseen classes with few training instances.
We propose a Multi-scale Positive Sample Refinement (MPSR) approach to enrich object scales in FSOD.
MPSR generates multi-scale positive samples as object pyramids and refines the prediction at various scales.
arXiv Detail & Related papers (2020-07-18T09:48:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.