UMCL: Unimodal-generated Multimodal Contrastive Learning for Cross-compression-rate Deepfake Detection
- URL: http://arxiv.org/abs/2511.18983v1
- Date: Mon, 24 Nov 2025 10:56:22 GMT
- Title: UMCL: Unimodal-generated Multimodal Contrastive Learning for Cross-compression-rate Deepfake Detection
- Authors: Ching-Yi Lai, Chih-Yu Jian, Pei-Cheng Chuang, Chia-Ming Lee, Chih-Chung Hsu, Chiou-Ting Hsu, Chia-Wen Lin
- Abstract summary: In deepfake detection, the varying degrees of compression employed by social media platforms pose significant challenges for model generalization and reliability. We propose a novel Unimodal-generated Multimodal Contrastive Learning framework for cross-compression-rate deepfake detection. Our method achieves superior performance across various compression rates and manipulation types, establishing a new benchmark for robust deepfake detection.
- Score: 37.37926854174864
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In deepfake detection, the varying degrees of compression employed by social media platforms pose significant challenges for model generalization and reliability. Although existing methods have progressed from single-modal to multimodal approaches, they face critical limitations: single-modal methods struggle with feature degradation under data compression in social media streaming, while multimodal approaches require expensive data collection and labeling and suffer from inconsistent modal quality or accessibility in real-world scenarios. To address these challenges, we propose a novel Unimodal-generated Multimodal Contrastive Learning (UMCL) framework for robust cross-compression-rate (CCR) deepfake detection. In the training stage, our approach transforms a single visual modality into three complementary features: compression-robust rPPG signals, temporal landmark dynamics, and semantic embeddings from pre-trained vision-language models. These features are explicitly aligned through an affinity-driven semantic alignment (ASA) strategy, which models inter-modal relationships through affinity matrices and optimizes their consistency through contrastive learning. Subsequently, our cross-quality similarity learning (CQSL) strategy enhances feature robustness across compression rates. Extensive experiments demonstrate that our method achieves superior performance across various compression rates and manipulation types, establishing a new benchmark for robust deepfake detection. Notably, our approach maintains high detection accuracy even when individual features degrade, while providing interpretable insights into feature relationships through explicit alignment.
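The abstract's two training objectives lend themselves to a compact illustration. Below is a minimal sketch, not the authors' released code, of how they could be realized: an affinity-driven alignment term that compares the sample-to-sample affinity matrices induced by the three unimodal-generated feature streams (rPPG, landmark dynamics, semantic embeddings), and an InfoNCE-style cross-quality term that treats a compressed clip as the positive of its uncompressed counterpart. The function names, dimensions, temperatures, and the choice of symmetrized KL as the affinity-consistency measure are illustrative assumptions.

```python
# Hedged sketch of ASA + CQSL; all specifics beyond the abstract are assumptions.
import torch
import torch.nn.functional as F

def affinity(x: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Batch-wise cosine affinity matrix, softmax-normalized per row."""
    x = F.normalize(x, dim=-1)
    return F.softmax(x @ x.t() / temperature, dim=-1)

def asa_loss(rppg: torch.Tensor, landmarks: torch.Tensor,
             semantic: torch.Tensor) -> torch.Tensor:
    """Affinity-driven semantic alignment: encourage the three feature
    spaces to induce consistent sample-to-sample affinity structures."""
    mats = [affinity(f) for f in (rppg, landmarks, semantic)]
    loss = 0.0
    for i in range(len(mats)):
        for j in range(i + 1, len(mats)):
            # Symmetrized KL between row-stochastic affinity matrices
            # (one plausible contrastive-consistency measure).
            loss = loss + F.kl_div(mats[i].log(), mats[j], reduction="batchmean")
            loss = loss + F.kl_div(mats[j].log(), mats[i], reduction="batchmean")
    return loss / 6.0  # 3 pairs x 2 directions

def cqsl_loss(feat_raw: torch.Tensor, feat_compressed: torch.Tensor,
              temperature: float = 0.1) -> torch.Tensor:
    """Cross-quality similarity learning: the compressed view of a clip
    is the positive for its uncompressed view (InfoNCE-style)."""
    a = F.normalize(feat_raw, dim=-1)
    b = F.normalize(feat_compressed, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    B, D = 8, 128
    rppg, lmk, sem = (torch.randn(B, D) for _ in range(3))
    print(asa_loss(rppg, lmk, sem).item())
    print(cqsl_loss(sem, sem + 0.05 * torch.randn(B, D)).item())
```

In this reading, ASA constrains relational structure across modalities rather than raw feature values, which is one way a framework could remain accurate when an individual feature degrades, as the abstract claims.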
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. MPCO adaptively balances the importance of different paradigm representations and guides the global optimisation. Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
arXiv Detail & Related papers (2026-03-05T06:01:26Z)
- CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension [49.6969505536365]
We propose CREM, a unified framework that enhances multimodal representations for retrieval while preserving generative ability. CREM achieves state-of-the-art retrieval performance on MMEB while maintaining strong generative performance on multiple comprehension benchmarks.
arXiv Detail & Related papers (2026-02-22T08:09:51Z)
- From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z)
- Improving Deepfake Detection with Reinforcement Learning-Based Adaptive Data Augmentation [60.04281435591454]
CRDA (Curriculum Reinforcement-Learning Data Augmentation) is a novel framework guiding detectors to progressively master multi-domain forgery features. Central to our approach is the integration of reinforcement learning and causal inference. Our method significantly improves detector generalizability, outperforming SOTA methods across multiple cross-domain datasets.
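The summary suggests an agent that schedules augmentations by difficulty as the detector improves. A minimal sketch of that idea, assuming an epsilon-greedy bandit over compression-style augmentations rewarded when the detector stays correct; the arms and reward are placeholders, not the paper's algorithm:

```python
# Hedged sketch: bandit-style augmentation scheduling (illustrative only).
import random

class AugmentationBandit:
    def __init__(self, arms=("jpeg_q90", "jpeg_q50", "blur", "resize_down")):
        self.counts = {a: 1 for a in arms}
        self.values = {a: 0.0 for a in arms}

    def pick(self, eps=0.1):
        # Explore a random augmentation with probability eps,
        # otherwise exploit the currently best-rewarded arm.
        if random.random() < eps:
            return random.choice(list(self.counts))
        return max(self.values, key=self.values.get)

    def update(self, arm, reward):
        # Incremental mean of observed rewards (e.g., 1.0 if the
        # detector classified the augmented sample correctly).
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```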
arXiv Detail & Related papers (2025-11-10T12:45:52Z)
- Modest-Align: Data-Efficient Alignment for Vision-Language Models [67.48633659305592]
Cross-modal alignment models often suffer from overconfidence and degraded performance when operating in resource-constrained settings. We propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.
arXiv Detail & Related papers (2025-10-24T16:11:10Z)
- Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech [51.14752758616364]
Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments. We propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework. The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.
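As a rough illustration of the fusion described, here is a sketch of cross-attention over a learned mixture of SSL layers; the module name, dimensions, and the softmax layer weighting are assumptions, not the paper's exact architecture:

```python
# Hedged sketch: fusing multi-layer SSL speech features with cross-attention.
import torch
import torch.nn as nn

class MultiLayerSSLFusion(nn.Module):
    def __init__(self, n_layers: int, dim: int, n_heads: int = 4):
        super().__init__()
        # Learned softmax weights mix the SSL encoder's hidden layers.
        self.layer_logits = nn.Parameter(torch.zeros(n_layers))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, ssl_layers: torch.Tensor, query: torch.Tensor):
        # ssl_layers: (n_layers, batch, time, dim); query: (batch, q, dim)
        w = torch.softmax(self.layer_logits, dim=0)
        mixed = torch.einsum("l,lbtd->btd", w, ssl_layers)
        # Task queries cross-attend over the mixed SSL feature sequence.
        fused, _ = self.attn(query, mixed, mixed)
        return fused
```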
arXiv Detail & Related papers (2025-10-05T09:32:12Z)
- BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM [12.349038994581415]
We introduce BusterX++, a novel framework for cross-modal detection and explanation of synthetic media. Our approach incorporates an advanced reinforcement learning (RL) post-training strategy that eliminates cold-start. We also present GenBuster++, a cross-modal benchmark leveraging state-of-the-art image and video generation techniques.
arXiv Detail & Related papers (2025-07-19T14:05:33Z)
- Consistency-aware Fake Videos Detection on Short Video Platforms [4.291448222735821]
This paper focuses on detecting fake news on short video platforms. Existing approaches typically combine raw video data with metadata inputs before applying a classification layer. Motivated by this insight, we propose a novel detection paradigm that explicitly identifies and leverages cross-modal contradictions.
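One plausible, purely hypothetical reading of "explicitly leveraging cross-modal contradictions" is to feed a video-vs-metadata disagreement score to the classifier alongside the embeddings; the head below is a placeholder, not the paper's architecture:

```python
# Hedged sketch: an explicit contradiction cue as a classifier input.
import torch
import torch.nn as nn

class ContradictionHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.cls = nn.Linear(dim * 2 + 1, 2)  # real / fake

    def forward(self, video_emb, meta_emb):
        # Low cosine similarity between modalities -> likely inconsistent.
        sim = nn.functional.cosine_similarity(video_emb, meta_emb, dim=-1)
        x = torch.cat([video_emb, meta_emb, sim.unsqueeze(-1)], dim=-1)
        return self.cls(x)
```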
arXiv Detail & Related papers (2025-04-30T10:26:04Z)
- Bridging Diversity and Uncertainty in Active learning with Self-Supervised Pre-Training [21.57943896942296]
This study addresses the integration of diversity-based and uncertainty-based sampling strategies in active learning. We introduce a straightforward method called TCM that mitigates the cold start problem while maintaining strong performance across various data levels.
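A hedged sketch of such a diversity-then-uncertainty schedule: early rounds pick cluster-typical points from self-supervised embeddings (cold start), later rounds switch to margin-based uncertainty sampling; the switch threshold and scoring details are assumptions, not TCM's exact recipe:

```python
# Hedged sketch: two-phase active-learning batch selection (illustrative).
import numpy as np
from sklearn.cluster import KMeans

def select_batch(embeddings, probs, labeled_mask, batch_size, switch_after=100):
    unlabeled = np.where(~labeled_mask)[0]
    if labeled_mask.sum() < switch_after:
        # Cold start: cluster the unlabeled pool and take the point closest
        # to each centroid (a cheap stand-in for "typical" examples).
        km = KMeans(n_clusters=batch_size, n_init=10).fit(embeddings[unlabeled])
        picks = []
        for c in km.cluster_centers_:
            d = np.linalg.norm(embeddings[unlabeled] - c, axis=1)
            picks.append(unlabeled[np.argmin(d)])
        return np.unique(picks)
    # Later rounds: margin sampling (smallest gap between top-2 class probs).
    top2 = np.sort(probs[unlabeled], axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    return unlabeled[np.argsort(margin)[:batch_size]]
```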
arXiv Detail & Related papers (2024-03-06T14:18:24Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
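The IMQ description reads like learnable queries cross-attending over per-modality tokens to pool global context. A minimal sketch under that assumption; the class name and dimensions are illustrative, not the paper's design:

```python
# Hedged sketch: learnable-query aggregation of per-modality tokens.
import torch
import torch.nn as nn

class ImplicitQueryPool(nn.Module):
    def __init__(self, n_queries=8, dim=256, n_heads=8):
        super().__init__()
        # Learned query vectors shared across the batch.
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tokens):  # tokens: (batch, seq, dim)
        q = self.queries.expand(tokens.size(0), -1, -1)
        # Queries adaptively aggregate global contextual cues per modality.
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled  # (batch, n_queries, dim)
```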
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection [37.99031842449251]
Video anomaly detection under weak supervision presents significant challenges.
We present a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability.
Our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy.
arXiv Detail & Related papers (2023-06-26T06:45:16Z)
- Learning Transferable Adversarial Robust Representations via Multi-view Consistency [57.73073964318167]
We propose a novel meta-adversarial multi-view representation learning framework with dual encoders.
We demonstrate the effectiveness of our framework on few-shot learning tasks from unseen domains.
arXiv Detail & Related papers (2022-10-19T11:48:01Z)
- Robust Latent Representations via Cross-Modal Translation and Alignment [36.67937514793215]
Most multi-modal machine learning methods require that all the modalities used for training are also available for testing.
To address this limitation, we aim to improve the testing performance of uni-modal systems using multiple modalities during training only.
The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment.
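A compact sketch of the described setup, assuming simple linear encoders: the available modality's latent is decoded toward the other modality (cross-modal translation) and aligned to it with a correlation-style loss, so at test time only the uni-modal encoder is needed. The layer sizes and the cosine stand-in for a CCA-style objective are assumptions:

```python
# Hedged sketch: cross-modal translation + correlation-based alignment.
import torch
import torch.nn as nn

class TranslateAlign(nn.Module):
    def __init__(self, dim_a=64, dim_b=40, latent=32):
        super().__init__()
        self.enc_a = nn.Linear(dim_a, latent)   # modality available at test
        self.enc_b = nn.Linear(dim_b, latent)   # modality used in training only
        self.dec_ab = nn.Linear(latent, dim_b)  # translate A -> B

    def forward(self, a, b):
        za, zb = self.enc_a(a), self.enc_b(b)
        # Translation term: reconstruct modality B from modality A's latent.
        trans_loss = nn.functional.mse_loss(self.dec_ab(za), b)
        # Alignment term: cosine agreement of batch-centered latents,
        # a lightweight stand-in for correlation-based alignment.
        za_c = za - za.mean(0)
        zb_c = zb - zb.mean(0)
        align_loss = 1 - nn.functional.cosine_similarity(za_c, zb_c, dim=-1).mean()
        return trans_loss + align_loss
```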
arXiv Detail & Related papers (2020-11-03T11:18:04Z)