Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization
- URL: http://arxiv.org/abs/2506.08493v1
- Date: Tue, 10 Jun 2025 06:40:43 GMT
- Title: Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization
- Authors: Qilin Yin, Wei Lu, Xiangyang Luo, Xiaochun Cao
- Abstract summary: We propose a universal context-aware contrastive learning framework (UniCaCLF) for temporal forgery localization. Our approach leverages supervised contrastive learning to discover and identify forged instants by means of anomaly detection. An efficient context-aware contrastive coding is introduced to further push the limit of instant feature distinguishability between genuine and forged instants.
- Score: 60.73623588349311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most research efforts in the multimedia forensics domain have focused on detecting forged audio-visual content and have achieved solid results. However, these works treat deepfake detection only as a classification task and ignore the case where only partial segments of a video are tampered with. Temporal forgery localization (TFL) of small fake audio-visual clips embedded in real videos remains challenging and is more in line with realistic application scenarios. To resolve this issue, we propose a universal context-aware contrastive learning framework (UniCaCLF) for TFL. Our approach leverages supervised contrastive learning to discover and identify forged instants by means of anomaly detection, allowing precise localization of temporally forged segments. To this end, we propose a novel context-aware perception layer that utilizes a heterogeneous activation operation and an adaptive context updater to construct a context-aware contrastive objective, which enhances the discriminability of forged instant features by contrasting them with genuine instant features in terms of their distances to the global context. An efficient context-aware contrastive coding is introduced to further push the limit of instant feature distinguishability between genuine and forged instants in a supervised, sample-by-sample manner, suppressing cross-sample influence to improve temporal forgery localization performance. Extensive experimental results on five public datasets demonstrate that the proposed UniCaCLF significantly outperforms state-of-the-art competing algorithms.
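The central mechanism described in the abstract, contrasting each instant's feature with a global context so that forged instants stand out as anomalies, can be illustrated with a minimal sketch. The choices below (mean pooling over genuine instants as the global context, a Euclidean margin loss, toy feature shapes) are assumptions for illustration only, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def context_contrastive_loss(feats: torch.Tensor,
                             labels: torch.Tensor,
                             margin: float = 1.0) -> torch.Tensor:
    """feats: (T, D) per-instant features of one video.
    labels: (T,) with 1 for forged instants, 0 for genuine instants."""
    genuine = feats[labels == 0]
    if genuine.numel() == 0:  # degenerate clip with no genuine instants
        return feats.new_zeros(())
    # Global context: mean of genuine-instant features (assumed pooling choice).
    context = genuine.mean(dim=0, keepdim=True)      # (1, D)

    # Distance of every instant feature to the global context.
    dist = torch.norm(feats - context, dim=1)        # (T,)

    # Genuine instants are pulled toward the context; forged instants are
    # pushed beyond a margin, so they emerge as anomalies.
    genuine_term = (dist[labels == 0] ** 2).mean()
    forged_dist = dist[labels == 1]
    if forged_dist.numel() == 0:
        return genuine_term
    forged_term = (F.relu(margin - forged_dist) ** 2).mean()
    return genuine_term + forged_term


# Toy usage: 8 instants with 16-dim features, the last 3 instants forged.
feats = torch.randn(8, 16, requires_grad=True)
labels = torch.tensor([0, 0, 0, 0, 0, 1, 1, 1])
loss = context_contrastive_loss(feats, labels)
loss.backward()
```

Per-instant distances to the context can then be thresholded to localize forged segments at inference time, which matches the anomaly-detection framing of the abstract.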
Related papers
- Implicit Counterfactual Learning for Audio-Visual Segmentation [50.69377287012591]
We propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches. We introduce the multi-granularity implicit text (MIT), involving video-, segment- and frame-level, as the bridge to establish the modality-shared space.
arXiv Detail & Related papers (2025-07-28T11:46:35Z) - A Multimodal Deviation Perceiving Framework for Weakly-Supervised Temporal Forgery Localization [21.13433908232578]
We present a framework for weakly-supervised temporal forgery localization. It aims to identify temporally partial forged segments using only video-level annotations. Extensive experiments demonstrate the effectiveness of the proposed framework.
arXiv Detail & Related papers (2025-07-22T13:55:16Z) - Weakly-supervised Audio Temporal Forgery Localization via Progressive Audio-language Co-learning Network [17.91342898415867]
Existing ATFL methods rely on training efficient networks using fine-grained annotations. We propose a progressive audio-language co-learning network (LOCO) that adopts co-learning and self-supervision to promote localization performance. The proposed LOCO achieves SOTA performance on three public benchmarks.
arXiv Detail & Related papers (2025-05-03T17:57:57Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - Coarse-to-Fine Proposal Refinement Framework for Audio Temporal Forgery Detection and Localization [60.899082019130766]
We introduce a frame-level detection network (FDN) and a proposal refinement network (PRN) for audio temporal forgery detection and localization.
FDN aims to mine informative inconsistency cues between real and fake frames to obtain discriminative features that are beneficial for roughly indicating forgery regions.
PRN is responsible for predicting confidence scores and regression offsets to refine the coarse-grained proposals derived from the FDN (a rough sketch of this refinement step appears after this list).
arXiv Detail & Related papers (2024-07-23T15:07:52Z) - Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z) - Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization [67.88493779080882]
Video moment localization aims to retrieve the target segment of an untrimmed video according to the natural language query.
Recent works contrast the cross-modality similarities driven by reconstructing masked queries.
We propose a novel counterfactual cross-modality reasoning method.
arXiv Detail & Related papers (2023-08-10T15:45:45Z) - Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z)
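As referenced in the Coarse-to-Fine Proposal Refinement summary above, the refinement step can be illustrated with a rough sketch: predicted confidence scores filter coarse proposals while regression offsets adjust their boundaries. The (centre shift, log length scale) offset parameterisation and the 0.5 threshold are assumptions for illustration, not details taken from that paper.

```python
import torch


def refine_proposals(proposals: torch.Tensor,
                     offsets: torch.Tensor,
                     scores: torch.Tensor,
                     score_thresh: float = 0.5):
    """proposals: (N, 2) coarse [start, end] times in seconds.
    offsets:   (N, 2) predicted (centre shift, log length scale) per proposal.
    scores:    (N,) predicted confidence in [0, 1]."""
    centre = proposals.mean(dim=1) + offsets[:, 0]                 # shift centre
    length = (proposals[:, 1] - proposals[:, 0]) * torch.exp(offsets[:, 1])
    refined = torch.stack([centre - length / 2, centre + length / 2], dim=1)
    keep = scores > score_thresh                                   # confidence filter
    return refined[keep], scores[keep]


# Toy usage: three coarse proposals; only the confident ones are kept.
proposals = torch.tensor([[1.0, 2.0], [4.0, 6.0], [7.0, 7.5]])
offsets = torch.tensor([[0.1, 0.0], [-0.2, 0.1], [0.0, -0.1]])
scores = torch.tensor([0.9, 0.3, 0.7])
refined, kept_scores = refine_proposals(proposals, offsets, scores)
```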