Self-Supervised Video Forensics by Audio-Visual Anomaly Detection
- URL: http://arxiv.org/abs/2301.01767v2
- Date: Mon, 27 Mar 2023 18:53:32 GMT
- Title: Self-Supervised Video Forensics by Audio-Visual Anomaly Detection
- Authors: Chao Feng, Ziyang Chen, Andrew Owens
- Abstract summary: Manipulated videos often contain subtle inconsistencies between their visual and audio signals.
We propose a video forensics method, based on anomaly detection, that can identify these inconsistencies.
We train an autoregressive model to generate sequences of audio-visual features, using feature sets that capture the temporal synchronization between video frames and sound.
- Score: 19.842795378751923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Manipulated videos often contain subtle inconsistencies between their visual
and audio signals. We propose a video forensics method, based on anomaly
detection, that can identify these inconsistencies, and that can be trained
solely using real, unlabeled data. We train an autoregressive model to generate
sequences of audio-visual features, using feature sets that capture the
temporal synchronization between video frames and sound. At test time, we then
flag videos that the model assigns low probability. Despite being trained
entirely on real videos, our model obtains strong performance on the task of
detecting manipulated speech videos. Project site:
https://cfeng16.github.io/audio-visual-forensics
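The core idea, likelihood-based anomaly detection with an autoregressive model, can be sketched in miniature. The snippet below is an illustrative toy, not the paper's implementation: real audio-visual synchronization features are replaced by synthetic 1-D AR(1) tracks, and the learned autoregressive network by a fitted one-step Gaussian predictor. The names (`fit_model`, `is_manipulated`, the threshold value) are all hypothetical; only the recipe, train on real sequences, then flag test sequences the model assigns low average log-probability, mirrors the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_sequence(length, rng, a=0.9, noise=0.1):
    # Synthetic stand-in for a per-frame audio-visual sync feature track.
    x, out = 0.0, []
    for _ in range(length):
        x = a * x + rng.normal(0.0, noise)
        out.append(x)
    return np.array(out)

def fit_model(sequences):
    # Fit a one-step Gaussian predictor x_t ~ N(a*x_{t-1} + b, sigma^2)
    # on pooled "real" training sequences (toy autoregressive model).
    prev = np.concatenate([s[:-1] for s in sequences])
    curr = np.concatenate([s[1:] for s in sequences])
    a, b = np.polyfit(prev, curr, 1)
    sigma = (curr - (a * prev + b)).std() + 1e-8
    return a, b, sigma

def avg_log_likelihood(seq, model):
    # Average Gaussian log-density of each one-step prediction residual.
    a, b, sigma = model
    resid = seq[1:] - (a * seq[:-1] + b)
    return (-0.5 * np.log(2 * np.pi * sigma ** 2)
            - resid ** 2 / (2 * sigma ** 2)).mean()

def is_manipulated(seq, model, threshold=0.0):
    # Low probability under the model of real videos -> flag as manipulated.
    return avg_log_likelihood(seq, model) < threshold

# Train only on "real" sequences, as in the paper's setting.
train = [ar1_sequence(300, rng) for _ in range(20)]
model = fit_model(train)

real_clip = ar1_sequence(200, rng)
fake_clip = rng.permutation(real_clip)  # shuffling destroys temporal coherence
```

Shuffling the frames of `fake_clip` plays the role of a manipulation that breaks audio-visual temporal structure: its one-step residuals are large, so its average log-likelihood falls below the (hypothetical) threshold while the real clip's stays above it.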
Related papers
- AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency
for Video Deepfake Detection [32.502184301996216]
Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content.
Previous methods mainly adopt uni-modal video forensics and use supervised pre-training for forgery detection.
This study proposes a new method based on a multi-modal self-supervised-learning (SSL) feature extractor.
arXiv Detail & Related papers (2023-11-05T18:35:03Z)
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting
Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored setup, yet one that is far cheaper than fully supervised training.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Anomalous Sound Detection using Audio Representation with Machine ID
based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Anomaly detection in surveillance videos using transformer based
attention model [3.2968779106235586]
This research suggests using a weakly supervised strategy to avoid annotating anomalous segments in training videos.
The proposed framework is validated on a real-world dataset, the ShanghaiTech Campus dataset.
arXiv Detail & Related papers (2022-06-03T12:19:39Z)
- Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- Generating Visually Aligned Sound from Videos [83.89485254543888]
We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Some sounds, produced outside of the camera's view, cannot be inferred from the video content.
arXiv Detail & Related papers (2020-07-14T07:51:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.