Human Detection of Political Speech Deepfakes across Transcripts, Audio,
and Video
- URL: http://arxiv.org/abs/2202.12883v4
- Date: Mon, 15 Jan 2024 22:14:36 GMT
- Title: Human Detection of Political Speech Deepfakes across Transcripts, Audio,
and Video
- Authors: Matthew Groh, Aruna Sankaranarayanan, Nikhil Singh, Dong Young Kim,
Andrew Lippman, Rosalind Picard
- Abstract summary: Recent advances in technology for hyper-realistic visual and audio effects provoke the concern that deepfake videos of political speeches will soon be indistinguishable from authentic video recordings.
We conduct 5 pre-registered randomized experiments with 2,215 participants to evaluate how accurately humans distinguish real political speeches from fabrications.
We find that base rates of misinformation minimally influence discernment, and that deepfakes with audio produced by state-of-the-art text-to-speech algorithms are harder to discern than the same deepfakes with voice actor audio.
- Score: 4.78385214366452
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in technology for hyper-realistic visual and audio effects
provoke the concern that deepfake videos of political speeches will soon be
indistinguishable from authentic video recordings. The conventional wisdom in
communication theory predicts people will fall for fake news more often when
the same version of a story is presented as a video versus text. We conduct 5
pre-registered randomized experiments with 2,215 participants to evaluate how
accurately humans distinguish real political speeches from fabrications across
base rates of misinformation, audio sources, question framings, and media
modalities. We find that base rates of misinformation minimally influence
discernment and that deepfakes with audio produced by state-of-the-art
text-to-speech algorithms are harder to discern than the same deepfakes with
voice actor audio. Moreover, across all experiments, we find that audio and
visual information enables more accurate discernment than text alone: human
discernment relies more on how something is said, the audio-visual cues, than
on what is said, the speech content.
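As a rough illustration of the kind of comparison the abstract describes, the sketch below computes discernment accuracy separately for transcript, audio, and video conditions from a toy response table; the column names and data are hypothetical placeholders, not the authors' materials or pre-registered analysis code.

```python
# Hypothetical sketch: discernment accuracy by media modality.
# All data and column names are illustrative assumptions.
import pandas as pd

responses = pd.DataFrame({
    "modality": ["text", "text", "audio", "audio", "video", "video"],
    "is_fake":  [True, False, True, False, True, False],     # ground truth
    "judged_fake": [False, False, True, False, True, False], # participant response
})

responses["correct"] = responses["is_fake"] == responses["judged_fake"]

# Discernment here is simply the share of correct real/fake judgments per
# modality; the paper's actual analyses are more involved.
print(responses.groupby("modality")["correct"].mean())
```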
Related papers
- SafeEar: Content Privacy-Preserving Audio Deepfake Detection [17.859275594843965]
We propose SafeEar, a novel framework that aims to detect deepfake audios without relying on accessing the speech content within.
Our key idea is to devise a neural audio codec as a novel decoupling model that cleanly separates the semantic and acoustic information in audio samples.
In this way, no semantic content will be exposed to the detector.
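A minimal sketch of this decoupling idea, assuming a toy encoder with separate semantic and acoustic branches (not SafeEar's actual architecture): only the acoustic branch ever reaches the real/fake classifier, so the detector never sees speech content.

```python
# Toy PyTorch sketch of content-free deepfake detection via decoupling.
# Module names and sizes are illustrative assumptions, not SafeEar's design.
import torch
import torch.nn as nn

class DecoupledDetector(nn.Module):
    def __init__(self, n_feat=80, d_model=128):
        super().__init__()
        self.semantic_enc = nn.GRU(n_feat, d_model, batch_first=True)  # content branch (never fed to the detector)
        self.acoustic_enc = nn.GRU(n_feat, d_model, batch_first=True)  # timbre/prosody-like cues
        self.classifier = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats):                  # feats: (batch, frames, n_feat)
        _, acoustic = self.acoustic_enc(feats)
        # Only the acoustic summary reaches the real/fake head, so the
        # classifier cannot reconstruct what was said.
        return self.classifier(acoustic[-1])   # real/fake logit

logit = DecoupledDetector()(torch.randn(2, 200, 80))  # two dummy utterances
print(logit.shape)                                    # torch.Size([2, 1])
```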
arXiv Detail & Related papers (2024-09-14T02:45:09Z)
- Unmasking Illusions: Understanding Human Perception of Audiovisual Deepfakes [49.81915942821647]
This paper aims to evaluate the human ability to discern deepfake videos through a subjective study.
We present our findings by comparing human observers to five state-of-the-art audiovisual deepfake detection models.
We found that all AI models performed better than humans when evaluated on the same 40 videos.
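A hedged sketch of such a human-versus-model comparison on a shared set of 40 labeled videos; the labels and predictions below are random placeholders, not the study's data.

```python
# Illustrative sketch: accuracy of human observers vs. several detection models
# on the same 40 videos. Random placeholder data only.
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=40)              # 1 = deepfake, 0 = real

human_votes = rng.integers(0, 2, size=(100, 40))  # 100 hypothetical observers
human_acc = (human_votes == labels).mean()

model_preds = {f"model_{i}": rng.integers(0, 2, size=40) for i in range(5)}
model_acc = {name: float((pred == labels).mean()) for name, pred in model_preds.items()}

print(f"mean human accuracy: {human_acc:.2f}")
print(model_acc)
```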
arXiv Detail & Related papers (2024-05-07T07:57:15Z)
- Human Brain Exhibits Distinct Patterns When Listening to Fake Versus Real Audio: Preliminary Evidence [10.773283625658513]
In this paper we study the variations in human brain activity when listening to real and fake audio.
Preliminary results suggest that the representations learned by a state-of-the-art deepfake audio detection algorithm do not exhibit clearly distinct patterns between real and fake audio.
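One common way to probe such a claim is to fit a simple linear classifier on the learned embeddings and check whether it beats chance; the sketch below uses random placeholder embeddings and is not tied to any particular detector.

```python
# Hedged sketch: linear probe on embedding vectors to test real/fake separability.
# Embeddings and labels are random placeholders for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))   # one 64-d vector per audio clip
labels = rng.integers(0, 2, size=200)     # 1 = fake, 0 = real

probe_acc = cross_val_score(LogisticRegression(max_iter=1000),
                            embeddings, labels, cv=5).mean()
# Accuracy near 0.5 would suggest the embeddings carry no clear real/fake signal.
print(f"linear-probe accuracy: {probe_acc:.2f}")
```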
arXiv Detail & Related papers (2024-02-22T21:44:58Z)
- Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
The state of the art in talking face generation focuses mainly on lip-syncing, conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking face videos with human-like articulation and well-synced audiovisual streams.
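A toy sketch of a joint audiovisual feature space, assuming a shared text-conditioned latent from which separate heads predict acoustic and facial-motion features; the module choices and dimensions are illustrative, not NEUTART's design.

```python
# Toy sketch: text tokens -> shared audiovisual latent -> audio and face heads.
# All modules and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TextToAudiovisual(nn.Module):
    def __init__(self, vocab=64, d_model=128, n_mels=80, n_face=52):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.joint_enc = nn.GRU(d_model, d_model, batch_first=True)  # shared AV latent
        self.audio_head = nn.Linear(d_model, n_mels)   # per-step mel-like features
        self.face_head = nn.Linear(d_model, n_face)    # per-step blendshape-like features

    def forward(self, tokens):                         # tokens: (batch, steps)
        h, _ = self.joint_enc(self.embed(tokens))
        # Both heads read the same latent, which keeps audio and face streams in sync.
        return self.audio_head(h), self.face_head(h)

audio, face = TextToAudiovisual()(torch.randint(0, 64, (1, 50)))
print(audio.shape, face.shape)   # (1, 50, 80) (1, 50, 52)
```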
arXiv Detail & Related papers (2023-12-11T18:41:55Z)
- Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
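A simplified sketch of the person-of-interest idea: compare a test clip's embedding against reference embeddings of the claimed identity and flag low similarity. The embeddings and threshold below are placeholders; the paper learns the embedding space with contrastive training.

```python
# Illustrative sketch: identity-reference comparison for deepfake scoring.
# Embeddings and the threshold are placeholder assumptions.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
reference = rng.normal(size=(10, 256))   # embeddings of verified real clips of the person
test_clip = rng.normal(size=256)         # embedding of the clip under scrutiny

similarity = max(cosine(test_clip, r) for r in reference)
THRESHOLD = 0.3                          # assumed operating point
print("suspected deepfake" if similarity < THRESHOLD else "consistent with identity")
```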
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
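A toy sketch of this re-synthesis idea, assuming fused audio-visual features are mapped to discrete codes of a speech codec that a pretrained decoder would then render as clean speech; shapes, codebook size, and modules are illustrative assumptions, not the paper's model.

```python
# Toy sketch: noisy audio features + lip-video features -> predicted codec codes.
# A codec decoder (not shown) would synthesize clean speech from these codes.
import torch
import torch.nn as nn

class AVCodePredictor(nn.Module):
    def __init__(self, n_audio=80, n_video=512, d_model=128, codebook=1024):
        super().__init__()
        self.fuse = nn.Linear(n_audio + n_video, d_model)
        self.temporal = nn.GRU(d_model, d_model, batch_first=True)
        self.code_head = nn.Linear(d_model, codebook)   # logits over codebook entries

    def forward(self, audio_feats, video_feats):        # both: (batch, frames, dim)
        fused = torch.relu(self.fuse(torch.cat([audio_feats, video_feats], dim=-1)))
        h, _ = self.temporal(fused)
        return self.code_head(h).argmax(-1)             # predicted code index per frame

codes = AVCodePredictor()(torch.randn(1, 100, 80), torch.randn(1, 100, 512))
print(codes.shape)   # (1, 100)
```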
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Watch Those Words: Video Falsification Detection Using Word-Conditioned Facial Motion [82.06128362686445]
We propose a multi-modal semantic forensic approach to handle both cheapfakes and visually persuasive deepfakes.
We leverage the idea of attribution to learn person-specific biometric patterns that distinguish a given speaker from others.
Unlike existing person-specific approaches, our method is also effective against attacks that focus on lip manipulation.
arXiv Detail & Related papers (2021-12-21T01:57:04Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
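A hedged sketch in the spirit of this setup: predict a time-frequency mask for the reverberant spectrogram conditioned on a visual embedding of the room. All shapes and modules below are assumptions for illustration, not the actual VIDA architecture.

```python
# Toy sketch: audio-visual dereverberation via a visually conditioned mask.
# Shapes and modules are illustrative assumptions.
import torch
import torch.nn as nn

class AVDereverb(nn.Module):
    def __init__(self, n_freq=257, n_visual=512, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq, d_model)
        self.visual_proj = nn.Linear(n_visual, d_model)
        self.mask_head = nn.Sequential(nn.Linear(d_model, n_freq), nn.Sigmoid())

    def forward(self, spec, visual):                 # spec: (B, T, F); visual: (B, n_visual)
        h = self.audio_proj(spec) + self.visual_proj(visual).unsqueeze(1)
        return spec * self.mask_head(torch.relu(h))  # masked (dereverberated) magnitudes

clean_est = AVDereverb()(torch.rand(1, 300, 257), torch.randn(1, 512))
print(clean_est.shape)   # (1, 300, 257)
```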
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
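Viseme error rate is typically computed as an edit distance between predicted and reference viseme sequences, normalized by the reference length; the generic sketch below shows that metric and is not the paper's evaluation code.

```python
# Generic viseme error rate: Levenshtein edit distance (substitutions,
# insertions, deletions) normalized by reference length.
def viseme_error_rate(pred, ref):
    d = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i in range(len(pred) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(pred) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(pred)][len(ref)] / max(len(ref), 1)

print(viseme_error_rate(["p", "a", "t"], ["p", "a", "t", "a"]))  # 0.25
```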
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
- Vocoder-Based Speech Synthesis from Silent Videos [28.94460283719776]
We present a way to synthesise speech from the silent video of a talker using deep learning.
The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm.
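A minimal sketch of such a pipeline, assuming pooled per-frame video features are mapped to mel-like acoustic frames that a separate vocoder would convert to a waveform; shapes and module choices are illustrative, not the paper's implementation.

```python
# Toy sketch: silent-video features -> acoustic features; a vocoder (not shown)
# would reconstruct the waveform. Shapes are illustrative assumptions.
import torch
import torch.nn as nn

class VideoToAcoustic(nn.Module):
    def __init__(self, frame_dim=2048, d_model=256, n_mels=80):
        super().__init__()
        self.temporal = nn.GRU(frame_dim, d_model, batch_first=True)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, frames):          # frames: (batch, T, frame_dim) pooled CNN features
        h, _ = self.temporal(frames)
        return self.out(h)              # (batch, T, n_mels) acoustic features

mels = VideoToAcoustic()(torch.randn(1, 75, 2048))  # ~3 s of 25 fps video features
# A vocoder (e.g., Griffin-Lim or a neural vocoder) would turn `mels` into audio.
print(mels.shape)
```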
arXiv Detail & Related papers (2020-04-06T10:22:04Z)