Deep Learning and Synthetic Media
- URL: http://arxiv.org/abs/2205.05764v1
- Date: Wed, 11 May 2022 20:28:09 GMT
- Title: Deep Learning and Synthetic Media
- Authors: Raphaël Millière
- Abstract summary: I argue that "deepfakes" and related synthetic media produced with such pipelines do not merely offer incremental improvements over previous methods, but pave the way for genuinely novel kinds of audiovisual media.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning algorithms are rapidly changing the way in which audiovisual
media can be produced. Synthetic audiovisual media generated with deep learning
- often subsumed colloquially under the label "deepfakes" - have a number of
impressive characteristics; they are increasingly trivial to produce, and can
be indistinguishable from real sounds and images recorded with a sensor. Much
attention has been dedicated to ethical concerns raised by this technological
development. Here, I focus instead on a set of issues related to the notion of
synthetic audiovisual media, its place within a broader taxonomy of audiovisual
media, and how deep learning techniques differ from more traditional approaches
to media synthesis. After reviewing important etiological features of deep
learning pipelines for media manipulation and generation, I argue that
"deepfakes" and related synthetic media produced with such pipelines do not
merely offer incremental improvements over previous methods, but challenge
traditional taxonomical distinctions, and pave the way for genuinely novel
kinds of audiovisual media.
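As a concrete illustration of what such pipelines look like in practice, here is a minimal sketch (in PyTorch) of the shared-encoder, per-identity-decoder pattern popularized by early face-swap tools. It is a toy illustration of the general etiology the paper discusses, not the method of any specific system; the model sizes, input resolution, and training setup are placeholder assumptions.

```python
# Minimal sketch of the encoder-decoder pattern behind many face-swap
# "deepfake" pipelines: one shared encoder, one decoder per identity.
# Illustrative only -- sizes and data are placeholders, not any paper's method.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),   # 32 -> 64
            nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 64, 16, 16))

# Training reconstructs each identity's faces through its own decoder;
# a "swap" decodes identity A's latent code with identity B's decoder.
encoder = Encoder()
decoder_a, decoder_b = Decoder(), Decoder()
face_a = torch.rand(1, 3, 64, 64)        # placeholder image batch
swapped = decoder_b(encoder(face_a))     # A's pose/expression, B's appearance
```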
Related papers
- ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. The approach decomposes the process into three complementary stages: semantically coherent audio generation, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z) - Learning to Highlight Audio by Watching Movies [37.9846964966927]
We introduce visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video. To train our model, we also introduce a new dataset, the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation.
arXiv Detail & Related papers (2025-05-17T22:03:57Z) - Re-calibrating methodologies in social media research: Challenge the visual, work with Speech [0.0]
This article reflects on how social media scholars can effectively engage with speech-based data in their analyses.
I conclude that the expansion of our methodological repertoire enables richer interpretations of platformised content.
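For readers wondering what working with speech-based data can look like in practice, here is a minimal sketch using the open-source Whisper ASR model to turn platform audio into analysable text. The toolchain and file name are illustrative assumptions on my part, not the article's prescription.

```python
# Minimal sketch: turning platform audio into analysable text with the
# open-source Whisper ASR model (pip install openai-whisper).
# The file path is a placeholder.
import whisper

model = whisper.load_model("base")             # small multilingual checkpoint
result = model.transcribe("platform_clip.mp3")
print(result["text"])                          # full transcript for coding
for seg in result["segments"]:                 # timestamps support close reading
    print(f'{seg["start"]:6.1f}s  {seg["text"]}')
```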
arXiv Detail & Related papers (2024-12-17T18:47:57Z) - Understanding Audiovisual Deepfake Detection: Techniques, Challenges, Human Factors and Perceptual Insights [49.81915942821647]
Deep Learning has been successfully applied in diverse fields, and its impact on deepfake detection is no exception.
Deepfakes are fake yet realistic synthetic content that can be used deceitfully for political impersonation, phishing, slandering, or spreading misinformation.
This paper aims to improve the effectiveness of deepfake detection strategies and guide future research in cybersecurity and media integrity.
arXiv Detail & Related papers (2024-11-12T09:02:11Z) - Video-to-Audio Generation with Hidden Alignment [27.11625918406991]
We offer insights into the video-to-audio generation paradigm, focusing on vision encoders, auxiliary embeddings, and data augmentation techniques.
We demonstrate our model exhibits state-of-the-art video-to-audio generation capabilities.
arXiv Detail & Related papers (2024-07-10T08:40:39Z) - A Survey of Deep Learning Audio Generation Methods [0.0]
This article presents a review of typical techniques used in three distinct aspects of deep learning model development for audio generation.
In the first part, we provide an explanation of audio representations, beginning with the fundamental audio waveform.
We then progress to the frequency domain, with an emphasis on the attributes of human hearing, and finally introduce a relatively recent development.
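The progression the survey describes, from raw waveform to a frequency-domain representation weighted toward the attributes of human hearing, can be made concrete with a mel spectrogram. A minimal sketch using librosa follows; the audio file name and parameter values are placeholders.

```python
# From raw waveform to a perceptually weighted frequency representation:
# a log-mel spectrogram, the standard "human hearing"-oriented audio feature.
# File name and parameters are illustrative placeholders.
import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=22050)               # 1. raw waveform
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))  # 2. frequency domain
mel = librosa.feature.melspectrogram(                       # 3. mel scale mimics
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80       #    human pitch perception
)
log_mel = librosa.power_to_db(mel)                          # log scale mimics loudness
print(stft.shape, log_mel.shape)                            # (513, T), (80, T)
```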
arXiv Detail & Related papers (2024-05-31T19:20:27Z) - As Good As A Coin Toss: Human detection of AI-generated images, videos, audio, and audiovisual stimuli [0.0]
The principal defense against being misled by synthetic media relies on the ability of the human observer to visually and auditorily discern between real and fake.
We conducted a perceptual study with 1276 participants to assess how accurately people could distinguish synthetic images, audio-only, video-only, and audiovisual stimuli from authentic ones.
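To make "as good as a coin toss" concrete, a standard analysis is a binomial test of detection accuracy against 50% chance. The sketch below uses invented counts purely for illustration; they are not the study's results.

```python
# Hypothetical check of whether detection accuracy beats a coin toss.
# The counts are invented for illustration -- NOT the paper's data.
from scipy.stats import binomtest

n_trials, n_correct = 1000, 515              # placeholder numbers
result = binomtest(n_correct, n_trials, p=0.5, alternative="greater")
print(f"accuracy = {n_correct / n_trials:.1%}, p = {result.pvalue:.3f}")
# A large p-value means performance is statistically indistinguishable
# from guessing.
```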
arXiv Detail & Related papers (2024-03-25T13:39:33Z) - Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z) - NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection [50.33525966541906]
Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos.
We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics.
Our model can be easily adapted to the downstream Deepfake datasets with fine-tuning.
arXiv Detail & Related papers (2023-06-12T06:06:05Z) - Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
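A generic sketch of the visually guided mono-to-binaural template follows: predict left/right masks for a mono spectrogram, conditioned on a visual feature vector. This is the common baseline pattern, not the paper's geometry-aware multi-task architecture; all dimensions are assumptions.

```python
# Generic visually guided mono-to-binaural sketch: left/right ratio masks
# for a mono spectrogram, conditioned on a visual feature vector.
# Common template only -- not the paper's geometry-aware architecture.
import torch
import torch.nn as nn

class Mono2Binaural(nn.Module):
    def __init__(self, n_freq: int = 257, visual_dim: int = 512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, n_freq)
        self.mask_net = nn.Sequential(
            nn.Conv1d(2 * n_freq, 512, 3, padding=1),  # mono spec + visual cue
            nn.ReLU(),
            nn.Conv1d(512, 2 * n_freq, 3, padding=1),  # left & right masks
            nn.Sigmoid(),
        )

    def forward(self, mono_spec, visual_feat):
        # mono_spec: (B, F, T) magnitude spectrogram; visual_feat: (B, D)
        cue = self.visual_proj(visual_feat).unsqueeze(-1)   # (B, F, 1)
        cue = cue.expand(-1, -1, mono_spec.shape[-1])       # (B, F, T)
        masks = self.mask_net(torch.cat([mono_spec, cue], dim=1))
        left, right = masks.chunk(2, dim=1)
        return left * mono_spec, right * mono_spec          # two channels

model = Mono2Binaural()
spec = torch.rand(1, 257, 100)      # placeholder mono spectrogram
feat = torch.rand(1, 512)           # placeholder visual feature vector
l, r = model(spec, feat)            # each (1, 257, 100)
```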
arXiv Detail & Related papers (2021-11-21T19:26:45Z) - Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead [88.17413955380262]
We introduce a novel architecture for early exiting based on the vision transformer architecture.
We show that our method works for both classification and regression problems.
We also introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis.
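The early-exit idea in its minimal form: attach a lightweight classifier head partway through the network and stop when its confidence clears a threshold. The sketch below is a generic illustration, not the paper's single-layer vision-transformer design.

```python
# Minimal early-exit sketch: a cheap intermediate head answers easy inputs,
# the full network handles the rest. Generic illustration only.
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    def __init__(self, num_classes: int = 10, threshold: float = 0.9):
        super().__init__()
        self.threshold = threshold
        self.stage1 = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
        self.exit1 = nn.Linear(256, num_classes)      # lightweight early head
        self.stage2 = nn.Sequential(nn.Linear(256, 256), nn.ReLU())
        self.final = nn.Linear(256, num_classes)

    def forward(self, x):
        h = self.stage1(x)
        early = self.exit1(h)
        conf = early.softmax(dim=-1).max(dim=-1).values
        if bool((conf > self.threshold).all()):       # confident: stop here
            return early, "early"
        return self.final(self.stage2(h)), "final"    # otherwise: full depth

model = EarlyExitNet()
logits, exit_taken = model(torch.rand(1, 784))
print(exit_taken)
```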
arXiv Detail & Related papers (2021-05-19T13:30:34Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance.
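The dominant deep-learning recipe covered by such overviews reduces to time-frequency masking: a network estimates a mask over the noisy spectrogram, which is applied before resynthesis. A minimal, untrained sketch follows; shapes and sizes are illustrative assumptions.

```python
# Deep-learning speech enhancement in one sketch: estimate a time-frequency
# mask over the noisy spectrogram and resynthesise the waveform.
# Untrained toy network -- shapes and sizes are illustrative only.
import torch
import torch.nn as nn

n_fft, hop = 512, 128
noisy = torch.rand(1, 16000)                       # placeholder 1 s of audio
window = torch.hann_window(n_fft)

spec = torch.stft(noisy, n_fft, hop_length=hop,
                  window=window, return_complex=True)    # (1, 257, T)
mag = spec.abs()

mask_net = nn.Sequential(                          # predicts a mask per bin
    nn.Linear(257, 256), nn.ReLU(),
    nn.Linear(256, 257), nn.Sigmoid(),
)
mask = mask_net(mag.transpose(1, 2)).transpose(1, 2)     # (1, 257, T) in [0,1]

enhanced_spec = mask * spec                        # suppress noisy bins
enhanced = torch.istft(enhanced_spec, n_fft, hop_length=hop, window=window)
print(enhanced.shape)                              # back to waveform samples
```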
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.