EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos
- URL: http://arxiv.org/abs/2407.20592v1
- Date: Tue, 30 Jul 2024 06:57:00 GMT
- Title: EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos
- Authors: Aashish Rai, Srinath Sridhar
- Abstract summary: We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos.
Generating audio for silent egocentric videos could open new applications in virtual reality, assistive technologies, or for augmenting existing datasets.
- Score: 3.6078215038168473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos. Generating audio for silent egocentric videos could open new applications in virtual reality, assistive technologies, or for augmenting existing datasets. Existing work has been limited to domains like speech, music, or impact sounds and cannot easily capture the broad range of audio frequencies found in egocentric videos. EgoSonics addresses these limitations by building on the strength of latent diffusion models for conditioned audio synthesis. We first encode and process audio and video data into a form that is suitable for generation. The encoded data is used to train our model to generate audio tracks that capture the semantics of the input video. Our proposed SyncroNet builds on top of ControlNet to provide control signals that enable temporal synchronization of the synthesized audio. Extensive evaluations show that our model outperforms existing work in audio quality and in our newly proposed synchronization evaluation method. Furthermore, we demonstrate downstream applications of our model in improving video summarization.
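The abstract does not detail SyncroNet's internals, but the ControlNet mechanism it builds on can be sketched: a trainable control branch injects video-derived features into the audio (spectrogram) denoising U-Net through zero-initialized convolutions, so conditioning is learned without disturbing the pretrained backbone. The PyTorch module below is a minimal, hypothetical sketch of that pattern; the shapes, the per-frame linear projection, and the module names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ZeroConv(nn.Module):
    """1x1 convolution initialized to zero (the ControlNet trick): at the start
    of training the control branch contributes nothing, so the pretrained
    diffusion backbone is left undisturbed while conditioning is learned."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)


class VideoControlBranch(nn.Module):
    """Injects per-frame video features into one block of a spectrogram
    denoising U-Net as an additive residual (hypothetical shapes; not the
    paper's SyncroNet)."""

    def __init__(self, video_dim: int, hidden_channels: int):
        super().__init__()
        self.proj = nn.Linear(video_dim, hidden_channels)
        self.zero_conv = ZeroConv(hidden_channels)

    def forward(self, unet_hidden: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # unet_hidden: (B, C, F, T) latent spectrogram features
        # video_feats: (B, T, D) video embeddings, assumed resampled so the
        #              video frame axis matches the spectrogram time axis T
        ctrl = self.proj(video_feats)                          # (B, T, C)
        ctrl = ctrl.permute(0, 2, 1).unsqueeze(2)              # (B, C, 1, T)
        ctrl = ctrl.expand(-1, -1, unet_hidden.shape[2], -1)   # (B, C, F, T)
        return unet_hidden + self.zero_conv(ctrl)
```

In such a setup only the control branch would be trained while the diffusion backbone stays frozen, which is the property that makes ControlNet-style conditioning practical with limited paired data.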
Related papers
- Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound [6.638504164134713]
Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically.
Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges.
We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as a temporal event condition with semantic timbre prompts.
arXiv Detail & Related papers (2024-08-21T18:06:15Z)
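The summary names frame-wise RMS as Video-Foley's temporal event condition but not its computation; the NumPy snippet below is one common way to obtain such an envelope (frame and hop lengths are arbitrary defaults, not values from the paper).

```python
import numpy as np


def rms_envelope(audio: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Frame-wise root-mean-square energy of a mono waveform.

    A coarse loudness curve like this can serve as a temporal event
    condition for sound generation.
    """
    n_frames = 1 + max(0, len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return np.sqrt(np.mean(frames ** 2, axis=1))
```

The resulting loudness curve can then be time-aligned with the video frames and supplied to the sound generator as the synchronization signal.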
- Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
We propose a novel video-and-text-to-sound generation method called ReWaS.
Our method estimates the structural information of audio from the video while receiving key content cues from a user prompt.
Separating the generative components of audio yields a more flexible system that lets users freely adjust the energy, surrounding environment, and primary sound source according to their preferences.
arXiv Detail & Related papers (2024-07-08T01:59:17Z)
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos [87.32349247938136]
Existing approaches implicitly assume total correspondence between the video and audio during training.
We propose a novel ambient-aware audio generation model, AV-LDM.
Our approach is the first to focus video-to-audio generation faithfully on the observed visual content.
arXiv Detail & Related papers (2024-06-13T16:10:19Z)
- Synthesizing Audio from Silent Video using Sequence to Sequence Modeling [0.0]
We propose a novel method to generate audio from video using a sequence-to-sequence model.
Our approach employs a 3D Vector Quantized Variational Autoencoder (VQ-VAE) to capture the video's spatial and temporal structures.
Our model aims to enhance applications like CCTV footage analysis, silent movie restoration, and video generation models.
arXiv Detail & Related papers (2024-04-25T22:19:42Z)
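The entry above mentions a 3D VQ-VAE for capturing the video's spatial and temporal structure. The core of any VQ-VAE bottleneck is a nearest-neighbour codebook lookup with a straight-through gradient; the function below is a generic sketch of that step (names and shapes are illustrative, not taken from the paper).

```python
import torch


def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbour codebook lookup, the core of a VQ-VAE bottleneck.

    z:        (N, D) continuous latents (e.g. flattened 3D CNN features)
    codebook: (K, D) learned embedding vectors
    Returns the quantized latents and the chosen codebook indices.
    """
    # Euclidean distance from every latent to every codebook entry.
    dists = torch.cdist(z, codebook)           # (N, K)
    idx = dists.argmin(dim=1)                  # (N,)
    z_q = codebook[idx]                        # (N, D)
    # Straight-through estimator so gradients flow back to the encoder.
    z_q = z + (z_q - z).detach()
    return z_q, idx
```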
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of content-creation processes in the music and film industry.
We hypothesize that focusing on these aspects of audio generation can improve performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
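For context, the standard Direct Preference Optimization objective over such winner/loser pairs can be written as the short function below; Tango 2 applies this idea to a diffusion model, where the log-likelihood terms are typically replaced by denoising-loss surrogates, so treat this as an illustrative sketch rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_win: torch.Tensor,
             policy_logp_lose: torch.Tensor,
             ref_logp_win: torch.Tensor,
             ref_logp_lose: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on (winner, loser) pairs.

    Each argument is the log-likelihood of the corresponding audio sample
    under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how much more the policy prefers each sample than the reference does.
    win_margin = policy_logp_win - ref_logp_win
    lose_margin = policy_logp_lose - ref_logp_lose
    # Push the winner's implicit reward above the loser's.
    return -F.logsigmoid(beta * (win_margin - lose_margin)).mean()
```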
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models [25.966328901566815]
We propose HiddenSinger, a high-quality singing voice synthesis system using a neural audio codec and latent diffusion models.
In addition, we extend the model to an unsupervised singing voice learning framework, HiddenSinger-U.
Experimental results demonstrate that our model outperforms previous models in terms of audio quality.
arXiv Detail & Related papers (2023-06-12T01:21:41Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)