Synthesizing Audio from Silent Video using Sequence to Sequence Modeling
- URL: http://arxiv.org/abs/2404.17608v1
- Date: Thu, 25 Apr 2024 22:19:42 GMT
- Title: Synthesizing Audio from Silent Video using Sequence to Sequence Modeling
- Authors: Hugo Garrido-Lestache Belinchon, Helina Mulugeta, Adam Haile,
- Abstract summary: We propose a novel method to generate audio from video using a sequence-to-sequence model.
Our approach employs a 3D Vector Quantized Variational Autoencoder (VQ-VAE) to capture the video's spatial and temporal structures.
Our model aims to enhance applications like CCTV footage analysis, silent movie restoration, and video generation models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating audio from a video's visual context has multiple practical applications in improving how we interact with audio-visual media - for example, enhancing CCTV footage analysis, restoring historical videos (e.g., silent movies), and improving video generation models. We propose a novel method to generate audio from video using a sequence-to-sequence model, improving on prior work that used CNNs and WaveNet and faced sound diversity and generalization challenges. Our approach employs a 3D Vector Quantized Variational Autoencoder (VQ-VAE) to capture the video's spatial and temporal structures, decoding with a custom audio decoder for a broader range of sounds. Trained on the Youtube8M dataset segment, focusing on specific domains, our model aims to enhance applications like CCTV footage analysis, silent movie restoration, and video generation models.
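No code accompanies this listing; as a minimal sketch of the vector-quantization step at the core of a VQ-VAE (the codebook size, latent dimensionality, and tensor layout below are assumed for illustration, not taken from the paper):

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    z_e:      (T, H, W, D) array of latents from a 3D video encoder.
    codebook: (K, D) array of learned embedding vectors.
    Returns (indices, z_q): discrete codes and their quantized vectors.
    """
    flat = z_e.reshape(-1, z_e.shape[-1])                  # (T*H*W, D)
    # Squared Euclidean distance from every latent to every codebook entry.
    d = (flat ** 2).sum(1, keepdims=True) \
        - 2 * flat @ codebook.T \
        + (codebook ** 2).sum(1)                           # (T*H*W, K)
    indices = d.argmin(axis=1)                             # nearest code per latent
    z_q = codebook[indices].reshape(z_e.shape)             # quantized latents
    return indices.reshape(z_e.shape[:-1]), z_q

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))   # K=512 codes, D=64 (assumed sizes)
z_e = rng.normal(size=(8, 4, 4, 64))    # 8 frames of 4x4 spatial latents
codes, z_q = vector_quantize(z_e, codebook)
```

In a full VQ-VAE the argmin is non-differentiable, so training typically uses a straight-through estimator plus codebook and commitment losses; the discrete codes are what a sequence-to-sequence audio decoder would then consume.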
Related papers
- EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos [3.6078215038168473]
We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos.
Generating audio for silent egocentric videos could open new applications in virtual reality, assistive technologies, and augmenting existing datasets.
arXiv Detail & Related papers (2024-07-30T06:57:00Z)
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FMs) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-03-29T09:07:31Z)
- Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation [24.403772976932487]
Sounding Video Generator (SVG) is a unified framework for generating realistic videos along with audio signals.
A VQGAN transforms visual frames and audio mel-spectrograms into discrete tokens.
A Transformer-based decoder is used to model associations between texts, visual frames, and audio signals.
arXiv Detail & Related papers (2023-01-04T01:33:42Z)
- Object Segmentation with Audio Context [0.5243460995467893]
This project explores multimodal feature aggregation for the video instance segmentation task.
We integrate audio features into our video segmentation model to conduct an audio-visual learning scheme.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.