VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer
- URL: http://arxiv.org/abs/2203.04099v1
- Date: Tue, 8 Mar 2022 14:08:47 GMT
- Title: VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer
- Authors: Juan F. Montesinos, Venkatesh S. Kadandale, Gloria Haro
- Abstract summary: This paper presents an audio-visual approach for voice separation.
It outperforms state-of-the-art methods at a low latency in two scenarios: speech and singing voice.
- Score: 4.167459103689587
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents an audio-visual approach for voice separation which
outperforms state-of-the-art methods at a low latency in two scenarios: speech
and singing voice. The model is based on a two-stage network. Motion cues are
obtained with a lightweight graph convolutional network that processes face
landmarks. Then, both audio and motion features are fed to an audio-visual
transformer which produces a fairly good estimation of the isolated target
source. In a second stage, the predominant voice is enhanced with an audio-only
network. We present different ablation studies and comparison to
state-of-the-art methods. Finally, we explore the transferability of models
trained for speech separation in the task of singing voice separation. The
demos, code, and weights will be made publicly available at
https://ipcv.github.io/VoViT/
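As a rough illustration of the first stage described in the abstract (a graph convolutional network that processes face landmarks), the sketch below applies one generic graph-convolution layer to a sequence of 2D landmarks. It is not the authors' network: the layer type (a symmetric-normalized graph convolution with ReLU), the five-node chain graph, the feature sizes, and the names `normalized_adjacency` and `graph_conv` are all assumptions chosen for illustration.

```python
import numpy as np


def normalized_adjacency(edges, num_nodes):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected landmark graph."""
    adj = np.eye(num_nodes)  # self-loops so each node keeps its own features
    for i, j in edges:
        adj[i, j] = 1.0
        adj[j, i] = 1.0
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_conv(landmarks, adj_norm, weight):
    """Apply one graph-convolution layer to every frame.

    landmarks: (frames, nodes, in_features)
    adj_norm:  (nodes, nodes)
    weight:    (in_features, out_features)
    returns:   (frames, nodes, out_features)
    """
    # Average each node's features with those of its graph neighbours.
    mixed = np.einsum("ij,tjf->tif", adj_norm, landmarks)
    # Linear projection followed by ReLU.
    return np.maximum(mixed @ weight, 0.0)


# Hypothetical toy setup: 25 frames, 5 landmarks connected in a chain.
rng = np.random.default_rng(0)
num_frames, num_nodes = 25, 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
landmarks = rng.normal(size=(num_frames, num_nodes, 2))  # (x, y) per landmark
weight = rng.normal(size=(2, 8))  # random stand-in for learned weights

adj_norm = normalized_adjacency(edges, num_nodes)
motion_features = graph_conv(landmarks, adj_norm, weight)
print(motion_features.shape)
```

In a full system, per-frame features like these would be pooled or passed on to the audio-visual transformer; that part is not sketched here.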
Related papers
- From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846]
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation.
VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively.
Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
arXiv Detail & Related papers (2024-09-27T20:26:34Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- A cappella: Audio-visual Singing Voice Separation [4.6453787256723365]
We explore the single-channel singing voice separation problem from a multimodal perspective.
We present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube.
We propose Y-Net, an audio-visual convolutional neural network which achieves state-of-the-art singing voice separation results.
arXiv Detail & Related papers (2021-04-20T13:17:06Z)
- Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations [16.447597767676655]
We aim to pinpoint the source location in the input video sequence.
Recent works have shown impressive audio-visual separation results when using prior knowledge of the source type.
We propose a two-stage architecture, called Appearance and Motion network (AMnet), where the stages specialise to appearance and motion cues.
arXiv Detail & Related papers (2021-04-17T10:09:15Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method uses an acoustic model, trained for automatic speech recognition, together with melody-extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.