Disentangling speech from surroundings with neural embeddings
- URL: http://arxiv.org/abs/2203.15578v2
- Date: Sun, 4 Jun 2023 18:08:38 GMT
- Title: Disentangling speech from surroundings with neural embeddings
- Authors: Ahmed Omran, Neil Zeghidour, Zalán Borsos, Félix de Chaumont Quitry, Malcolm Slaney, Marco Tagliasacchi
- Abstract summary: We present a method to separate speech signals from noisy environments in the embedding space of a neural audio codec.
We introduce a new training procedure that allows our model to produce structured encodings of audio waveforms given by embedding vectors.
- Score: 17.958451380305892
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method to separate speech signals from noisy environments in the
embedding space of a neural audio codec. We introduce a new training procedure
that allows our model to produce structured encodings of audio waveforms given
by embedding vectors, where one part of the embedding vector represents the
speech signal, and the rest represent the environment. We achieve this by
partitioning the embeddings of different input waveforms and training the model
to faithfully reconstruct audio from mixed partitions, thereby ensuring each
partition encodes a separate audio attribute. As use cases, we demonstrate the
separation of speech from background noise or from reverberation
characteristics. Our method also allows for targeted adjustments of the audio
output characteristics.
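To make the partition-and-remix training idea concrete, here is a minimal sketch of one training step. It assumes a hypothetical codec object exposing `encode`/`decode`, additive mixing of speech and environment waveforms, a channel-wise split of the embedding at index `split`, and an L1 reconstruction loss; these names and choices are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the partition-and-remix training step described in the abstract.
# `codec`, `encode`, `decode`, and the additive mixing model are assumptions.
import torch


def remix_training_step(codec, speech_a, env_a, speech_b, env_b, split,
                        recon_loss=torch.nn.functional.l1_loss):
    """Encode two mixtures, swap their 'environment' partitions, and ask the
    decoder to reconstruct the waveforms implied by the swapped attributes."""
    # Encode both input waveforms into embeddings of shape (batch, frames, dims).
    emb_a = codec.encode(speech_a + env_a)   # speech A in environment A
    emb_b = codec.encode(speech_b + env_b)   # speech B in environment B

    # Partition each embedding: first `split` dims ~ speech, the rest ~ environment.
    speech_part_a, env_part_a = emb_a[..., :split], emb_a[..., split:]
    speech_part_b, env_part_b = emb_b[..., :split], emb_b[..., split:]

    # Cross the partitions: speech of A with environment of B, and vice versa.
    mixed_ab = torch.cat([speech_part_a, env_part_b], dim=-1)
    mixed_ba = torch.cat([speech_part_b, env_part_a], dim=-1)

    # Forcing faithful reconstruction of the re-mixed audio encourages each
    # partition to carry only its own attribute.
    return (recon_loss(codec.decode(mixed_ab), speech_a + env_b) +
            recon_loss(codec.decode(mixed_ba), speech_b + env_a))


def denoise(codec, noisy_wave, split):
    """Illustrative inference: keep the speech partition and replace the
    environment partition (here with zeros; an assumed, not confirmed, choice)."""
    emb = codec.encode(noisy_wave)
    quiet_env = torch.zeros_like(emb[..., split:])
    return codec.decode(torch.cat([emb[..., :split], quiet_env], dim=-1))
```

In the same spirit, targeted adjustment of the output (e.g. changing the apparent environment) would amount to substituting a different environment partition before decoding; the zero-replacement in `denoise` is only one possible choice.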
Related papers
- TokenSplit: Using Discrete Speech Representations for Direct, Refined,
and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z) - AudioSlots: A slot-centric generative model for audio separation [26.51135156983783]
We present AudioSlots, a slot-centric generative model for blind source separation in the audio domain.
We train the model in an end-to-end manner using a permutation-equivariant loss function.
Our results on Libri2Mix speech separation serve as a proof of concept that this approach is promising.
arXiv Detail & Related papers (2023-05-09T16:28:07Z) - CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled
Videos [44.14061539284888]
We propose to approach text-queried universal sound separation by using only unlabeled data.
The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model.
While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting.
arXiv Detail & Related papers (2022-12-14T07:21:45Z) - LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z) - Using multiple reference audios and style embedding constraints for
speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z) - Audio Captioning with Composition of Acoustic and Semantic Information [1.90365714903665]
We present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings.
To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings.
Our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics.
arXiv Detail & Related papers (2021-05-13T15:30:14Z) - End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z) - VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z) - Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.