Discretization and Re-synthesis: an alternative method to solve the
Cocktail Party Problem
- URL: http://arxiv.org/abs/2112.09382v1
- Date: Fri, 17 Dec 2021 08:35:40 GMT
- Title: Discretization and Re-synthesis: an alternative method to solve the
Cocktail Party Problem
- Authors: Jing Shi, Xuankai Chang, Tomoki Hayashi, Yen-Ju Lu, Shinji Watanabe,
Bo Xu
- Abstract summary: This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
Once the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by a synthesis model that takes the discrete symbols as input.
- Score: 65.25725367771075
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep learning-based models have significantly improved the performance of
speech separation on input mixtures such as those in the cocktail party problem. Prominent
methods (e.g., frequency-domain and time-domain speech separation) usually
build regression models to predict the ground-truth speech from the mixture,
using masking-based designs and signal-level loss criteria (e.g., MSE
or SI-SNR). This study demonstrates, for the first time, that a
synthesis-based approach can also perform well on this problem, with great
flexibility and strong potential. Specifically, we propose a novel speech
separation/enhancement model based on the recognition of discrete symbols, and
convert the paradigm of speech separation/enhancement tasks from
regression to classification. Once the discrete symbol sequence is predicted,
each target speech signal can be re-synthesized by a synthesis model that takes
the discrete symbols as input. Evaluation results on the
WSJ0-2mix and VCTK-noisy corpora in various settings show that our proposed
method can steadily synthesize the separated speech with high quality,
free of the interference that is difficult to avoid in regression-based
methods. In addition, with negligible loss of listening quality, speaker
conversion of the enhanced/separated speech can be easily realized through our
method.
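For readers unfamiliar with the signal-level criteria named above, here is a minimal NumPy sketch of the SI-SNR objective used by the regression-based baselines; the function name and epsilon are our own choices, not from the paper.

```python
import numpy as np

def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR (in dB) between an estimated and a target waveform."""
    # Zero-mean both signals so the measure is invariant to DC offset.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to isolate the "true" component.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```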
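The paper's central move is recasting separation as classification over a discrete symbol inventory, followed by re-synthesis. The PyTorch sketch below illustrates that recipe under stated assumptions: `Separator`, the codebook size, and all hyperparameters are hypothetical stand-ins, and the synthesis model that maps symbols back to waveforms is omitted.

```python
import torch
import torch.nn as nn

CODEBOOK_SIZE = 512   # size of the discrete symbol inventory (assumed)
NUM_SPEAKERS = 2      # WSJ0-2mix contains two overlapping speakers

class Separator(nn.Module):
    """Hypothetical encoder mapping mixture features to per-speaker symbol logits."""
    def __init__(self, feat_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # One classification head per target speaker.
        self.heads = nn.ModuleList(
            nn.Linear(2 * hidden, CODEBOOK_SIZE) for _ in range(NUM_SPEAKERS))

    def forward(self, mixture_feats: torch.Tensor) -> list[torch.Tensor]:
        h, _ = self.encoder(mixture_feats)           # (B, T, 2 * hidden)
        return [head(h) for head in self.heads]      # per speaker: (B, T, CODEBOOK_SIZE)

# Training reduces to cross-entropy against the symbols of each clean source.
criterion = nn.CrossEntropyLoss()
model = Separator()
mixture_feats = torch.randn(4, 100, 80)              # (batch, frames, features)
target_symbols = torch.randint(0, CODEBOOK_SIZE, (NUM_SPEAKERS, 4, 100))

logits = model(mixture_feats)
loss = sum(criterion(l.transpose(1, 2), t) for l, t in zip(logits, target_symbols))
# At inference, argmax over the logits yields symbol sequences that a separately
# trained synthesis model (not shown) converts back to waveforms.
```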
Related papers
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition.
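The continuous mel-spectrogram frames MELLE predicts are standard log-mel features; as a point of reference, here is a minimal librosa sketch of extracting them (file name and parameter values are illustrative only, not MELLE's).

```python
import librosa
import numpy as np

# Load any mono waveform; the path and sample rate are placeholders.
y, sr = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = np.log(mel + 1e-6)   # shape (n_mels, frames): one column per frame
```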
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion [2.3443118032034396]
This paper proposes a simple and robust zero-shot voice conversion system with a cycle structure and mel-spectrogram pre-processing.
Our model outperforms existing state-of-the-art results in both subjective and objective evaluations.
arXiv Detail & Related papers (2023-10-10T11:50:16Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned by different speeches.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Single-channel speech separation using Soft-minimum Permutation Invariant Training [60.99112031408449]
A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution to the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
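For context, the assignment problem PIT addresses can be sketched in a few lines: evaluate a loss under every permutation of outputs against the reference sources and keep the best. A toy NumPy version, with MSE standing in for the real training loss:

```python
from itertools import permutations

import numpy as np

def pit_mse(estimates: np.ndarray, references: np.ndarray) -> float:
    """Permutation-invariant MSE: try every output-to-source assignment and
    return the loss of the best one. Shapes: (num_sources, num_samples)."""
    num_sources = estimates.shape[0]
    best = np.inf
    for perm in permutations(range(num_sources)):
        loss = np.mean((estimates[list(perm)] - references) ** 2)
        best = min(best, loss)
    return best
```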
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
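As a rough illustration of the diffusion mechanism behind DiffuSE (not the model itself), the forward process corrupts clean speech with Gaussian noise whose closed-form marginal is easy to sample; the noise schedule below is an assumption, not DiffuSE's.

```python
import numpy as np

def forward_diffusion(x0: np.ndarray, t: int, betas: np.ndarray) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) for a DDPM-style noising process."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])   # cumulative signal retention
    noise = np.random.randn(*x0.shape)          # standard Gaussian noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# Example: a linear beta schedule over 200 steps (illustrative values).
betas = np.linspace(1e-4, 0.02, 200)
clean = np.random.randn(16000)                  # placeholder 1-second waveform
noisy_t = forward_diffusion(clean, t=100, betas=betas)
```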
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters [2.6572330982240935]
This paper presents new techniques for a continuous vocoder, in which all features are continuous, yielding a flexible speech synthesis system.
A new continuous noise masking scheme based on phase distortion is proposed to eliminate the perceptual impact of residual noise.
Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) networks are studied and applied to model the continuous parameters for more natural, human-like speech.
arXiv Detail & Related papers (2021-06-19T12:05:01Z)
- Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech [4.348588963853261]
We introduce Grad-TTS, a novel text-to-speech model with a score-based decoder that produces mel-spectrograms.
The framework of flexible differential equations helps us to generalize conventional diffusion probabilistic models.
Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score.
arXiv Detail & Related papers (2021-05-13T14:47:44Z)
- Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech [1.4337588659482519]
We present an innovative technique for speaker adaptation in order to improve the accuracy of segmentation with application to unit-selection Text-To-Speech (TTS) systems.
Unlike conventional techniques for speaker adaptation, we aim to use only context-dependent characteristics extrapolated with linguistic analysis techniques.
arXiv Detail & Related papers (2020-03-05T12:42:13Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from the multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.