Voice Conversion Based Speaker Normalization for Acoustic Unit Discovery
- URL: http://arxiv.org/abs/2105.01786v1
- Date: Tue, 4 May 2021 22:40:41 GMT
- Title: Voice Conversion Based Speaker Normalization for Acoustic Unit Discovery
- Authors: Thomas Glarner, Janek Ebbers, Reinhold Häb-Umbach
- Abstract summary: We propose an unsupervised speaker normalization technique prior to unit discovery.
It is based on separating speaker related from content induced variations in a speech signal with an adversarial contrastive predictive coding approach.
Experiments on English, Yoruba and Mboshi show improvements compared to using non-normalized input.
- Score: 3.128267020893596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Discovering speaker independent acoustic units purely from spoken input is
known to be a hard problem. In this work we propose an unsupervised speaker
normalization technique prior to unit discovery. It is based on separating
speaker related from content induced variations in a speech signal with an
adversarial contrastive predictive coding approach. This technique requires
neither transcribed speech nor speaker labels and, furthermore, can be trained
in a multilingual fashion, thus achieving speaker normalization even when only
little unlabeled data is available from the target language. The speaker
normalization is performed by mapping all utterances to a medoid style that is
representative of the whole database. We demonstrate the effectiveness of the
approach by conducting acoustic unit discovery with a hidden Markov model
variational autoencoder, noting, however, that the proposed speaker
normalization can serve as a front end to any unit discovery system.
Experiments on English, Yoruba and Mboshi show improvements compared to using
non-normalized input.
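The medoid-style mapping described in the abstract can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the utterance-level style embeddings, the Euclidean distance metric, and the `find_medoid` helper are all assumptions made for the example.

```python
import numpy as np

def find_medoid(embeddings):
    """Return the index of the medoid: the embedding whose summed
    distance to all other embeddings is smallest.

    embeddings: (n, d) array of per-utterance style vectors (assumed).
    """
    # Pairwise Euclidean distance matrix, shape (n, n).
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # The medoid minimizes the total distance to every other point,
    # making it a representative style for the whole database.
    return int(np.argmin(dists.sum(axis=1)))
```

Once such a medoid is selected, every utterance would be converted to that single style, so the unit discovery system sees speaker-normalized input.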
Related papers
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Self-supervised Predictive Coding Models Encode Speaker and Phonetic Information in Orthogonal Subspaces [14.301142521638123]
Self-supervised speech representations are known to encode both speaker and phonetic information.
We propose a new speaker normalization method which collapses the subspace that encodes speaker information.
arXiv Detail & Related papers (2023-05-21T14:03:54Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
We propose two data augmentation techniques to alleviate the second problem: making embedding extractors aware of overlapped speech or speaker-change input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
- Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS [36.023566245506046]
We propose a human-in-the-loop speaker-adaptation method for multi-speaker text-to-speech.
The proposed method uses a sequential line search algorithm that repeatedly asks a user to select a point on a line segment in the embedding space.
Experimental results indicate that the proposed method can achieve comparable performance to the conventional one in objective and subjective evaluations.
arXiv Detail & Related papers (2022-06-21T11:08:05Z)
- Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model [34.061441900912136]
We argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly.
We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers.
arXiv Detail & Related papers (2021-10-31T09:28:04Z)
- Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.