Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS
- URL: http://arxiv.org/abs/2206.10256v1
- Date: Tue, 21 Jun 2022 11:08:05 GMT
- Title: Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS
- Authors: Kenta Udagawa, Yuki Saito, Hiroshi Saruwatari
- Abstract summary: We propose a human-in-the-loop speaker-adaptation method for multi-speaker text-to-speech.
The proposed method uses a sequential line search algorithm that repeatedly asks a user to select a point on a line segment in the embedding space.
Experimental results indicate that the proposed method can achieve comparable performance to the conventional one in objective and subjective evaluations.
- Score: 36.023566245506046
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper proposes a human-in-the-loop speaker-adaptation method for
multi-speaker text-to-speech. With a conventional speaker-adaptation method, a
target speaker's embedding vector is extracted from his/her reference speech
using a speaker encoder trained on a speaker-discriminative task. However, this
method cannot obtain an embedding vector for the target speaker when the
reference speech is unavailable. Our method is based on a human-in-the-loop
optimization framework, which involves a user in exploring the
speaker-embedding space to find the target speaker's embedding. The proposed
method uses a sequential line search algorithm that repeatedly asks a user to
select a point on a line segment in the embedding space. To efficiently choose
the best speech sample from multiple stimuli, we also developed a system in
which a user can switch between multiple speakers' voices for each phoneme
while looping an utterance. Experimental results indicate that the proposed
method can achieve comparable performance to the conventional one in objective
and subjective evaluations, even though reference speech is never fed directly
to a speaker encoder.
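To make the search loop concrete, below is a minimal, self-contained sketch of human-in-the-loop line search over a speaker-embedding space. The paper chooses each new line segment via Bayesian sequential line search over the user's past selections; this sketch substitutes random search directions, and it replaces synthesis-plus-listening with a simulated user who prefers points close to a hidden target embedding. All names and constants (DIM, step sizes, the simulated user) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64                                   # speaker-embedding dimensionality (illustrative)
target = rng.standard_normal(DIM)          # hidden "target speaker" embedding

def simulated_user_best_t(a, b, n_points=11):
    """Stand-in for the listening test: among n_points candidates on the
    segment [a, b], return the t whose embedding is closest to the hidden
    target. A real session would synthesize speech at each t and let the
    user pick the most similar-sounding sample."""
    ts = np.linspace(0.0, 1.0, n_points)
    dists = [np.linalg.norm((1 - t) * a + t * b - target) for t in ts]
    return ts[int(np.argmin(dists))]

def sequential_line_search(n_iters=30, step=2.0):
    best = rng.standard_normal(DIM)        # random initial embedding
    for _ in range(n_iters):
        d = rng.standard_normal(DIM)
        d /= np.linalg.norm(d)             # simplified: random unit direction
        a, b = best - step * d, best + step * d
        t = simulated_user_best_t(a, b)    # "user" picks a point on the segment
        best = (1 - t) * a + t * b
        step *= 0.95                       # shrink segments as the search homes in
    return best

found = sequential_line_search()
print("distance to target:", np.linalg.norm(found - target))
```

In the actual system, each candidate t triggers TTS synthesis, and the looping, per-phoneme voice-switching interface described in the abstract is what makes comparing many candidates per iteration practical.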
Related papers
- Investigation of Speaker Representation for Target-Speaker Speech Processing [49.110228525976794]
This paper aims to address a fundamental question: what is the preferred speaker embedding for target-speaker speech processing tasks?
For the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector.
Our analysis reveals that speaker verification performance is somewhat unrelated to TS task performance, that the one-hot vector outperforms enrollment-based embeddings, and that the optimal embedding depends on the input mixture (a toy comparison of the two embedding styles is sketched below).
arXiv Detail & Related papers (2024-10-15T03:58:13Z)
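As a toy illustration of the two conditioning styles compared in this paper, the sketch below builds an enrollment-based embedding (mean-pooled output of a stand-in "encoder") and an identity-based embedding (a one-hot vector mapped through a lookup table). All weights, dimensions, and names are invented for the example; a real system would use a trained speaker encoder (e.g., an x-vector model) and a learned speaker table.

```python
import numpy as np

N_SPEAKERS, FEAT_DIM, EMB_DIM = 8, 40, 16
rng = np.random.default_rng(1)

def enrollment_embedding(enrollment_frames, encoder_weight):
    """Enrollment-based conditioning: run the target speaker's enrollment
    speech through an encoder and mean-pool over time. The 'encoder' here
    is a single tanh layer, purely for illustration."""
    return np.tanh(enrollment_frames @ encoder_weight).mean(axis=0)

def one_hot_embedding(speaker_id, speaker_table):
    """Identity-based conditioning: a one-hot vector over the known
    speakers, mapped through a (normally learned) lookup table."""
    return np.eye(N_SPEAKERS)[speaker_id] @ speaker_table

encoder_w = rng.standard_normal((FEAT_DIM, EMB_DIM))
speaker_table = rng.standard_normal((N_SPEAKERS, EMB_DIM))

e_enroll = enrollment_embedding(rng.standard_normal((200, FEAT_DIM)), encoder_w)
e_onehot = one_hot_embedding(speaker_id=3, speaker_table=speaker_table)
print(e_enroll.shape, e_onehot.shape)  # both (16,): drop-in conditioning vectors
```

Either vector conditions the downstream TS model the same way, which is what makes the paper's head-to-head comparison possible.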
- Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition [27.35304346509647]
We introduce speaker labels into an autoregressive transformer-based speech recognition model.
We then propose a novel speaker mask branch to detect the speech segments of individual speakers.
With the proposed model, we can perform both speech recognition and speaker diarization tasks simultaneously.
arXiv Detail & Related papers (2023-12-18T06:29:53Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model-Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize speech with high speaker similarity from a few enrollment samples, using fewer adaptation steps than the speaker-adaptation baseline (the meta-learning pattern is sketched after this entry).
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
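Meta-TTS applies MAML to a full multi-speaker TTS model, which will not fit in a snippet, so the sketch below shows only the meta-learning pattern, in its first-order variant, on a toy per-"speaker" regression family: the outer loop learns an initialization from which a single inner gradient step adapts well to a new task, analogous to adapting a TTS initialization from a few enrollment samples. The task family and learning rates are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, INNER_LR, META_LR = 8, 0.1, 0.02

def sample_task():
    """A 'speaker' is a hidden weight vector; its data are (x, w.x) pairs."""
    w = rng.standard_normal(DIM)
    def batch(n=16):
        x = rng.standard_normal((n, DIM))
        return x, x @ w
    return batch

def mse_grad(theta, x, y):
    """Gradient of mean squared error for the linear model x @ theta."""
    return 2.0 * x.T @ (x @ theta - y) / len(y)

theta = np.zeros(DIM)                      # meta-initialization
for _ in range(500):
    batch = sample_task()
    x_s, y_s = batch()                     # support set ("enrollment" samples)
    adapted = theta - INNER_LR * mse_grad(theta, x_s, y_s)   # inner adaptation
    x_q, y_q = batch()                     # query set
    theta -= META_LR * mse_grad(adapted, x_q, y_q)           # first-order outer step

# After meta-training, one inner step from theta fits a brand-new "speaker".
test = sample_task()
x, y = test()
adapted = theta - INNER_LR * mse_grad(theta, x, y)
x_eval, y_eval = test()
print("post-adaptation MSE:", np.mean((x_eval @ adapted - y_eval) ** 2))
```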
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on LibriSpeechMix, a multi-talker dataset derived from LibriSpeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
- Guided Training: A Simple Method for Single-channel Speaker Separation [40.34570426165019]
We propose a strategy to train a long short-term memory (LSTM) model to solve the permutation problem in speaker separation.
Owing to its powerful sequence-modeling capability, an LSTM can use its memory cells to track and separate target speech from interfering speech.
arXiv Detail & Related papers (2021-03-26T08:46:50Z)
- Deep Ad-hoc Beamforming Based on Speaker Extraction for Target-Dependent Speech Separation [7.453268060082337]
We propose deep ad-hoc beamforming based on speaker extraction, which is, to our knowledge, the first work on target-dependent speech separation based on ad-hoc microphone arrays and deep learning.
Experimental results demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2020-12-01T11:06:36Z)
- Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach to speaker diarization that leverages lexical information provided by automatic speech recognition.
We propose a speaker diarization system that incorporates word-level speaker turn probabilities, together with speaker embeddings, into the speaker clustering process to improve overall diarization accuracy (a toy fusion rule is sketched after this entry).
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
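One way to picture the fusion of lexical and acoustic cues is the toy sketch below: it starts from a cosine-affinity matrix over segment-level speaker embeddings and attenuates links between adjacent segments where ASR-derived word-level cues indicate a likely speaker turn. The specific fusion rule and the weight alpha are invented for illustration; the paper integrates turn probabilities into its clustering process in its own way.

```python
import numpy as np

def lexically_refined_affinity(embeddings, turn_prob, alpha=0.5):
    """Toy fusion for diarization clustering: cosine affinity between
    segment embeddings, with links between adjacent segments attenuated
    where turn_prob[i] = P(speaker turn between segments i and i+1)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    aff = e @ e.T                          # cosine similarity matrix
    for i, p in enumerate(turn_prob):
        aff[i, i + 1] *= 1.0 - alpha * p   # likely turn -> weaker link
        aff[i + 1, i] = aff[i, i + 1]      # keep the matrix symmetric
    return aff

rng = np.random.default_rng(0)
emb = rng.standard_normal((6, 32))         # 6 segments, 32-dim embeddings
turns = rng.uniform(size=5)                # turn probability between neighbors
A = lexically_refined_affinity(emb, turns)
print(A.shape)                             # (6, 6): input to spectral clustering
```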
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics (the conditioning idea is sketched below).
However, SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that the proposed strategies for improving speaker discrimination greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)
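SpeakerBeam conditions an extraction network on a speaker vector computed from the adaptation utterance. The numpy sketch below shows only that multiplicative-adaptation idea on toy features; the one-layer "networks", shapes, and sigmoid mask are stand-ins rather than the paper's time-domain architecture, and the discrimination-improving strategies evaluated in the paper are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID = 40, 64                          # illustrative feature / hidden sizes

def speaker_vector(adapt_feats, w_aux):
    """Auxiliary branch: project the target speaker's adaptation utterance
    and mean-pool over time into a single speaker vector."""
    return np.tanh(adapt_feats @ w_aux).mean(axis=0)

def extract_target(mix_feats, spk_vec, w_in, w_out):
    """Main branch: the speaker vector multiplicatively scales hidden units
    (the SpeakerBeam-style adaptation), steering a sigmoid mask toward the
    target speaker's time-frequency bins."""
    h = np.tanh(mix_feats @ w_in) * spk_vec          # multiplicative adaptation
    mask = 1.0 / (1.0 + np.exp(-(h @ w_out)))        # per-bin soft mask
    return mask * mix_feats

w_aux, w_in, w_out = (rng.standard_normal(s) for s in [(FEAT, HID), (FEAT, HID), (HID, FEAT)])
adapt = rng.standard_normal((120, FEAT))    # adaptation-utterance features
mix = rng.standard_normal((300, FEAT))      # mixture features
est = extract_target(mix, speaker_vector(adapt, w_aux), w_in, w_out)
print(est.shape)                            # (300, 40): masked target estimate
```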