SLMGAN: Exploiting Speech Language Model Representations for
Unsupervised Zero-Shot Voice Conversion in GANs
- URL: http://arxiv.org/abs/2307.09435v1
- Date: Tue, 18 Jul 2023 17:09:15 GMT
- Title: SLMGAN: Exploiting Speech Language Model Representations for
Unsupervised Zero-Shot Voice Conversion in GANs
- Authors: Yinghao Aaron Li, Cong Han, Nima Mesgarani
- Abstract summary: This paper introduces a new approach, SLMGAN, to leverage SLM representations for discriminative tasks within the generative adversarial network (GAN) framework.
Building upon StarGANv2-VC, we add our novel SLM-based WavLM discriminators on top of the mel-based discriminators along with our newly designed SLM feature matching loss function.
Subjective evaluation results show that SLMGAN outperforms existing state-of-the-art zero-shot voice conversion models in terms of naturalness and achieves comparable similarity.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, large-scale pre-trained speech language models (SLMs) have
demonstrated remarkable advancements in various generative speech modeling
applications, such as text-to-speech synthesis, voice conversion, and speech
enhancement. These applications typically involve mapping text or speech inputs
to pre-trained SLM representations, from which target speech is decoded. This
paper introduces a new approach, SLMGAN, to leverage SLM representations for
discriminative tasks within the generative adversarial network (GAN) framework,
specifically for voice conversion. Building upon StarGANv2-VC, we add our novel
SLM-based WavLM discriminators on top of the mel-based discriminators along
with our newly designed SLM feature matching loss function, resulting in an
unsupervised zero-shot voice conversion system that does not require text
labels during training. Subjective evaluation results show that SLMGAN
outperforms existing state-of-the-art zero-shot voice conversion models in
terms of naturalness and achieves comparable similarity, highlighting the
potential of SLM-based discriminators for related applications.
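The SLM feature matching loss described above can be sketched generically: a mean L1 distance between the SLM-based discriminator's intermediate features for real and converted speech, averaged over layers. The function name, the L1 metric, and the uniform layer weighting below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def slm_feature_matching_loss(real_feats, fake_feats):
    """Generic feature matching loss: mean L1 distance between
    per-layer discriminator features of real and converted speech,
    averaged over layers. In SLMGAN the features would come from
    WavLM-based discriminator layers; here they are plain arrays."""
    assert len(real_feats) == len(fake_feats), "one feature map per layer"
    layer_losses = [np.mean(np.abs(r - f))
                    for r, f in zip(real_feats, fake_feats)]
    return float(np.mean(layer_losses))
```

Minimizing this term pushes the generator to match the discriminator's internal SLM-derived statistics of real speech, complementing the adversarial loss.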
Related papers
- DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [45.791472119671916]
Spoken language models (SLMs) process text and speech, enabling simultaneous speech understanding and generation.
DC-Spin aims to improve speech tokenization by bridging audio signals and SLM tokens.
We propose a chunk-wise approach that enables streamable DC-Spin without retraining or degradation.
arXiv Detail & Related papers (2024-10-31T17:43:13Z)
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
Our approach utilizes the WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context.
Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations [14.437646262239612]
Self-supervised Voice Conversion (VC) architecture can be used to learn to encode transitory features, such as content, separately from stationary ones, such as speaker ID or recording conditions, creating speaker-disentangled representations.
Using speaker-disentangled codes to train LLMs for text-to-speech (TTS) allows the LLM to generate the content and the style of the speech only from the text, similarly to humans, while the speaker identity is provided by the decoder of the VC model.
Results show that LLMs trained over speaker-disentangled self-supervised representations provide an improvement of 4.7 percentage points.
arXiv Detail & Related papers (2024-02-05T15:08:19Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
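A word confusion network, as used above, compactly represents ASR lattice alternatives as an ordered sequence of competing word sets. As a hypothetical illustration of how such a structure could be rendered into an LLM prompt (the rendering format below is an assumption, not the paper's), consider:

```python
def wcn_to_prompt(wcn):
    """Render a word confusion network (a list of alternative-word
    lists, one per time slot) as a textual prompt segment.
    Slots with multiple competing hypotheses are joined with '/'
    so the LLM sees the ASR ambiguity explicitly."""
    return " ".join(
        "/".join(alts) if len(alts) > 1 else alts[0]
        for alts in wcn
    )
```

Passing the network rather than only the single top hypothesis preserves alternatives that the LLM can resolve from context.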
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
- Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion [35.23123094710891]
We propose a high-similarity any-to-one voice conversion method with the input of SSL representations.
Experimental results show that our proposed method achieves comparable similarity and higher naturalness than the supervised method.
arXiv Detail & Related papers (2023-05-16T04:52:29Z)
- Text-Free Prosody-Aware Generative Spoken Language Modeling [46.19240899818964]
We present a prosody-aware generative spoken language model (pGSLM).
It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms.
Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.
arXiv Detail & Related papers (2021-09-07T18:03:21Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
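The vector quantization step for content encoding in VQMIVC can be sketched as nearest-neighbor lookup into a learned codebook (the sketch below covers only the quantization lookup, not codebook learning or the mutual information term):

```python
import numpy as np

def vector_quantize(frames, codebook):
    """Map each feature frame to its nearest codebook entry
    (Euclidean distance). frames: (T, D), codebook: (K, D).
    Returns the quantized frames and the chosen code indices."""
    # Pairwise distances: (T, 1, D) - (1, K, D) -> (T, K)
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices
```

Discretizing the content stream this way limits how much speaker information the content codes can carry, which the MI objective then penalizes further.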
This list is automatically generated from the titles and abstracts of the papers in this site.