Deep generative factorization for speech signal
- URL: http://arxiv.org/abs/2010.14242v1
- Date: Tue, 27 Oct 2020 12:27:58 GMT
- Title: Deep generative factorization for speech signal
- Authors: Haoran Sun, Lantian Li, Yunqi Cai, Yang Zhang, Thomas Fang Zheng, Dong Wang
- Abstract summary: This paper presents a speech factorization approach based on a novel factorial discriminative normalization flow model (factorial DNF).
Experiments conducted on a two-factor case involving phonetic content and speaker trait demonstrate that the proposed factorial DNF has a powerful capability to factorize speech signals.
- Score: 25.047575079871272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Various information factors are blended in speech signals, which forms the primary difficulty for most speech information processing tasks. An intuitive idea is to factorize the speech signal into individual information factors (e.g., phonetic content and speaker trait), though this turns out to be highly challenging. This paper presents a speech factorization approach based on a novel factorial discriminative normalization flow model (factorial DNF). Experiments conducted on a two-factor case involving phonetic content and speaker trait demonstrate that the proposed factorial DNF has a powerful capability to factorize speech signals and outperforms several comparative models in terms of information representation and manipulation.
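To make the factorization idea concrete, below is a minimal sketch (not the authors' implementation) of a factorial normalizing flow in PyTorch: an invertible network maps a speech frame to a latent vector that is split into a content block and a speaker block, and each block is scored under its own class-conditional Gaussian prior. The layer counts, feature dimensions, and class counts are illustrative assumptions.

```python
# Illustrative sketch only: an invertible flow maps a speech frame to a latent
# that is split into a "content" block and a "speaker" block, each with its own
# class-conditional Gaussian prior. Dimensions and class counts are made up.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One RealNVP-style affine coupling layer (invertible)."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)              # bounded scales for stability
        z2 = x2 * torch.exp(log_s) + t
        return torch.cat([x1, z2], dim=1), log_s.sum(dim=1)

class FactorialFlow(nn.Module):
    """Flow whose latent splits into a content block and a speaker block,
    each scored by its own class-conditional Gaussian prior."""
    def __init__(self, dim=40, content_dim=20, n_phones=40, n_speakers=100):
        super().__init__()
        self.layers = nn.ModuleList(AffineCoupling(dim) for _ in range(4))
        self.content_dim = content_dim
        # Trainable per-class prior means, one table per factor.
        self.phone_means = nn.Parameter(torch.zeros(n_phones, content_dim))
        self.spk_means = nn.Parameter(torch.zeros(n_speakers, dim - content_dim))

    def forward(self, x, phone_id, spk_id):
        log_det = x.new_zeros(x.size(0))
        z = x
        for layer in self.layers:
            z, ld = layer(z)
            z = torch.flip(z, dims=[1])        # cheap invertible "permutation"
            log_det = log_det + ld
        z_c, z_s = z[:, :self.content_dim], z[:, self.content_dim:]
        # log N(z | mu_class, I) up to a constant, one conditional prior per factor
        lp_content = -0.5 * ((z_c - self.phone_means[phone_id]) ** 2).sum(dim=1)
        lp_speaker = -0.5 * ((z_s - self.spk_means[spk_id]) ** 2).sum(dim=1)
        return -(lp_content + lp_speaker + log_det).mean()   # NLL to minimize

# Toy usage: 8 random 40-dim frames with phone and speaker labels.
model = FactorialFlow()
nll = model(torch.randn(8, 40),
            phone_id=torch.randint(0, 40, (8,)),
            spk_id=torch.randint(0, 100, (8,)))
nll.backward()
```

Because the mapping is invertible, a factor could in principle be manipulated by replacing its latent block (e.g., swapping in another speaker's prior mean) and running the flow in reverse, which is the kind of information manipulation the abstract refers to.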
Related papers
- Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective [23.49276487518479]
We explore the influence of three key factors separately by transitioning the modality from text to speech in an evolving manner.
Factor A has a relatively minor impact, factor B influences syntactical and semantic modeling more obviously, and factor C exerts the most significant impact, particularly in the basic lexical modeling.
arXiv Detail & Related papers (2024-12-22T14:59:19Z)
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [127.47252277138708]
We propose NaturalSpeech 3, a TTS system with factorized diffusion models to generate natural speech in a zero-shot way.
Specifically, we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details (see the FVQ sketch after this list).
Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
arXiv Detail & Related papers (2024-03-05T16:35:25Z)
Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis [3.6159128762538018]
We show that normalizing flow based prior networks can result in more expressive speech at the cost of a slight drop in quality.
We also propose a Dynamical VAE model, that can generate higher quality speech although with decreased expressiveness and variability compared to the flow based models.
arXiv Detail & Related papers (2022-11-02T17:45:01Z)
SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn an audio representation that disentangles the part of the speech signal relevant to transcription from the part that is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
Disentangling Generative Factors in Natural Language with Discrete Variational Autoencoders [0.0]
We argue that continuous variables may not be ideal for modeling features of textual data, because most generative factors in text are discrete.
We propose a Variational Autoencoder based method which models language features as discrete variables and encourages independence between variables for learning disentangled representations.
arXiv Detail & Related papers (2021-09-15T09:10:05Z)
General-Purpose Speech Representation Learning through a Self-Supervised Multi-Granularity Framework [114.63823178097402]
This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning.
Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic information at large time scales.
arXiv Detail & Related papers (2021-02-03T08:13:21Z)
Adversarially learning disentangled speech representations for robust multi-factor voice conversion [39.91395314356084]
We propose a disentangled speech representation learning framework based on adversarial learning.
Four speech representations characterizing content, timbre, rhythm, and pitch are extracted and further disentangled.
Experimental results show that the proposed framework significantly improves the robustness of voice conversion (VC) across multiple factors.
arXiv Detail & Related papers (2021-01-30T08:29:55Z)
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal, understand its linguistic content, and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
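As a companion to the NaturalSpeech 3 entry above, here is a toy sketch of factorized vector quantization (FVQ): a frame embedding is projected into several named subspaces, and each subspace is quantized against its own codebook with a straight-through gradient. The subspace names, dimensions, and codebook sizes are invented for illustration and do not follow the actual NaturalSpeech 3 codec.

```python
# Toy FVQ sketch (illustrative assumptions throughout): project a frame
# embedding into per-factor subspaces and quantize each against its own
# codebook by nearest neighbour, with a straight-through gradient.
import torch
import torch.nn as nn

class FactorizedVQ(nn.Module):
    def __init__(self, dim=256, sub_dim=64, codebook_size=256,
                 factors=("content", "prosody", "timbre", "detail")):
        super().__init__()
        self.factors = factors
        self.proj = nn.ModuleDict({f: nn.Linear(dim, sub_dim) for f in factors})
        self.codebooks = nn.ParameterDict(
            {f: nn.Parameter(torch.randn(codebook_size, sub_dim)) for f in factors})

    def forward(self, h):
        """h: (batch, frames, dim) -> per-factor code indices and quantized vectors."""
        codes, quantized = {}, {}
        for f in self.factors:
            z = self.proj[f](h)                              # (B, T, sub_dim)
            cb = self.codebooks[f]                           # (K, sub_dim)
            dist = (z.unsqueeze(-2) - cb).pow(2).sum(-1)     # (B, T, K) squared distances
            idx = dist.argmin(dim=-1)                        # (B, T) discrete codes
            q = cb[idx]                                      # nearest codebook entries
            quantized[f] = z + (q - z).detach()              # straight-through estimator
            codes[f] = idx
        return codes, quantized

# Toy usage: 2 utterances, 50 frames each, 256-dim frame embeddings.
fvq = FactorizedVQ()
codes, quantized = fvq(torch.randn(2, 50, 256))
print({f: tuple(c.shape) for f, c in codes.items()})         # each factor: (2, 50)
```

Keeping one codebook per factor is what allows downstream models to edit, say, the prosody codes while leaving the content and timbre codes untouched.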
This list is automatically generated from the titles and abstracts of the papers indexed on this site.
The site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.