Learning Joint Articulatory-Acoustic Representations with Normalizing Flows
- URL: http://arxiv.org/abs/2005.09463v2
- Date: Thu, 1 Oct 2020 03:54:41 GMT
- Title: Learning Joint Articulatory-Acoustic Representations with Normalizing Flows
- Authors: Pramit Saha, Sidney Fels
- Abstract summary: We find a joint latent representation between the articulatory and acoustic domains for vowel sounds via invertible neural network models.
Our approach achieves both articulatory-to-acoustic and acoustic-to-articulatory mapping, demonstrating a successful joint encoding of the two domains.
- Score: 7.183132975698293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The articulatory geometric configurations of the vocal tract and the acoustic properties of the resultant speech sound are considered to have a strong causal relationship. This paper aims at finding a joint latent representation between the articulatory and acoustic domains for vowel sounds via invertible neural network models, while simultaneously preserving the respective domain-specific features. Our model combines a convolutional autoencoder architecture with normalizing flow-based models to allow both forward and inverse mappings, in a semi-supervised manner, between the mid-sagittal vocal tract geometry of a two-degrees-of-freedom articulatory synthesizer with a 1D acoustic wave model and the Mel-spectrogram representation of the synthesized speech sounds. Our approach achieves satisfactory performance on both articulatory-to-acoustic and acoustic-to-articulatory mapping, demonstrating a successful joint encoding of the two domains.
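The abstract does not spell out the flow architecture, so the following is a minimal PyTorch sketch of the kind of invertible mapping it describes: RealNVP-style affine coupling layers that transform an acoustic latent code (e.g., produced by a convolutional autoencoder over Mel-spectrograms) into an articulatory-side code and back without loss of information. All class names, dimensions, and layer sizes here are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch only: an invertible flow between an acoustic latent and an
# articulatory code. Dimensions and layer sizes are assumptions.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style affine coupling layer: invertible by construction."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        # Small network predicting log-scale and shift for the second half
        # of the vector from the first half.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)            # keep the scaling well conditioned
        y2 = x2 * torch.exp(log_s) + t
        return torch.cat([x1, y2], dim=-1)

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * torch.exp(-log_s)
        return torch.cat([y1, x2], dim=-1)

class LatentFlow(nn.Module):
    """Stack of coupling layers with fixed feature permutations in between."""
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([AffineCoupling(dim) for _ in range(n_layers)])
        self.perms = [torch.randperm(dim) for _ in range(n_layers)]

    def forward(self, z_acoustic):
        z = z_acoustic
        for layer, perm in zip(self.layers, self.perms):
            z = layer(z[:, perm])
        return z                                       # articulatory-side code

    def inverse(self, z_articulatory):
        z = z_articulatory
        for i in range(len(self.layers) - 1, -1, -1):
            z = self.layers[i].inverse(z)
            z = z[:, torch.argsort(self.perms[i])]     # undo the permutation
        return z                                       # acoustic-side code

# Toy usage: a hypothetical convolutional autoencoder compresses a
# Mel-spectrogram frame to a 16-dim acoustic latent; the flow maps it to an
# articulatory code and exactly recovers it in the inverse direction.
flow = LatentFlow(dim=16)
z_acoustic = torch.randn(8, 16)
z_artic = flow(z_acoustic)                 # acoustic -> articulatory direction
z_recovered = flow.inverse(z_artic)        # articulatory -> acoustic direction
assert torch.allclose(z_acoustic, z_recovered, atol=1e-4)
```

Because every coupling layer is invertible in closed form, the same parameters serve both mapping directions, which is what allows a single model to cover articulatory-to-acoustic and acoustic-to-articulatory conversion.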
Related papers
- Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform Domain [131.74762114632404]
The model is trained end-to-end and performs spatial processing implicitly.
We evaluate the proposed model on a real-world dataset and show that the model matches the performance of an oracle beamformer.
arXiv Detail & Related papers (2022-06-30T17:13:01Z)
- Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator [12.685817926272161]
We develop an end-to-end deep learning framework to translate from a sequence of tagged-MRI to its corresponding audio waveform with limited dataset size.
Our framework is based on a novel fully convolutional asymmetry translator guided by a self residual attention strategy.
Our experiments, carried out with a total of 63 tagged-MRI sequences alongside speech acoustics, showed that our framework enables the generation of clear audio waveforms.
arXiv Detail & Related papers (2022-06-05T23:08:34Z)
- Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation [9.416401293559112]
We propose a computational model of speech production built around a pre-trained neural articulatory synthesizer able to reproduce complex speech stimuli from a limited set of interpretable articulatory parameters.
Both forward and inverse models are jointly trained in a self-supervised way from raw acoustic-only speech data from different speakers.
The imitation simulations are evaluated objectively and subjectively and display quite encouraging performance.
arXiv Detail & Related papers (2022-04-05T15:02:49Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging [48.7576911714538]
This paper experiments with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve articulatory-to-acoustic mapping.
We use a multi-speaker pre-trained Tacotron2 TTS model and a pre-trained WaveGlow neural vocoder.
arXiv Detail & Related papers (2021-07-26T09:19:20Z)
- Learning robust speech representation with an articulatory-regularized variational autoencoder [13.541055956177937]
We develop an articulatory model able to associate articulatory parameters describing the jaw, tongue, lips and velum configurations with vocal tract shapes and spectral features.
We show that this articulatory constraint improves model training by decreasing time to convergence and reconstruction loss at convergence, and yields better performance in a speech denoising task.
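(A minimal, illustrative sketch of this kind of articulatory-regularization objective appears after this related-papers list.)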
arXiv Detail & Related papers (2021-04-07T15:47:04Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)
- Articulatory-WaveNet: Autoregressive Model For Acoustic-to-Articulatory Inversion [6.58411552613476]
Articulatory-WaveNet is a new approach for acoustic-to-articulatory inversion.
The system was trained and evaluated on the ElectroMagnetic Articulography corpus of Mandarin Accented English.
arXiv Detail & Related papers (2020-06-22T20:10:35Z)
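As referenced in the articulatory-regularized variational autoencoder entry above, the following is a minimal, purely illustrative PyTorch sketch of such a training objective: a standard spectral VAE loss plus an auxiliary term that asks the latent code to predict articulatory parameters (jaw, tongue, lips, velum descriptors). The dimensions, network sizes, and loss weights are assumptions for illustration, not details taken from that paper.

```python
# Sketch only: a VAE over spectral features with an articulatory
# regularization term. All sizes and weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArticulatoryRegularizedVAE(nn.Module):
    """Spectral VAE whose latent is also asked to predict articulatory parameters."""
    def __init__(self, spec_dim=80, latent_dim=16, artic_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(spec_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, spec_dim))
        # Auxiliary head mapping the latent to articulatory parameters.
        self.artic_head = nn.Linear(latent_dim, artic_dim)

    def forward(self, spec):
        mu, log_var = self.enc(spec).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), self.artic_head(z), mu, log_var

def loss_fn(model, spec, artic, beta=1.0, lam=1.0):
    recon, artic_pred, mu, log_var = model(spec)
    rec = F.mse_loss(recon, spec)                        # spectral reconstruction
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    art = F.mse_loss(artic_pred, artic)                  # articulatory constraint
    return rec + beta * kl + lam * art

# Toy usage with random tensors standing in for Mel frames and measured
# articulatory parameters.
model = ArticulatoryRegularizedVAE()
spec, artic = torch.randn(32, 80), torch.randn(32, 8)
print(loss_fn(model, spec, artic).item())
```

The auxiliary term `art` is what the entry above calls the articulatory constraint: it ties the latent space to interpretable articulatory quantities while the reconstruction and KL terms keep the model a standard spectral VAE.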