Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation
- URL: http://arxiv.org/abs/2204.02269v1
- Date: Tue, 5 Apr 2022 15:02:49 GMT
- Title: Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation
- Authors: Marc-Antoine Georges, Julien Diard, Laurent Girin, Jean-Luc Schwartz, Thomas Hueber
- Abstract summary: We propose a computational model of speech production built around a pre-trained neural articulatory synthesizer that can reproduce complex speech stimuli from a limited set of interpretable articulatory parameters.
Both the forward and inverse models are jointly trained in a self-supervised way from raw, acoustic-only speech data from different speakers.
The imitation simulations are evaluated objectively and subjectively and show encouraging performance.
- Score: 9.416401293559112
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a computational model of speech production combining a pre-trained
neural articulatory synthesizer able to reproduce complex speech stimuli from a
limited set of interpretable articulatory parameters, a DNN-based internal
forward model predicting the sensory consequences of articulatory commands, and
an internal inverse model based on a recurrent neural network recovering
articulatory commands from the acoustic speech input. Both forward and inverse
models are jointly trained in a self-supervised way from raw acoustic-only
speech data from different speakers. The imitation simulations are evaluated
objectively and subjectively and show encouraging performance.
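To make the training scheme concrete, here is a minimal PyTorch sketch of the joint forward/inverse loop implied by the abstract: the recurrent inverse model proposes articulatory commands for an input utterance, the forward model predicts their acoustic consequences, and both are optimized to reproduce the input. The dimensions, architectures, and the omission of the frozen pre-trained synthesizer (which in the paper grounds the articulatory space) are all simplifying assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

ART_DIM, ACOUSTIC_DIM = 10, 80  # articulatory params, mel bins (assumed sizes)

class ForwardModel(nn.Module):       # articulatory commands -> predicted acoustics
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ART_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, ACOUSTIC_DIM))
    def forward(self, art):
        return self.net(art)

class InverseModel(nn.Module):       # acoustics -> articulatory commands (recurrent)
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(ACOUSTIC_DIM, 256, batch_first=True)
        self.out = nn.Linear(256, ART_DIM)
    def forward(self, acoustics):
        h, _ = self.rnn(acoustics)
        return self.out(h)

fwd, inv = ForwardModel(), InverseModel()
opt = torch.optim.Adam(list(fwd.parameters()) + list(inv.parameters()), lr=1e-4)
mse = nn.MSELoss()

def training_step(speech):           # speech: (batch, time, ACOUSTIC_DIM)
    art = inv(speech)                # inverse model proposes articulatory commands
    pred = fwd(art)                  # forward model predicts their sensory outcome
    loss = mse(pred, speech)         # imitation objective: reproduce the input
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

loss = training_step(torch.randn(8, 100, ACOUSTIC_DIM))  # dummy acoustic batch
```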
Related papers
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
Existing DSR systems still suffer from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model that leverages neural codec language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- SelfVC: Voice Conversion With Iterative Refinement using Self Transformations [42.97689861071184]
SelfVC is a training strategy to improve a voice conversion model with self-synthesized examples.
We develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model.
Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
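A hedged sketch of the self-transformation idea, assuming a generic voice-conversion module: the model's own converted output is fed back as a training input and the model learns to recover the original. The `VCStub` module, shapes, and L1 loss below are placeholders, not the SelfVC implementation.

```python
import torch

def self_transform_step(vc_model, optimizer, utterance, speaker_embedding):
    with torch.no_grad():
        # 1) Self-synthesize: pass the utterance through the current model to
        #    obtain a transformed (imperfect) version of the input.
        transformed = vc_model(utterance, speaker_embedding)
    # 2) Train the model to recover the original from its own output.
    reconstructed = vc_model(transformed, speaker_embedding)
    loss = torch.nn.functional.l1_loss(reconstructed, utterance)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

class VCStub(torch.nn.Module):       # placeholder VC model (not SelfVC)
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(80 + 64, 80)
    def forward(self, feats, spk):   # feats: (B, T, 80), spk: (B, 64)
        spk = spk.unsqueeze(1).expand(-1, feats.size(1), -1)
        return self.proj(torch.cat([feats, spk], dim=-1))

vc = VCStub()
opt = torch.optim.Adam(vc.parameters(), lr=1e-4)
self_transform_step(vc, opt, torch.randn(4, 100, 80), torch.randn(4, 64))
```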
arXiv Detail & Related papers (2023-10-14T19:51:17Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
The key novelty of the proposed method is the direct use of an SSL model to obtain embedding vectors from speech representations trained on a large amount of data.
The disentangled embeddings enable better reproduction of unseen speakers and rhythm transfer conditioned on different utterances.
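As a rough illustration of conditioning on SSL-derived embeddings, the sketch below pools frame-level representations into an utterance-level vector; the SSL encoder is a stub standing in for a pretrained model such as HuBERT or wav2vec 2.0, and the framing parameters are assumptions.

```python
import torch
import torch.nn as nn

class SSLEncoderStub(nn.Module):            # stand-in for HuBERT / wav2vec 2.0
    def forward(self, wav):                 # wav: (batch, samples)
        frames = wav.unfold(1, 400, 320)    # naive 25 ms / 20 ms framing @ 16 kHz
        return frames.mean(-1, keepdim=True).expand(-1, -1, 768)  # (B, T, 768)

ssl = SSLEncoderStub()
reps = ssl(torch.randn(2, 16000))           # frame-level SSL representations
speaker_embedding = reps.mean(dim=1)        # utterance-level pooling: (B, 768)
# A TTS decoder would be conditioned on `speaker_embedding` (and, separately,
# on disentangled rhythm embeddings) to synthesize speech for unseen speakers.
```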
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- End-to-End Binaural Speech Synthesis [71.1869877389535]
We present an end-to-end speech synthesis system that combines a low-bitrate audio system with a powerful decoder.
We demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
arXiv Detail & Related papers (2022-07-08T05:18:36Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
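The standard encoding-model analysis behind such results can be sketched with ridge regression from model features to voxel responses; the arrays below are random placeholders and the feature dimension is an assumption.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 768))     # SSL features per time point (assumed)
Y = rng.standard_normal((1000, 50))      # cortical responses for 50 voxels

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
model = RidgeCV(alphas=np.logspace(-2, 5, 8)).fit(X_tr, Y_tr)
pred = model.predict(X_te)

# Per-voxel prediction correlation: the usual measure of how well model
# features explain a voxel's response.
r = [np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(Y.shape[1])]
print(f"mean voxelwise r = {np.mean(r):.3f}")
```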
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments additionally show that the gestural scores successfully encode phonological information.
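A minimal sketch of convolutive sparse factorization by gradient descent, assuming nonnegative factors enforced via softplus and an L1-style penalty on the activations; the paper's actual neural implementation and update rules may differ.

```python
import torch
import torch.nn.functional as F

T, D, K, L = 500, 12, 8, 20        # frames, channels, gestures, gesture length
X = torch.randn(T, D).abs()        # placeholder articulatory trajectories

W = torch.randn(K, L, D, requires_grad=True)   # gesture templates
H = torch.randn(T, K, requires_grad=True)      # gestural score (activations)
opt = torch.optim.Adam([W, H], lr=1e-2)

for step in range(1000):
    Wp, Hp = F.softplus(W), F.softplus(H)      # keep both factors nonnegative
    # Convolutive reconstruction: x_hat[t] = sum_k sum_l Hp[t-l, k] * Wp[k, l]
    x_hat = F.conv1d(F.pad(Hp.T.unsqueeze(0), (L - 1, 0)),   # (1, K, T+L-1)
                     Wp.permute(2, 0, 1).flip(-1))           # (D, K, L) kernels
    x_hat = x_hat.squeeze(0).T                               # back to (T, D)
    loss = F.mse_loss(x_hat, X) + 1e-3 * Hp.mean()           # sparse activations
    opt.zero_grad(); loss.backward(); opt.step()
```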
arXiv Detail & Related papers (2022-04-01T14:25:19Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
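A schematic of the discretize-then-resynthesize pipeline, with untrained placeholder networks: a recognizer predicts one discrete symbol stream per speaker from the mixture, and a stub synthesizer maps each stream back to features. The vocabulary size, feature dimension, and two-speaker setup are assumptions.

```python
import torch
import torch.nn as nn

VOCAB, DIM = 512, 80                       # symbol inventory, feature dim (assumed)

recognizer = nn.GRU(DIM, 256, batch_first=True)
to_symbols = nn.Linear(256, 2 * VOCAB)     # logits for two speakers
synthesizer = nn.Embedding(VOCAB, DIM)     # stub synthesizer: symbols -> features

mixture = torch.randn(1, 200, DIM)         # mixed-speech features (B, T, DIM)
h, _ = recognizer(mixture)
logits = to_symbols(h).view(1, 200, 2, VOCAB)
symbols = logits.argmax(-1)                # predicted symbols per speaker (B, T, 2)
# Re-synthesize each target speech from its own predicted symbol sequence:
speaker0 = synthesizer(symbols[..., 0])    # (B, T, DIM)
speaker1 = synthesizer(symbols[..., 1])
```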
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Mean absorption estimation from room impulse responses using virtually supervised learning [0.0]
This paper introduces and investigates a new approach to estimate mean absorption coefficients solely from a room impulse response (RIR).
This inverse problem is tackled via virtually-supervised learning, namely, the RIR-to-absorption mapping is implicitly learned by regression on a simulated dataset using artificial neural networks.
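The virtually-supervised recipe can be sketched as regression on simulated pairs; the toy "simulator" below, which simply maps higher absorption to faster exponential decay, is an assumption standing in for a proper room-acoustics simulator.

```python
import torch
import torch.nn as nn

def simulate_batch(n, rir_len=4000):
    # Toy simulator (assumption): higher absorption -> faster energy decay.
    absorption = torch.rand(n, 1)                       # mean coefficient in [0, 1]
    t = torch.arange(rir_len).float() / 16000.0
    decay = torch.exp(-t * (5.0 + 50.0 * absorption))   # (n, rir_len) envelopes
    return decay * torch.randn(n, rir_len), absorption

net = nn.Sequential(nn.Linear(4000, 256), nn.ReLU(),
                    nn.Linear(256, 1), nn.Sigmoid())
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(500):
    rir, alpha = simulate_batch(64)                     # virtual supervision
    loss = nn.functional.mse_loss(net(rir), alpha)
    opt.zero_grad(); loss.backward(); opt.step()
# The trained `net` would then be applied to measured RIRs.
```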
arXiv Detail & Related papers (2021-09-01T14:06:20Z)
- Learning robust speech representation with an articulatory-regularized variational autoencoder [13.541055956177937]
We develop an articulatory model able to associate articulatory parameters describing the jaw, tongue, lips and velum configurations with vocal tract shapes and spectral features.
We show that this articulatory constraint improves model training by decreasing time to convergence and reconstruction loss at convergence, and yields better performance in a speech denoising task.
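A compact sketch of the articulatory regularization idea, assuming paired spectral and articulatory frames: a small branch must predict articulatory parameters from the VAE latent, and its error is added to the usual reconstruction + KL objective. Dimensions and loss weights are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SPEC, LATENT, ART = 80, 16, 10          # spectral, latent, articulatory dims

enc = nn.Linear(SPEC, 2 * LATENT)       # encoder -> (mu, logvar)
dec = nn.Linear(LATENT, SPEC)
art_head = nn.Linear(LATENT, ART)       # articulatory regularization branch
params = [*enc.parameters(), *dec.parameters(), *art_head.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

def step(spec, art):                    # paired spectral and articulatory frames
    mu, logvar = enc(spec).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize
    recon = F.mse_loss(dec(z), spec)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    art_reg = F.mse_loss(art_head(z), art)                    # extra constraint
    loss = recon + 1e-2 * kl + 1e-1 * art_reg
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

step(torch.randn(32, SPEC), torch.randn(32, ART))
```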
arXiv Detail & Related papers (2021-04-07T15:47:04Z)
- Embodied Self-supervised Learning by Coordinated Sampling and Training [14.107020105091662]
We propose a novel self-supervised approach to solve inverse problems by employing the corresponding physical forward process.
The proposed approach works in an analysis-by-synthesis manner to learn an inference network by iteratively sampling and training.
We demonstrate the feasibility of the proposed method by tackling the acoustic-to-articulatory inversion problem, inferring articulatory information from speech.
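The coordinated sampling-and-training loop can be sketched as follows, with an arbitrary fixed function standing in for the physical forward process: the inference network's own outputs seed new latent samples, the forward process synthesizes matching observations, and the network is fit on these self-generated pairs.

```python
import torch
import torch.nn as nn

def forward_process(z):                 # fixed, non-learned "physics" (stub)
    return torch.sin(z @ torch.linspace(0.5, 2.0, 10).diag())

net = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 10))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for it in range(200):
    obs = torch.sin(torch.randn(64, 10))          # observations to invert
    with torch.no_grad():
        # Coordinated sampling: propose latents near the network's own guesses.
        z_sampled = net(obs) + 0.1 * torch.randn(64, 10)
    x_sampled = forward_process(z_sampled)        # synthesize paired data
    # Training: fit the inference network on the self-generated pairs.
    loss = nn.functional.mse_loss(net(x_sampled), z_sampled)
    opt.zero_grad(); loss.backward(); opt.step()
```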
arXiv Detail & Related papers (2020-06-20T14:05:47Z)
- Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more naturally sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
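A sketch of the two components named in the title, under assumed sizes: nearest-neighbor quantization of fine-grained latents into discrete codes, and an autoregressive prior trained to predict the next code. Codebook learning (e.g. straight-through gradients) is omitted.

```python
import torch
import torch.nn as nn

CODES, DIM = 256, 16
codebook = nn.Embedding(CODES, DIM)

def quantize(latents):                        # nearest-codebook-entry indices
    d = (latents.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)  # (B, T, CODES)
    return d.argmin(-1)

class ARPrior(nn.Module):                     # autoregressive prior over codes
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(CODES, DIM)
        self.rnn = nn.GRU(DIM, 64, batch_first=True)
        self.out = nn.Linear(64, CODES)
    def forward(self, codes):                 # next-code logits given past codes
        h, _ = self.rnn(self.emb(codes))
        return self.out(h)

prior = ARPrior()
codes = quantize(torch.randn(2, 30, DIM))     # (B, T) discrete prosody codes
logits = prior(codes[:, :-1])                 # teacher-forced next-code prediction
loss = nn.functional.cross_entropy(logits.reshape(-1, CODES),
                                   codes[:, 1:].reshape(-1))
```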
arXiv Detail & Related papers (2020-02-06T12:35:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.