Articulation GAN: Unsupervised modeling of articulatory learning
- URL: http://arxiv.org/abs/2210.15173v1
- Date: Thu, 27 Oct 2022 05:07:04 GMT
- Title: Articulation GAN: Unsupervised modeling of articulatory learning
- Authors: Gašper Beguš, Alan Zhou, Peter Wu, Gopala K Anumanchipalli
- Abstract summary: We introduce the Articulatory Generator to the Generative Adversarial Network paradigm.
A separate pre-trained physical model transforms the generated EMA representations to speech waveforms.
Articulatory analysis of the generated EMA representations suggests that the network learns to control articulators in a manner that closely follows human articulators during speech production.
- Score: 6.118463549086599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative deep neural networks are widely used for speech synthesis, but
most existing models directly generate waveforms or spectral outputs. Humans,
however, produce speech by controlling articulators, which results in the
production of speech sounds through physical properties of sound propagation.
We propose a new unsupervised generative model of speech production/synthesis
that includes articulatory representations and thus more closely mimics human
speech production. We introduce the Articulatory Generator to the Generative
Adversarial Network paradigm. The Articulatory Generator needs to learn to
generate articulatory representations (electromagnetic articulography or EMA)
in a fully unsupervised manner without ever accessing EMA data. A separate
pre-trained physical model (ema2wav) then transforms the generated EMA
representations to speech waveforms, which get sent to the Discriminator for
evaluation. Articulatory analysis of the generated EMA representations suggests
that the network learns to control articulators in a manner that closely
follows human articulators during speech production. Acoustic analysis of the
outputs suggests that the network learns to generate words that are part of
training data as well as novel innovative words that are absent from training
data. Our proposed architecture thus allows modeling of articulatory learning
with deep neural networks from raw audio inputs in a fully unsupervised manner.
We additionally discuss implications of articulatory representations for
cognitive models of human language and speech technology in general.
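The architecture described above amounts to a GAN in which the generator outputs EMA trajectories rather than audio, a frozen pre-trained ema2wav model converts those trajectories into waveforms, and the Discriminator only ever sees waveforms. Below is a minimal PyTorch-style sketch of such a training loop; all module names, layer sizes, and the ema2wav interface are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Articulation-GAN training loop described in the abstract.
# ArticulatoryGenerator, WaveformDiscriminator, and the ema2wav interface are
# hypothetical stand-ins; the paper's actual architectures are not specified here.
import torch
import torch.nn as nn

LATENT_DIM, EMA_CHANNELS, EMA_FRAMES, WAV_LEN = 100, 12, 128, 16384  # assumed sizes

class ArticulatoryGenerator(nn.Module):
    """Maps latent noise to pseudo-EMA trajectories (channels x frames)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, EMA_CHANNELS * EMA_FRAMES),
        )

    def forward(self, z):
        return self.net(z).view(-1, EMA_CHANNELS, EMA_FRAMES)

class WaveformDiscriminator(nn.Module):
    """Scores raw waveforms (real speech vs. generator + ema2wav output)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(WAV_LEN, 1))

    def forward(self, wav):
        return self.net(wav)

def train_step(gen, ema2wav, disc, real_wav, g_opt, d_opt):
    """One adversarial update; ema2wav is a frozen, pre-trained EMA-to-waveform model."""
    bce = nn.BCEWithLogitsLoss()
    batch = real_wav.size(0)
    z = torch.randn(batch, LATENT_DIM)
    fake_wav = ema2wav(gen(z))  # generated EMA -> waveform; ema2wav stays fixed

    # Discriminator step: it never sees EMA data, only waveforms.
    d_loss = (bce(disc(real_wav), torch.ones(batch, 1)) +
              bce(disc(fake_wav.detach()), torch.zeros(batch, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: the adversarial signal reaches the generator only
    # by flowing back through the fixed ema2wav model.
    g_loss = bce(disc(fake_wav), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```

The point this sketch is meant to illustrate, following the abstract, is that the generator is never supervised with EMA recordings: its only learning signal is the Discriminator's judgment of the waveforms produced by passing its EMA output through the fixed physical model.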
Related papers
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation [9.416401293559112]
We propose a computational model of speech production that combines a pre-trained neural articulatory synthesizer, able to reproduce complex speech stimuli from a limited set of interpretable articulatory parameters, with forward and inverse models.
Both forward and inverse models are jointly trained in a self-supervised way from raw acoustic-only speech data from different speakers.
The imitation simulations are evaluated objectively and subjectively and display quite encouraging performances.
arXiv Detail & Related papers (2022-04-05T15:02:49Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
- Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no direct access to speech data [0.0]
We introduce, to our knowledge, the most challenging objective in unsupervised lexical learning: an unsupervised network that must learn to assign unique representations for lexical items.
Strong evidence in favor of lexical learning emerges.
The architecture that combines the production and perception principles is thus able to learn to decode unique information from raw acoustic data in an unsupervised manner without ever accessing real training data.
arXiv Detail & Related papers (2022-03-22T06:04:34Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in Real-Time MRI [9.614694312155798]
We propose a novel deep neural network-based learning framework that understands acoustic information in the variable-length sequence of vocal tract shaping during speech production.
The proposed framework comprises convolutions, a recurrent network, and a connectionist temporal classification loss, trained entirely end-to-end.
To the best of our knowledge, this is the first study that demonstrates the recognition of an entire spoken sentence based on an individual's articulatory motions captured by rtMRI video.
arXiv Detail & Related papers (2021-06-16T11:20:02Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
- Generative Adversarial Phonology: Modeling unsupervised phonetic and phonological learning with neural networks [0.0]
Training deep neural networks on well-understood dependencies in speech data can provide new insights into how they learn internal representations.
This paper argues that acquisition of speech can be modeled as a dependency between random space and generated speech data in the Generative Adversarial Network architecture.
We propose a methodology to uncover the network's internal representations that correspond to phonetic and phonological properties.
arXiv Detail & Related papers (2020-06-06T20:31:23Z)
- CiwGAN and fiwGAN: Encoding information in acoustic data to model lexical learning with Generative Adversarial Networks [0.0]
Lexical learning is modeled as emergent from an architecture that forces a deep neural network to output data in such a way that unique information is retrievable from its acoustic outputs.
Networks trained on lexical items from TIMIT learn to encode unique information corresponding to lexical items in the form of categorical variables in their latent space.
We show that phonetic and phonological representations learned by the network can be productively recombined and directly paralleled to productivity in human speech.
arXiv Detail & Related papers (2020-06-04T15:33:55Z)