CiwaGAN: Articulatory information exchange
- URL: http://arxiv.org/abs/2309.07861v1
- Date: Thu, 14 Sep 2023 17:10:39 GMT
- Title: CiwaGAN: Articulatory information exchange
- Authors: Gašper Beguš, Thomas Lu, Alan Zhou, Peter Wu, Gopala K. Anumanchipalli
- Abstract summary: Humans encode information into sounds by controlling articulators and decode information from sounds using the auditory apparatus.
This paper introduces CiwaGAN, a model of human spoken language acquisition that combines unsupervised articulatory modeling with an unsupervised model of information exchange through the auditory modality.
- Score: 15.944474482218334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans encode information into sounds by controlling articulators and decode
information from sounds using the auditory apparatus. This paper introduces
CiwaGAN, a model of human spoken language acquisition that combines
unsupervised articulatory modeling with an unsupervised model of information
exchange through the auditory modality. While prior research includes
unsupervised articulatory modeling and information exchange separately, our
model is the first to combine the two components. The paper also proposes an
improved articulatory model with more interpretable internal representations.
The proposed CiwaGAN model is the most realistic approximation of human spoken
language acquisition using deep learning. As such, it is useful for cognitively
plausible simulations of the human speech act.
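To make the setup concrete, below is a minimal, hypothetical sketch of an architecture in the spirit of the abstract: a generator proposes articulatory trajectories from a latent message, a fixed synthesizer renders them as audio, and a discriminator plus a Q-network supply the adversarial and information-exchange signals. All module names, sizes, the toy synthesizer, and the training step are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a CiwaGAN-style setup: generator -> articulatory
# trajectories -> fixed synthesizer -> audio -> discriminator (realism) and
# Q-network (message recovery). Shapes and modules are illustrative assumptions.
import torch
import torch.nn as nn

N_ART, T_ART, N_AUDIO, N_MSG, Z_DIM = 12, 32, 2048, 4, 64  # assumed sizes

class ArticulatoryGenerator(nn.Module):
    """Maps (noise, one-hot message) to pseudo-articulatory trajectories."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(Z_DIM + N_MSG, 256), nn.ReLU(),
            nn.Linear(256, N_ART * T_ART), nn.Tanh(),
        )
    def forward(self, z, msg):
        out = self.net(torch.cat([z, msg], dim=-1))
        return out.view(-1, N_ART, T_ART)

class ToySynthesizer(nn.Module):
    """Stand-in for a fixed articulatory-to-speech module (not trained here)."""
    def __init__(self):
        super().__init__()
        self.upsample = nn.Upsample(size=N_AUDIO, mode="linear", align_corners=False)
        self.mix = nn.Conv1d(N_ART, 1, kernel_size=9, padding=4)
    def forward(self, art):
        return self.mix(self.upsample(art)).squeeze(1)

class AudioCritic(nn.Module):
    """Shared audio encoder with a realism head (D) and a message head (Q)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, 25, stride=4, padding=12), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 32, 25, stride=4, padding=12), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.realism = nn.Linear(32, 1)      # discriminator output
        self.message = nn.Linear(32, N_MSG)  # Q-network output (decoded message)
    def forward(self, audio):
        h = self.encoder(audio.unsqueeze(1))
        return self.realism(h), self.message(h)

# One hypothetical generator step: the generator is rewarded both for fooling
# the critic and for producing audio from which the message can be recovered.
G, synth, DQ = ArticulatoryGenerator(), ToySynthesizer(), AudioCritic()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
z = torch.randn(8, Z_DIM)
msg = nn.functional.one_hot(torch.randint(0, N_MSG, (8,)), N_MSG).float()
audio = synth(G(z, msg))
real_score, msg_logits = DQ(audio)
loss = -real_score.mean() + nn.functional.cross_entropy(msg_logits, msg.argmax(dim=1))
opt_g.zero_grad(); loss.backward(); opt_g.step()
```

In the actual model the articulatory-to-audio step would presumably be a pretrained articulatory synthesizer rather than the toy module above; the sketch only illustrates how a realism objective and a message-recovery objective can combine into a single generator update.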
Related papers
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion [38.78341787348164]
We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM).
Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms the baseline models with single modality.
arXiv Detail & Related papers (2024-01-26T08:59:07Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- Unsupervised Auditory and Semantic Entrainment Models with Deep Neural Networks [0.3222802562733786]
We present an unsupervised deep learning framework that derives meaningful representations from textual features for developing semantic entrainment.
The results show that semantic entrainment can be assessed with our model, that the models can distinguish between human-human (HH) and human-machine (HM) interactions, and that the two units of analysis for extracting acoustic features provide comparable findings.
arXiv Detail & Related papers (2023-12-22T22:33:54Z)
- Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE [2.771610203951056]
This study examines how articulatory information can be used for discovering speech units in a self-supervised setting.
We used vector-quantized variational autoencoders (VQ-VAE) to learn discrete representations from articulatory and acoustic speech data (see the sketch after this list).
Experiments were conducted on three different corpora in English and French.
arXiv Detail & Related papers (2022-06-17T14:04:24Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
We show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation [9.416401293559112]
We propose a computational model of speech production that combines a pre-trained neural articulatory synthesizer, able to reproduce complex speech stimuli from a limited set of interpretable articulatory parameters, with forward and inverse models.
Both forward and inverse models are jointly trained in a self-supervised way from raw acoustic-only speech data from different speakers.
The imitation simulations are evaluated objectively and subjectively and show encouraging performance.
arXiv Detail & Related papers (2022-04-05T15:02:49Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
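The VQ-VAE speech-unit-discovery entry above lends itself to a compact illustration. The following is a minimal, hypothetical sketch, not the paper's configuration: an encoder maps frames of concatenated acoustic and articulatory features into a latent space, a learned codebook quantizes the latents into discrete units, and a decoder reconstructs the input. Feature dimensions, codebook size, and loss weights are assumed for illustration.

```python
# Hypothetical VQ-VAE sketch for discovering discrete speech units from
# concatenated acoustic + articulatory features. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ACOUSTIC, N_ART, N_CODES, D_LATENT = 40, 12, 64, 128  # assumed dimensions

class SpeechVQVAE(nn.Module):
    def __init__(self):
        super().__init__()
        d_in = N_ACOUSTIC + N_ART
        self.encoder = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                     nn.Linear(256, D_LATENT))
        self.codebook = nn.Embedding(N_CODES, D_LATENT)
        self.decoder = nn.Sequential(nn.Linear(D_LATENT, 256), nn.ReLU(),
                                     nn.Linear(256, d_in))

    def forward(self, x):
        z_e = self.encoder(x)                            # continuous latents
        dists = torch.cdist(z_e, self.codebook.weight)   # distance to each code
        codes = dists.argmin(dim=-1)                     # discrete unit ids
        z_q = self.codebook(codes)
        # Straight-through estimator: gradients reach the encoder as if
        # quantization were the identity.
        z_st = z_e + (z_q - z_e).detach()
        recon = self.decoder(z_st)
        loss = (F.mse_loss(recon, x)                     # reconstruction
                + F.mse_loss(z_q, z_e.detach())          # codebook loss
                + 0.25 * F.mse_loss(z_e, z_q.detach()))  # commitment loss
        return recon, codes, loss

# Toy usage: frames of acoustic features concatenated with articulatory (e.g. EMA)
# features; the discovered `codes` play the role of self-supervised speech units.
model = SpeechVQVAE()
frames = torch.randn(16, N_ACOUSTIC + N_ART)
recon, codes, loss = model(frames)
loss.backward()
```

The straight-through estimator lets reconstruction gradients reach the encoder despite the non-differentiable argmin, while the codebook and commitment terms keep the codes and encoder outputs aligned.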
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.