Unsupervised Multimodal Word Discovery based on Double Articulation Analysis with Co-occurrence cues
- URL: http://arxiv.org/abs/2201.06786v2
- Date: Mon, 21 Aug 2023 06:58:13 GMT
- Title: Unsupervised Multimodal Word Discovery based on Double Articulation Analysis with Co-occurrence cues
- Authors: Akira Taniguchi, Hiroaki Murakami, Ryo Ozaki, Tadahiro Taniguchi
- Abstract summary: Human infants acquire their verbal lexicon with minimal prior knowledge of language.
This study proposes a novel fully unsupervised learning method for discovering speech units.
The proposed method can acquire words and phonemes from speech signals using unsupervised learning.
- Score: 7.332652485849632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human infants acquire their verbal lexicon with minimal prior knowledge of
language based on the statistical properties of phonological distributions and
the co-occurrence of other sensory stimuli. This study proposes a novel fully
unsupervised learning method for discovering speech units using phonological
information as a distributional cue and object information as a co-occurrence
cue. The proposed method can acquire words and phonemes from speech signals
using unsupervised learning and utilize object information based on multiple
modalities (vision, tactile, and auditory) simultaneously. The proposed method is
based on the nonparametric Bayesian double articulation analyzer (NPB-DAA)
discovering phonemes and words from phonological features, and multimodal
latent Dirichlet allocation (MLDA) categorizing multimodal information obtained
from objects. In an experiment, the proposed method showed higher word
discovery performance than baseline methods. Words that expressed the
characteristics of objects (i.e., words corresponding to nouns and adjectives)
were segmented accurately. Furthermore, we examined how learning performance is
affected by differences in the importance of linguistic information. Increasing
the weight of the word modality further improved performance relative to the
fixed-weight condition.
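The NPB-DAA and MLDA implementations themselves are not reproduced here, but the modality-weighting idea in the last sentence can be illustrated with a toy collapsed Gibbs sampler for a multimodal LDA in which each modality's contribution to the shared topic counts is scaled by a weight. Everything below (the function name, the exact weighting scheme, and the hyperparameters) is an illustrative assumption, not the authors' code.

```python
import numpy as np

def mlda_gibbs(docs, vocab_sizes, K=5, weights=None, alpha=0.1, beta=0.1,
               iters=200, seed=0):
    """Toy collapsed Gibbs sampler for a multimodal LDA (MLDA-like).

    docs[d][m] is a list of token ids observed for object d in modality m
    (e.g., vision, tactile, auditory, word); vocab_sizes[m] is modality m's
    vocabulary size.  weights[m] scales how strongly modality m pulls the
    object's shared topic distribution -- the knob varied for the word
    modality in the experiment.
    """
    rng = np.random.default_rng(seed)
    D, M = len(docs), len(vocab_sizes)
    w = np.ones(M) if weights is None else np.asarray(weights, dtype=float)

    n_dk = np.zeros((D, K))                            # weighted object-topic counts
    n_kv = [np.zeros((K, V)) for V in vocab_sizes]     # per-modality topic-word counts
    n_k = [np.zeros(K) for _ in range(M)]
    z = [[[0] * len(docs[d][m]) for m in range(M)] for d in range(D)]

    def move(d, m, v, k, sign):                        # add/remove one token's counts
        n_dk[d, k] += sign * w[m]
        n_kv[m][k, v] += sign
        n_k[m][k] += sign

    for d in range(D):                                 # random initialization
        for m in range(M):
            for i, v in enumerate(docs[d][m]):
                z[d][m][i] = k = int(rng.integers(K))
                move(d, m, v, k, +1)

    for _ in range(iters):
        for d in range(D):
            for m in range(M):
                for i, v in enumerate(docs[d][m]):
                    move(d, m, v, z[d][m][i], -1)
                    p = (n_dk[d] + alpha) * (n_kv[m][:, v] + beta) \
                        / (n_k[m] + beta * vocab_sizes[m])
                    z[d][m][i] = k = int(rng.choice(K, p=p / p.sum()))
                    move(d, m, v, k, +1)
    return n_dk, n_kv
```

Running this with, say, weights=(1, 1, 1, 5) versus all-ones weights mirrors the comparison between an increased word-modality weight and the fixed condition.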
Related papers
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is making similar progress across three main categories of methods: generative, contrastive, and predictive.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments additionally show that the gestural scores successfully encode phonological information.
arXiv Detail & Related papers (2022-04-01T14:25:19Z)
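The cited work uses a neural implementation of convolutive sparse matrix factorization; as a rough, hedged stand-in, here is a generic gradient-based convolutive factorization in PyTorch. The softplus non-negativity trick, the sparsity weight, and all shapes are assumptions for illustration only, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def convolutive_factorization(X, K=5, L=20, steps=2000, lr=0.05):
    """Factor X (channels x time) into K gesture templates W (channels x K x L)
    and a sparse gestural score H (K x time), so that
    X_hat[c, t] ~= sum_k sum_tau softplus(W)[c, k, tau] * softplus(H)[k, t - tau].
    """
    C, T = X.shape
    W = torch.randn(C, K, L, requires_grad=True)      # gesture templates
    H = torch.randn(1, K, T, requires_grad=True)      # gestural score (activations)
    opt = torch.optim.Adam([W, H], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        Wp, Hp = F.softplus(W), F.softplus(H)         # keep both factors non-negative
        # flip the kernel so conv1d's cross-correlation becomes a true convolution
        X_hat = F.conv1d(Hp, Wp.flip(-1), padding=L - 1)[..., :T]
        loss = F.mse_loss(X_hat[0], X) + 1e-3 * Hp.mean()   # reconstruction + sparse score
        loss.backward()
        opt.step()
    return F.softplus(W).detach(), F.softplus(H).detach()
```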
- Deep Learning For Prominence Detection In Children's Read Speech [13.041607703862724]
We present a system that operates on segmented speech waveforms to learn features relevant to prominent word detection for children's oral fluency assessment.
The chosen CRNN (convolutional recurrent neural network) framework, incorporating both word-level features and sequence information, is found to benefit from the perceptually motivated SincNet filters.
arXiv Detail & Related papers (2021-10-27T08:51:42Z)
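The CRNN itself is not sketched here, but the perceptually motivated SincNet front end mentioned above has a compact core idea: each band-pass filter is defined by just two learnable cutoff frequencies, as a difference of two windowed sinc low-pass filters. A minimal sketch (initialization values and the class name are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincFilters(nn.Module):
    """Minimal SincNet-style front end: n_filt band-pass filters, each
    parameterized only by a learnable low cutoff and bandwidth."""
    def __init__(self, n_filt=40, kernel=101, fs=16000):
        super().__init__()
        self.kernel = kernel
        low_hz = torch.linspace(30.0, fs / 2 - 200.0, n_filt)
        self.f_low = nn.Parameter(low_hz / fs)                   # normalized cutoffs
        self.f_band = nn.Parameter(torch.full((n_filt,), 100.0 / fs))

    def forward(self, x):                                        # x: (batch, 1, time)
        n = torch.arange(self.kernel) - self.kernel // 2
        f1 = self.f_low.abs()
        f2 = (f1 + self.f_band.abs()).clamp(max=0.5)

        def lowpass(f):   # windowed ideal low-pass at normalized cutoff f
            return 2 * f[:, None] * torch.sinc(2 * f[:, None] * n)

        filters = (lowpass(f2) - lowpass(f1)) * torch.hamming_window(self.kernel)
        return F.conv1d(x, filters[:, None, :], padding=self.kernel // 2)
```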
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding, enforcing different policies over the latent space during training.
Our experiments show that the voice cloning system built with vector quantization suffers only a small degradation in perceptual evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
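The specific policies enforced over the latent spaces are not detailed in this summary; below is only a minimal sketch of the standard vector-quantization layer such systems build on, with a straight-through gradient and VQ-VAE-style codebook and commitment losses (names and sizes are illustrative).

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ layer for discrete linguistic latents: snap each latent
    vector to its nearest codebook entry, with a straight-through gradient
    and VQ-VAE-style codebook/commitment losses."""
    def __init__(self, num_codes=256, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                                  # z: (batch, dim)
        d = torch.cdist(z, self.codebook.weight)           # distances to all codes
        idx = d.argmin(dim=1)                              # nearest code per vector
        q = self.codebook(idx)
        loss = ((q - z.detach()) ** 2).mean() \
             + self.beta * ((z - q.detach()) ** 2).mean()  # codebook + commitment
        q = z + (q - z).detach()                           # straight-through estimator
        return q, idx, loss
```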
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
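The paper's exact objective is not given in this summary, so the following is merely a generic InfoNCE-style contrastive loss over paired features (e.g., an expression embedding matched against object-instance embeddings); the pairing convention and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """Generic InfoNCE loss: row i of `a` should match row i of `b`
    (its positive) and repel every other row (the negatives)."""
    a = F.normalize(a, dim=1)
    b = F.normalize(b, dim=1)
    logits = a @ b.t() / tau                   # cosine similarities / temperature
    targets = torch.arange(a.size(0))          # positives sit on the diagonal
    return F.cross_entropy(logits, targets)
```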
- The effectiveness of unsupervised subword modeling with autoregressive and cross-lingual phone-aware networks [36.24509775775634]
We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer.
Experiments on the ABX subword discriminability task conducted with the Libri-light and ZeroSpeech 2017 databases showed that our approach is competitive or superior to state-of-the-art studies.
arXiv Detail & Related papers (2020-12-17T12:33:49Z)
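The ABX metric itself is easy to sketch: a token X of category a should be closer to another a-token A than to a b-token B. The toy version below pools each token into a single embedding and uses cosine distance, whereas the actual ZeroSpeech evaluation aggregates frame-wise distances with DTW.

```python
import numpy as np

def abx_error(cat_a, cat_b):
    """Toy ABX error between two categories of pooled token embeddings:
    the fraction of (A, B, X) triples in which X (category a) is closer
    to B (category b) than to A (category a)."""
    def dist(u, v):                       # cosine distance
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    errors, trials = 0, 0
    for i, a in enumerate(cat_a):
        for j, x in enumerate(cat_a):
            if i == j:
                continue
            for b in cat_b:
                errors += dist(x, b) < dist(x, a)   # X matched to wrong category
                trials += 1
    return errors / trials
```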
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques; more recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Measuring Memorization Effect in Word-Level Neural Networks Probing [0.9156064716689833]
We propose a simple general method for measuring the memorization effect, based on a symmetric selection of test words seen versus unseen in training.
Our method can be used to explicitly quantify the amount of memorization happening in a probing setup, so that an adequate setup can be chosen and the results of the probing can be interpreted with a reliability estimate.
arXiv Detail & Related papers (2020-06-29T14:35:42Z)
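A hedged sketch of that symmetric protocol: hold out half of the word types, train a probe on tokens of the remaining (seen) types, and read the seen-versus-unseen accuracy gap as the memorization estimate. The logistic-regression probe and all names are illustrative choices, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def memorization_gap(X, y, words, seed=0):
    """Hold out half of the word *types*, train a probe on tokens of the
    seen types, then compare accuracy on held-out tokens of seen types
    vs. tokens of unseen types.  A large gap suggests the probe memorizes
    words instead of generalizing."""
    rng = np.random.default_rng(seed)
    types = rng.permutation(np.unique(words))
    seen = set(types[: len(types) // 2])
    seen_idx = np.array([i for i, w in enumerate(words) if w in seen])
    unseen_idx = np.array([i for i, w in enumerate(words) if w not in seen])
    rng.shuffle(seen_idx)
    train, test_seen = np.split(seen_idx, [len(seen_idx) // 2])
    probe = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    return probe.score(X[test_seen], y[test_seen]) \
         - probe.score(X[unseen_idx], y[unseen_idx])
```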
- Analyzing autoencoder-based acoustic word embeddings [37.78342106714364]
Acoustic word embeddings (AWEs) are representations of words which encode their acoustic features.
We analyze basic properties of AWE spaces learned by a sequence-to-sequence encoder-decoder model in six typologically diverse languages.
AWEs exhibit a word onset bias, similar to patterns reported in various studies on human speech processing and lexical access.
arXiv Detail & Related papers (2020-04-03T16:11:57Z)
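That onset bias suggests a simple probe (a toy construction under assumed inputs, not the paper's analysis): compare the mean similarity of AWE pairs whose words share their first n characters against pairs sharing their last n characters; an onset bias predicts the former is higher.

```python
import numpy as np

def onset_bias(embs, words, n=2):
    """Toy onset-bias probe: returns (mean similarity of onset-sharing
    pairs, mean similarity of offset-sharing pairs).  Assumes both pair
    sets are non-empty and that word strings approximate phone sequences."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    onset, offset = [], []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if words[i] == words[j]:
                continue                      # skip same-word pairs
            s = cos(embs[i], embs[j])
            if words[i][:n] == words[j][:n]:
                onset.append(s)
            if words[i][-n:] == words[j][-n:]:
                offset.append(s)
    return float(np.mean(onset)), float(np.mean(offset))
```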
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.