Do Acoustic Word Embeddings Capture Phonological Similarity? An
Empirical Study
- URL: http://arxiv.org/abs/2106.08686v1
- Date: Wed, 16 Jun 2021 10:47:56 GMT
- Title: Do Acoustic Word Embeddings Capture Phonological Similarity? An
Empirical Study
- Authors: Badr M. Abdullah, Marius Mosbach, Iuliia Zaitova, Bernd Möbius,
Dietrich Klakow
- Score: 12.210797811981173
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Several variants of deep neural networks have been successfully employed for
building parametric models that project variable-duration spoken word segments
onto fixed-size vector representations, or acoustic word embeddings (AWEs).
However, it remains unclear to what degree we can rely on the distance in the
emerging AWE space as an estimate of word-form similarity. In this paper, we
ask: does the distance in the acoustic embedding space correlate with
phonological dissimilarity? To answer this question, we empirically investigate
the performance of supervised approaches for AWEs with different neural
architectures and learning objectives. We train AWE models in controlled
settings for two languages (German and Czech) and evaluate the embeddings on
two tasks: word discrimination and phonological similarity. Our experiments
show that (1) the distance in the embedding space in the best cases only
moderately correlates with phonological distance, and (2) improving the
performance on the word discrimination task does not necessarily yield models
that better reflect word phonological similarity. Our findings highlight the
necessity to rethink the current intrinsic evaluations for AWEs.
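The phonological-similarity evaluation described above can be sketched in a few lines. The following is a minimal, self-contained illustration with a hypothetical toy lexicon (the embeddings and phone transcriptions are made up, not the paper's data): compute cosine distances between word embeddings and edit distances between phone sequences, then correlate the two lists with Spearman's rho.

```python
import math
from itertools import combinations

def levenshtein(a, b):
    """Edit distance between two phone sequences (classic DP)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank(values):
    """Average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy lexicon: word -> (hypothetical AWE vector, phone transcription).
lexicon = {
    "Katze": ([1.0, 0.1, 0.0], ["k", "a", "t", "s", "@"]),
    "Tatze": ([0.9, 0.2, 0.1], ["t", "a", "t", "s", "@"]),
    "Hund":  ([0.0, 1.0, 0.2], ["h", "U", "n", "t"]),
    "Mund":  ([0.1, 0.9, 0.3], ["m", "U", "n", "t"]),
}

embedding_dists, phonological_dists = [], []
for w1, w2 in combinations(lexicon, 2):
    (e1, p1), (e2, p2) = lexicon[w1], lexicon[w2]
    embedding_dists.append(cosine_distance(e1, e2))
    phonological_dists.append(levenshtein(p1, p2))

rho = spearman(embedding_dists, phonological_dists)
print(f"Spearman correlation: {rho:.3f}")
```

The same pairwise distances would also feed the word discrimination (same-different) evaluation, which only checks whether same-word pairs score lower than different-word pairs; the paper's point is that doing well on that check does not guarantee a high correlation of the kind computed here.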
Related papers
- Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study
on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and demonstrate highly competitive speech emotion recognition accuracies.
arXiv Detail & Related papers (2024-02-04T21:24:54Z)
- Neural approaches to spoken content embedding [1.3706331473063877]
We contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs).
We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition.
arXiv Detail & Related papers (2023-08-28T21:16:08Z)
- Analyzing the Representational Geometry of Acoustic Word Embeddings [22.677210029168588]
Acoustic word embeddings (AWEs) are vector representations such that different acoustic exemplars of the same word are projected nearby.
This paper takes a closer analytical look at AWEs learned from English speech and studies how the choice of learning objective and architecture shapes their representational profile.
arXiv Detail & Related papers (2023-01-08T10:22:50Z)
- Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural-network-based visual lip-reading models.
We observe a strong correlation between these theories from cognitive psychology and our modeling results.
arXiv Detail & Related papers (2021-10-13T05:30:50Z)
- How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings [12.788276426899312]
We present a novel design based on representational similarity analysis (RSA) to analyze acoustic word embeddings (AWEs).
First, we train monolingual AWE models on seven Indo-European languages with various degrees of typological similarity.
We then employ RSA to quantify the cross-lingual similarity by simulating native and non-native spoken-word processing using AWEs.
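The RSA idea described in this entry can be sketched as follows. This is a minimal illustration with hypothetical embeddings (all vectors and word labels are made up): each model's embeddings of the same stimuli yield a pairwise dissimilarity matrix, and the correlation between the two matrices' upper triangles serves as the cross-model similarity score. Pearson correlation is used here for brevity; RSA work often prefers rank correlation.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (norm_u * norm_v)

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical AWEs of the same four spoken-word stimuli under two
# monolingual models (values invented for illustration).
model_a = {"w1": [1.0, 0.0], "w2": [0.9, 0.1], "w3": [0.0, 1.0], "w4": [0.1, 0.9]}
model_b = {"w1": [0.8, 0.2], "w2": [0.7, 0.3], "w3": [0.2, 0.8], "w4": [0.3, 0.7]}

def rdm(embeddings):
    """Upper triangle of the representational dissimilarity matrix."""
    words = sorted(embeddings)
    return [cosine_distance(embeddings[a], embeddings[b])
            for a, b in combinations(words, 2)]

# RSA score: correlation between the two models' dissimilarity structures.
rsa_score = pearson(rdm(model_a), rdm(model_b))
print(f"RSA similarity: {rsa_score:.3f}")
```

A score near 1 indicates that the two models impose a similar geometry on the shared stimuli even if the embedding spaces themselves are not directly comparable, which is what makes RSA suitable for cross-lingual comparison.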
arXiv Detail & Related papers (2021-09-21T13:51:39Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent space during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization exhibits only a small degradation in perceptual evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- NLP-CIC @ DIACR-Ita: POS and Neighbor Based Distributional Models for Lexical Semantic Change in Diachronic Italian Corpora [62.997667081978825]
We present our systems and findings on unsupervised lexical semantic change for the Italian language.
The task is to determine whether a target word has evolved its meaning over time, relying only on raw text from two time-specific datasets.
We propose two models representing the target words across the periods to predict the changing words using threshold and voting schemes.
arXiv Detail & Related papers (2020-11-07T11:27:18Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Cross-Domain Adaptation of Spoken Language Identification for Related Languages: The Curious Case of Slavic Languages [17.882477802269243]
We present a set of experiments to investigate the impact of domain mismatch on the performance of neural LID systems.
We show that out-of-domain speech samples severely hinder the performance of neural LID models.
We achieve relative accuracy improvements that range from 9% to 77% depending on the diversity of acoustic conditions in the source domain.
arXiv Detail & Related papers (2020-08-02T19:30:39Z)
- Mechanisms for Handling Nested Dependencies in Neural-Network Language Models and Humans [75.15855405318855]
We studied whether a modern artificial neural network trained with "deep learning" methods mimics a central aspect of human sentence processing.
Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of specialized units that successfully handled local and long-distance syntactic agreement.
We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns.
arXiv Detail & Related papers (2020-06-19T12:00:05Z)
- Analyzing autoencoder-based acoustic word embeddings [37.78342106714364]
Acoustic word embeddings (AWEs) are representations of words which encode their acoustic features.
We analyze basic properties of AWE spaces learned by a sequence-to-sequence encoder-decoder model in six typologically diverse languages.
AWEs exhibit a word onset bias, similar to patterns reported in various studies on human speech processing and lexical access.
arXiv Detail & Related papers (2020-04-03T16:11:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.