How Familiar Does That Sound? Cross-Lingual Representational Similarity
Analysis of Acoustic Word Embeddings
- URL: http://arxiv.org/abs/2109.10179v1
- Date: Tue, 21 Sep 2021 13:51:39 GMT
- Title: How Familiar Does That Sound? Cross-Lingual Representational Similarity
Analysis of Acoustic Word Embeddings
- Authors: Badr M. Abdullah, Iuliia Zaitova, Tania Avgustinova, Bernd Möbius,
Dietrich Klakow
- Abstract summary: We present a novel design based on representational similarity analysis (RSA) to analyze acoustic word embeddings (AWEs).
First, we train monolingual AWE models on seven Indo-European languages with various degrees of typological similarity.
We then employ RSA to quantify the cross-lingual similarity by simulating native and non-native spoken-word processing using AWEs.
- Score: 12.788276426899312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How do neural networks "perceive" speech sounds from unknown languages? Does
the typological similarity between the model's training language (L1) and an
unknown language (L2) have an impact on the model representations of L2 speech
signals? To answer these questions, we present a novel experimental design
based on representational similarity analysis (RSA) to analyze acoustic word
embeddings (AWEs) -- vector representations of variable-duration spoken-word
segments. First, we train monolingual AWE models on seven Indo-European
languages with various degrees of typological similarity. We then employ RSA to
quantify the cross-lingual similarity by simulating native and non-native
spoken-word processing using AWEs. Our experiments show that typological
similarity indeed affects the representational similarity of the models in our
study. We further discuss the implications of our work on modeling speech
processing and language similarity with neural networks.
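A minimal sketch of the RSA procedure described in the abstract, assuming each monolingual AWE model has already embedded the same set of spoken-word stimuli into fixed-dimensional vectors; the cosine-distance dissimilarity matrices and Spearman correlation used here are illustrative choices, not necessarily the authors' exact configuration.

```python
# Minimal RSA sketch (illustrative, not the authors' exact implementation).
# Assumes two AWE models (e.g., an L1 model and an L2 model) have each embedded
# the same N spoken-word stimuli into fixed-dimensional vectors.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(embeddings: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix: pairwise cosine distances
    between the embeddings of all stimuli (condensed upper-triangle form)."""
    return pdist(embeddings, metric="cosine")

def rsa_similarity(emb_l1: np.ndarray, emb_l2: np.ndarray) -> float:
    """Second-order similarity: Spearman correlation between the two RDMs."""
    rho, _ = spearmanr(rdm(emb_l1), rdm(emb_l2))
    return rho

# Toy usage: 100 stimuli, 128-dimensional AWEs from two monolingual models.
rng = np.random.default_rng(0)
awe_model_a = rng.normal(size=(100, 128))   # stand-in for a model trained on L1
awe_model_b = rng.normal(size=(100, 128))   # stand-in for a model trained on L2
print(rsa_similarity(awe_model_a, awe_model_b))
```

Computing this second-order correlation for every pair of training languages would yield the kind of cross-lingual similarity matrix the experimental design calls for.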
Related papers
- Perception of Phonological Assimilation by Neural Speech Recognition Models [3.4173734484549625]
This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds.
Using psycholinguistic stimuli, we analyze how various linguistic context cues influence compensation patterns in the model's output.
arXiv Detail & Related papers (2024-06-21T15:58:22Z)
- Exploring How Generative Adversarial Networks Learn Phonological Representations [6.119392435448723]
Generative Adversarial Networks (GANs) learn representations of phonological phenomena.
We analyze how GANs encode contrastive and non-contrastive nasality in French and English vowels.
arXiv Detail & Related papers (2023-05-21T16:37:21Z)
- Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS) [3.57486761615991]
Unified representations consistently achieve better cross-lingual synthesis with respect to both naturalness and accent.
Separate representations tend to have an order of magnitude more tokens than unified ones, which may affect model capacity.
arXiv Detail & Related papers (2022-07-04T16:14:57Z)
- Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural-network-based visual lip-reading models.
We observe a strong correlation between these theories from cognitive psychology and our modeling results.
arXiv Detail & Related papers (2021-10-13T05:30:50Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
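A minimal sketch of the grouping step described in the entry above, assuming one fixed-size representation vector per target language (e.g., mean-pooled outputs of a multilingual pre-trained model); KMeans with three clusters is an illustrative stand-in, not necessarily the paper's clustering method.

```python
# Illustrative sketch: cluster per-language representation vectors into groups
# ("representation sprachbunds"). KMeans with k=3 is an assumption for illustration.
import numpy as np
from sklearn.cluster import KMeans

languages = ["en", "de", "nl", "fr", "es", "pt", "ru", "pl", "cs"]
rng = np.random.default_rng(1)
# Stand-in for one representation vector per language extracted from a
# multilingual pre-trained model.
lang_vectors = rng.normal(size=(len(languages), 256))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(lang_vectors)
for lang, label in zip(languages, kmeans.labels_):
    print(lang, "-> cluster", label)
```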
- Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study [12.210797811981173]
In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity?
We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity.
Our experiments show that (1) the distance in the embedding space in the best cases only moderately correlates with phonological distance, and (2) improving the performance on the word discrimination task does not necessarily yield models that better reflect word phonological similarity.
arXiv Detail & Related papers (2021-06-16T10:47:56Z)
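A minimal sketch of the correlation analysis described above, assuming AWEs and phone-string transcriptions are available for a set of word types; the Levenshtein distance used as a phonological-distance proxy and the toy data are assumptions for illustration only.

```python
# Illustrative check of whether AWE distances track phonological distances.
# The word list, phone transcriptions, and vectors here are toy stand-ins.
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance as a simple proxy for phonological distance."""
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i, j] = min(dp[i - 1, j] + 1,
                           dp[i, j - 1] + 1,
                           dp[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return int(dp[len(a), len(b)])

# Toy data: phone strings and (random) AWE vectors for a few word types.
phones = {"katze": "katsə", "tatze": "tatsə", "hund": "hʊnt", "mond": "moːnt"}
rng = np.random.default_rng(2)
awes = {w: rng.normal(size=64) for w in phones}

pairs = [(a, b) for i, a in enumerate(phones) for b in list(phones)[i + 1:]]
phon_dists = [edit_distance(phones[a], phones[b]) for a, b in pairs]
emb_dists = [cosine(awes[a], awes[b]) for a, b in pairs]
rho, _ = spearmanr(phon_dists, emb_dists)
print("Spearman rho:", rho)
```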
- Decomposing lexical and compositional syntax and semantics with deep language models [82.81964713263483]
The activations of language transformers like GPT2 have been shown to linearly map onto brain activity during speech comprehension.
Here, we propose a taxonomy to factorize the high-dimensional activations of language models into four classes: lexical, compositional, syntactic, and semantic representations.
The results highlight two findings. First, compositional representations recruit a more widespread cortical network than lexical ones, and encompass the bilateral temporal, parietal and prefrontal cortices.
arXiv Detail & Related papers (2021-03-02T10:24:05Z)
- Neural Representations for Modeling Variation in Speech [9.27189407857061]
We use neural models to compute word-based pronunciation differences between non-native and native speakers of English.
We show that speech representations extracted from a specific type of neural model (i.e. Transformers) lead to a better match with human perception than two earlier approaches.
arXiv Detail & Related papers (2020-11-25T11:19:12Z)
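A minimal sketch of one common way to turn frame-level neural speech representations into a word-level pronunciation difference, assuming feature sequences (e.g., Transformer hidden states) for a native and a non-native rendition of the same word; the dynamic time warping cost shown here is an illustrative measure, not necessarily the paper's exact one.

```python
# Illustrative sketch: pronunciation difference between two renditions of a word,
# computed as a DTW alignment cost over frame-level feature sequences.
# Random features stand in for real Transformer hidden states.
import numpy as np
from scipy.spatial.distance import cdist

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Dynamic time warping cost between two (frames x dims) feature sequences,
    normalized by the combined sequence length."""
    cost = cdist(x, y, metric="euclidean")
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return float(acc[n, m] / (n + m))

rng = np.random.default_rng(3)
native = rng.normal(size=(42, 768))      # frames x hidden-dim, native speaker
non_native = rng.normal(size=(55, 768))  # same word, non-native speaker
print("pronunciation distance:", dtw_distance(native, non_native))
```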
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
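A simplified sketch of a contrastive objective over masked latent representations of the kind described above, for a single masked time step; the cosine scoring, temperature, and distractor sampling shown here are illustrative assumptions rather than wav2vec 2.0's actual implementation.

```python
# Illustrative InfoNCE-style contrastive objective over masked latent
# representations (a simplified sketch, not wav2vec 2.0's actual code).
import numpy as np
from scipy.special import logsumexp

def contrastive_loss(context: np.ndarray, true_latent: np.ndarray,
                     distractors: np.ndarray, temperature: float = 0.1) -> float:
    """Negative log-probability of identifying the true latent for one masked
    time step among sampled distractors, scored by cosine similarity."""
    candidates = np.vstack([true_latent[None, :], distractors])  # true one first
    sims = candidates @ context / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(context) + 1e-8)
    logits = sims / temperature
    return float(logsumexp(logits) - logits[0])

rng = np.random.default_rng(4)
dim = 256
ctx = rng.normal(size=dim)                 # context output at a masked step
true_z = ctx + 0.1 * rng.normal(size=dim)  # the latent that was masked out
negs = rng.normal(size=(10, dim))          # distractor latents from other steps
print("loss:", contrastive_loss(ctx, true_z, negs))
```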
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
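A minimal sketch of a triplet objective for audio track-relatedness as described in the entry above, assuming precomputed track embeddings; the Euclidean distance and margin value are illustrative choices, not the paper's exact setup.

```python
# Illustrative triplet objective for audio track-relatedness (a sketch only;
# the paper's network architecture and distance function may differ).
import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                 negative: np.ndarray, margin: float = 0.2) -> float:
    """Hinge loss encouraging the anchor-positive distance to be at least
    `margin` smaller than the anchor-negative distance."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(0.0, d_pos - d_neg + margin))

rng = np.random.default_rng(5)
anchor = rng.normal(size=128)                    # embedding of an audio track
positive = anchor + 0.05 * rng.normal(size=128)  # track with related text metadata
negative = rng.normal(size=128)                  # unrelated track
print("triplet loss:", triplet_loss(anchor, positive, negative))
```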