Neural Representations for Modeling Variation in Speech
- URL: http://arxiv.org/abs/2011.12649v3
- Date: Wed, 26 Jan 2022 13:41:25 GMT
- Title: Neural Representations for Modeling Variation in Speech
- Authors: Martijn Bartelds, Wietse de Vries, Faraz Sanal, Caitlin Richter, Mark
Liberman, Martijn Wieling
- Abstract summary: We use neural models to compute word-based pronunciation differences between non-native and native speakers of English.
We show that speech representations extracted from a specific type of neural model (i.e. Transformers) lead to a better match with human perception than two earlier approaches.
- Score: 9.27189407857061
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Variation in speech is often quantified by comparing phonetic transcriptions
of the same utterance. However, manually transcribing speech is time-consuming
and error-prone. As an alternative, we therefore investigate the extraction of
acoustic embeddings from several self-supervised neural models. We use these
representations to compute word-based pronunciation differences between
non-native and native speakers of English, and between Norwegian dialect
speakers. For comparison with several earlier studies, we evaluate how well
these differences match human perception by comparing them with available human
judgements of similarity. We show that speech representations extracted from a
specific type of neural model (i.e. Transformers) lead to a better match with
human perception than two earlier approaches based on phonetic transcriptions
and MFCC-based acoustic features. We furthermore find that features from the
neural models are generally best extracted from one of the middle hidden layers
rather than from the final layer. We also demonstrate that neural
speech representations not only capture segmental differences, but also
intonational and durational differences that cannot adequately be represented
by a set of discrete symbols used in phonetic transcriptions.
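To make the pipeline concrete, the following is a minimal sketch (not the authors' released code) of the approach the abstract describes: extract frame-level representations from a middle hidden layer of a self-supervised Transformer (wav2vec 2.0 via HuggingFace is assumed here), compute a DTW-aligned distance between two recordings of the same word, and correlate such distances with human similarity judgements. The checkpoint name, layer index, and file paths are illustrative assumptions.

```python
# Minimal sketch of word-based pronunciation distances from a self-supervised model.
# Checkpoint, layer index, and file paths are assumptions for illustration.
import librosa
import numpy as np
import torch
from scipy.stats import pearsonr
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-large-960h"  # assumed checkpoint; any wav2vec 2.0 model works
LAYER = 10                                   # a middle hidden layer (assumption, cf. the abstract)

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def embed(wav_path: str, layer: int = LAYER) -> np.ndarray:
    """Return frame-level features (frames x dim) from one hidden layer."""
    audio, _ = librosa.load(wav_path, sr=16000)
    inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0).numpy()

def word_distance(path_a: str, path_b: str) -> float:
    """Length-normalised DTW alignment cost between two recordings of the same word."""
    a, b = embed(path_a), embed(path_b)
    cost, wp = librosa.sequence.dtw(X=a.T, Y=b.T, metric="euclidean")
    return cost[-1, -1] / len(wp)

# Evaluation against human judgements (paths and ratings are placeholders):
# pairs = [("speaker1_word.wav", "reference_word.wav"), ...]
# human = [...]                      # perceptual similarity/accent ratings per pair
# model_d = [word_distance(a, b) for a, b in pairs]
# r, p = pearsonr(model_d, human)    # how well the distances match perception
```

Averaging word_distance over a set of words per speaker yields a speaker-level pronunciation difference that can then be compared against perceptual ratings, as in the evaluation described in the abstract.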
Related papers
- Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0 [0.11510009152620666]
We study how Wav2Vec2 resolves phonotactic constraints.
We synthesize sounds on an acoustic continuum between /l/ and /r/ and embed them in controlled contexts.
Like humans, Wav2Vec2 models show a bias towards the phonotactically admissible category in processing such ambiguous sounds.
arXiv Detail & Related papers (2024-07-03T11:04:31Z) - Perception of Phonological Assimilation by Neural Speech Recognition Models [3.4173734484549625]
This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds.
Using psycholinguistic stimuli, we analyze how various linguistic context cues influence compensation patterns in the model's output.
arXiv Detail & Related papers (2024-06-21T15:58:22Z) - Establishing degrees of closeness between audio recordings along
different dimensions using large-scale cross-lingual models [4.349838917565205]
We propose a new unsupervised method using ABX tests on audio recordings with carefully curated metadata.
Three experiments are devised: one on room acoustics aspects, one on linguistic genre, and one on phonetic aspects.
The results confirm that the representations extracted from recordings with different linguistic/extra-linguistic characteristics differ along the same lines.
arXiv Detail & Related papers (2024-02-08T11:31:23Z) - Agentività e telicità in GilBERTo: implicazioni cognitive (Agentivity and telicity in GilBERTo: cognitive implications) [77.71680953280436]
The goal of this study is to investigate whether a Transformer-based neural language model infers lexical semantics.
The semantic properties considered are telicity (also combined with definiteness) and agentivity.
arXiv Detail & Related papers (2023-07-06T10:52:22Z) - Toward a realistic model of speech processing in the brain with
self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z) - Self-supervised models of audio effectively explain human cortical
responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z) - Quantifying Language Variation Acoustically with Few Resources [4.162663632560141]
Deep acoustic models might have learned linguistic information that transfers to low-resource languages.
We compute pairwise pronunciation differences averaged over 10 words for over 100 individual dialects from four (regional) languages.
Our results show that acoustic models outperform the (traditional) transcription-based approach without requiring phonetic transcriptions.
arXiv Detail & Related papers (2022-05-05T15:00:56Z) - How Familiar Does That Sound? Cross-Lingual Representational Similarity
Analysis of Acoustic Word Embeddings [12.788276426899312]
We present a novel design based on representational similarity analysis (RSA) to analyze acoustic word embeddings (AWEs).
First, we train monolingual AWE models on seven Indo-European languages with various degrees of typological similarity.
We then employ RSA to quantify the cross-lingual similarity by simulating native and non-native spoken-word processing using AWEs (a minimal RSA sketch is given after this list).
arXiv Detail & Related papers (2021-09-21T13:51:39Z) - Preliminary study on using vector quantization latent spaces for TTS/VC
systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent space during training, we are able to obtain such a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization shows only a small degradation in perceptual evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - Mechanisms for Handling Nested Dependencies in Neural-Network Language
Models and Humans [75.15855405318855]
We studied whether a modern artificial neural network trained with "deep learning" methods mimics a central aspect of human sentence processing.
Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of specialized units that successfully handled local and long-distance syntactic agreement.
We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns.
arXiv Detail & Related papers (2020-06-19T12:00:05Z)
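As referenced in the cross-lingual AWE entry above, the following is a minimal sketch of representational similarity analysis (RSA) over acoustic word embeddings: build a representational dissimilarity matrix per language over a shared word list, then correlate the matrices. The embedding matrices and language names are placeholder assumptions, not data or code from that paper.

```python
# Minimal RSA sketch over acoustic word embeddings (AWEs); inputs are placeholders.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(embeddings: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix: pairwise cosine distances over one
    word list, returned as the condensed upper triangle."""
    return pdist(embeddings, metric="cosine")

def rsa_similarity(emb_lang_a: np.ndarray, emb_lang_b: np.ndarray) -> float:
    """Spearman correlation between two RDMs computed over the same word list;
    higher values indicate more similar representational geometry."""
    rho, _ = spearmanr(rdm(emb_lang_a), rdm(emb_lang_b))
    return rho

# Example with random stand-in embeddings (100 shared words, 256-dim AWEs):
rng = np.random.default_rng(0)
awe_english = rng.normal(size=(100, 256))
awe_german = rng.normal(size=(100, 256))
print(rsa_similarity(awe_english, awe_german))
```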