Acoustic characterization of speech rhythm: going beyond metrics with
recurrent neural networks
- URL: http://arxiv.org/abs/2401.14416v1
- Date: Mon, 22 Jan 2024 09:49:44 GMT
- Authors: François Deloche, Laurent Bonnasse-Gahot, Judit Gervain
- Abstract summary: We train a recurrent neural network on a language identification task over a large database of speech recordings in 21 languages.
The network was able to identify the language of 10-second recordings in 40% of the cases, and the language was in the top-3 guesses in two-thirds of the cases.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Languages have long been described according to their perceived rhythmic
attributes. The associated typologies are of interest in psycholinguistics as
they partly predict newborns' abilities to discriminate between languages and
provide insights into how adult listeners process non-native languages. Despite
the relative success of rhythm metrics in supporting the existence of
linguistic rhythmic classes, quantitative studies have yet to capture the full
complexity of temporal regularities associated with speech rhythm. We argue
that deep learning offers a powerful pattern-recognition approach to advance
the characterization of the acoustic bases of speech rhythm. To explore this
hypothesis, we trained a medium-sized recurrent neural network on a language
identification task over a large database of speech recordings in 21 languages.
The network had access to the amplitude envelopes and a variable identifying
the voiced segments, assuming that this signal would poorly convey phonetic
information but preserve prosodic features. The network was able to identify
the language of 10-second recordings in 40% of the cases, and the language was
in the top-3 guesses in two-thirds of the cases. Visualization methods show
that representations built from the network activations are consistent with
speech rhythm typologies, although the resulting maps are more complex than two
separated clusters between stress and syllable-timed languages. We further
analyzed the model by identifying correlations between network activations and
known speech rhythm metrics. The findings illustrate the potential of deep
learning tools to advance our understanding of speech rhythm through the
identification and exploration of linguistically relevant acoustic feature
spaces.
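The input representation described in the abstract (amplitude envelopes plus a variable identifying voiced segments) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the frame size, the rectify-and-average envelope, and the energy-threshold voicing proxy are all illustrative assumptions.

```python
import numpy as np

def envelope_features(wav, sr=16000, frame_ms=10):
    """Crude 2-channel prosodic features: amplitude envelope + voicing flag.

    Hypothetical sketch of the paper's input representation; the
    energy-threshold voicing proxy stands in for a proper voicing detector.
    """
    hop = int(sr * frame_ms / 1000)
    n_frames = len(wav) // hop
    frames = wav[: n_frames * hop].reshape(n_frames, hop)
    env = np.abs(frames).mean(axis=1)               # rectified, frame-averaged envelope
    voiced = (env > 0.1 * env.max()).astype(float)  # crude energy-based voicing proxy
    return np.stack([env, voiced], axis=1)          # (n_frames, 2): one sequence per recording

# Example: 10 s of a 200 Hz tone with one second of silence,
# yielding a 1000-step, 2-channel sequence an RNN could consume.
sr = 16000
t = np.arange(10 * sr) / sr
wav = np.sin(2 * np.pi * 200 * t)
wav[4 * sr : 5 * sr] = 0.0
feats = envelope_features(wav, sr)
```

A recurrent classifier over such sequences would then output one of the 21 language labels per 10-second recording; the point of the coarse features is that they discard most phonetic detail while retaining the temporal envelope structure associated with rhythm.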
Related papers
- Explaining Spectrograms in Machine Learning: A Study on Neural Networks for Speech Classification (2024-07-10)
  This study investigates discriminative patterns learned by neural networks for accurate speech classification.
  By examining the activations and features of neural networks for vowel classification, we gain insights into what the networks "see" in spectrograms.
- Toward a realistic model of speech processing in the brain with self-supervised learning (2022-06-03)
  Self-supervised algorithms trained on the raw waveform constitute a promising candidate.
  We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
- Self-Supervised Speech Representation Learning: A Review (2022-05-21)
  Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
  Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
  This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
- Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition (2022-04-01)
  This work uses a neural implementation of convolutive sparse matrix factorization to decompose articulatory data into interpretable gestures and gestural scores.
  Phoneme recognition experiments were additionally performed to show that gestural scores indeed successfully encode phonological information.
- Unsupervised Multimodal Word Discovery based on Double Articulation Analysis with Co-occurrence cues (2022-01-18)
  Human infants acquire their verbal lexicon with minimal prior knowledge of language.
  This study proposes a novel, fully unsupervised learning method for discovering speech units.
  The proposed method can acquire words and phonemes from speech signals using unsupervised learning.
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity (2021-11-02)
  Cross-lingual transfer offers a compelling way to help bridge the digital divide.
  Current cross-lingual algorithms have shown success in text-based tasks and in speech-related tasks over some low-resource languages.
  We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
- Deep Learning For Prominence Detection In Children's Read Speech (2021-10-27)
  We present a system that operates on segmented speech waveforms to learn features relevant to prominent word detection for children's oral fluency assessment.
  The chosen CRNN (convolutional recurrent neural network) framework, incorporating both word-level features and sequence information, is found to benefit from the perceptually motivated SincNet filters.
- Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks (2021-10-13)
  We compare and identify cognitive aspects of deep neural-network-based visual lip-reading models.
  We observe a strong correlation between theories in cognitive psychology and our modeling.
- Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation (2021-09-29)
  We study the so-called latent language hypothesis (LLH).
  LLH connects linguistic representation learning to general predictive processing within and across sensory modalities.
  We explore LLH further in extensive learning simulations with different neural network models for audiovisual cross-situational learning.
- Decomposing lexical and compositional syntax and semantics with deep language models (2021-03-02)
  The activations of language transformers like GPT-2 have been shown to map linearly onto brain activity during speech comprehension.
  Here, we propose a taxonomy to factorize the high-dimensional activations of language models into four classes: lexical, compositional, syntactic, and semantic representations.
  The results highlight two findings. First, compositional representations recruit a more widespread cortical network than lexical ones, encompassing the bilateral temporal, parietal, and prefrontal cortices.
- Generative Adversarial Phonology: Modeling unsupervised phonetic and phonological learning with neural networks (2020-06-06)
  Training deep neural networks on well-understood dependencies in speech data can provide new insights into how they learn internal representations.
  This paper argues that the acquisition of speech can be modeled as a dependency between a random space and generated speech data in the Generative Adversarial Network architecture.
  We propose a methodology to uncover the network's internal representations that correspond to phonetic and phonological properties.
This list is automatically generated from the titles and abstracts of the papers in this site.