From "Snippet-lects" to Doculects and Dialects: Leveraging Neural Representations of Speech for Placing Audio Signals in a Language Landscape
- URL: http://arxiv.org/abs/2305.18602v1
- Date: Mon, 29 May 2023 20:37:06 GMT
- Title: From "Snippet-lects" to Doculects and Dialects: Leveraging Neural Representations of Speech for Placing Audio Signals in a Language Landscape
- Authors: Séverine Guillaume, Guillaume Wisniewski, and Alexis Michaud
- Abstract summary: XLSR-53, a multilingual model of speech, builds a vector representation from audio.
We use max-pooling to aggregate the neural representations from a "snippet-lect" to a "doculect".
Similarity measurements between the 11 corpora bring out the greatest closeness between those that are known to be dialects of the same language.
- Score: 3.96673286245683
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: XLSR-53, a multilingual model of speech, builds a vector representation from
audio, which allows for a range of computational treatments. The experiments
reported here use this neural representation to estimate the degree of
closeness between audio files, ultimately aiming to extract relevant linguistic
properties. We use max-pooling to aggregate the neural representations from a
"snippet-lect" (the speech in a 5-second audio snippet) to a "doculect" (the
speech in a given resource), then to dialects and languages. We use data from
corpora of 11 dialects belonging to 5 less-studied languages. Similarity
measurements between the 11 corpora bring out the greatest closeness between those
that are known to be dialects of the same language. The findings suggest that
(i) dialect/language can emerge among the various parameters characterizing
audio files and (ii) estimates of overall phonetic/phonological closeness can
be obtained for a little-resourced or fully unknown language. The findings help
shed light on the type of information captured by neural representations of
speech and how it can be extracted from these representations.
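For readers who want to reproduce the gist of this pipeline, here is a minimal sketch (not the authors' code). It assumes the Hugging Face checkpoint facebook/wav2vec2-large-xlsr-53, mean-pools frames within each 5-second snippet, and reads the last hidden layer; only the snippet-to-doculect max-pooling and the use of XLSR-53 are stated in the abstract, the rest are illustrative choices.

```python
# Minimal sketch of the pipeline described above (not the authors' code).
# Assumed: the "facebook/wav2vec2-large-xlsr-53" checkpoint, mean-pooling of
# frames within a snippet, and the last hidden layer as the representation.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL = "facebook/wav2vec2-large-xlsr-53"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL)
model = Wav2Vec2Model.from_pretrained(MODEL).eval()

def snippet_vectors(path, snippet_s=5, sr=16000):
    """Yield one "snippet-lect" vector per 5-second snippet of an audio file."""
    wav, orig_sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, orig_sr, sr).mean(dim=0)  # mono, 16 kHz
    step = snippet_s * sr
    for start in range(0, len(wav) - step + 1, step):
        inputs = extractor(wav[start:start + step].numpy(),
                           sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            frames = model(**inputs).last_hidden_state[0]  # (num_frames, 1024)
        yield frames.mean(dim=0)  # frame -> snippet-lect (pooling choice assumed)

def doculect_vector(path):
    """Max-pool snippet-lect vectors into a single "doculect" vector."""
    return torch.stack(list(snippet_vectors(path))).max(dim=0).values

# Closeness between two recordings (e.g., two corpora) as cosine similarity:
# sim = torch.nn.functional.cosine_similarity(
#     doculect_vector("corpus_a.wav"), doculect_vector("corpus_b.wav"), dim=0)
```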
Related papers
- Literary and Colloquial Dialect Identification for Tamil using Acoustic Features [0.0]
Speech technology plays a role in keeping the various dialects of a language from going extinct.
The current work proposes a way to identify two popular and broadly classified Tamil dialects.
arXiv Detail & Related papers (2024-08-27T09:00:27Z)
- Establishing degrees of closeness between audio recordings along different dimensions using large-scale cross-lingual models [4.349838917565205]
We propose a new unsupervised method using ABX tests on audio recordings with carefully curated metadata.
Three experiments are devised: one on room acoustics aspects, one on linguistic genre, and one on phonetic aspects.
The results confirm that the representations extracted from recordings with different linguistic/extra-linguistic characteristics differ along the same lines.
arXiv Detail & Related papers (2024-02-08T11:31:23Z) - Speech language models lack important brain-relevant semantics [6.626540321463248]
- Speech language models lack important brain-relevant semantics [6.626540321463248]
Recent work has shown that text-based language models predict both text-evoked and speech-evoked brain activity to an impressive degree.
This poses the question of what types of information language models truly predict in the brain.
arXiv Detail & Related papers (2023-11-08T13:11:48Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model on unlabelled data.
Our experiments show that multilingual models trained on more data outperform monolingual ones, but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language using just one speech utterance in the source language as a prompt, while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
- Toward a realistic model of speech processing in the brain with self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate for modelling speech processing in the brain.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z)
- Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect.
Dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect.
We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
arXiv Detail & Related papers (2022-04-03T01:34:48Z)
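Taken at face value, the definition of dialect density quoted above is a simple ratio. The sketch below assumes word-level dialect-feature annotations are already available; the paper itself estimates the measure automatically from speech.

```python
# Dialect density as defined above: the share of words in an utterance that
# carry features of the non-standard dialect. Assumes word-level annotations.
def dialect_density(word_has_dialect_feature):
    """word_has_dialect_feature: list of booleans, one per word in the utterance."""
    if not word_has_dialect_feature:
        return 0.0
    return sum(word_has_dialect_feature) / len(word_has_dialect_feature)

# Example: 3 of 8 words show a non-standard feature -> density 0.375
print(dialect_density([True, False, False, True, False, True, False, False]))
```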
- Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation [2.28438857884398]
We study the so-called latent language hypothesis (LLH).
LLH connects linguistic representation learning to general predictive processing within and across sensory modalities.
We explore LLH further in extensive learning simulations with different neural network models for audiovisual cross-situational learning.
arXiv Detail & Related papers (2021-09-29T05:49:46Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
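The "contrastive task over masked latent speech representations" mentioned in the entry above can be illustrated with a simplified InfoNCE-style loss. The sketch below omits wav2vec 2.0's quantization and masking machinery, and the temperature and distractor count are arbitrary choices, not the model's actual hyperparameters.

```python
# Simplified InfoNCE-style contrastive loss, in the spirit of wav2vec 2.0's
# objective: for each masked time step, the context vector should identify the
# true latent among distractors drawn from other time steps.
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, num_distractors=10, temperature=0.1):
    """context, targets: (T, D) tensors for the masked time steps."""
    T, _ = targets.shape
    losses = []
    for t in range(T):
        # Draw distractor latents from time steps other than t.
        others = torch.cat([torch.arange(0, t), torch.arange(t + 1, T)])
        distractor_idx = others[torch.randperm(len(others))[:num_distractors]]
        candidates = torch.cat([targets[t:t + 1], targets[distractor_idx]], dim=0)
        # Similarity of the context vector at t to the true latent and distractors.
        sims = F.cosine_similarity(context[t:t + 1], candidates) / temperature
        # Index 0 (the true latent) should win the softmax over candidates.
        losses.append(F.cross_entropy(sims.unsqueeze(0),
                                      torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

# Example: 12 masked steps with 256-dim latents
# loss = contrastive_loss(torch.randn(12, 256), torch.randn(12, 256))
```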
- CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning [11.552745999302905]
More than half of the 7,000 languages in the world are in imminent danger of going extinct.
It is relatively easy to obtain textual translations corresponding to speech.
We construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech.
arXiv Detail & Related papers (2020-06-04T12:21:48Z) - That Sounds Familiar: an Analysis of Phonetic Representations Transfer
Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)