Leveraging neural representations for facilitating access to
untranscribed speech from endangered languages
- URL: http://arxiv.org/abs/2103.14583v1
- Date: Fri, 26 Mar 2021 16:44:08 GMT
- Authors: Nay San, Martijn Bartelds, Mitchell Browne, Lily Clifford, Fiona
Gibson, John Mansfield, David Nash, Jane Simpson, Myfany Turpin, Maria
Vollmer, Sasha Wilmoth, Dan Jurafsky
- Abstract summary: We use data selected from 7 Australian Aboriginal languages and a regional variety of Dutch.
We find that representations from the middle layers of the wav2vec 2.0 Transformer offer large gains in task performance.
While features extracted using the pre-trained English model yielded improved detection on all the evaluation languages, better detection performance was associated with the evaluation language's phonological similarity to English.
- Score: 10.61744395262441
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For languages with insufficient resources to train speech recognition
systems, query-by-example spoken term detection (QbE-STD) offers a way of
accessing an untranscribed speech corpus by helping identify regions where
spoken query terms occur. Yet retrieval performance can be poor when the query
and corpus are spoken by different speakers and produced in different recording
conditions. Using data selected from a variety of speakers and recording
conditions from 7 Australian Aboriginal languages and a regional variety of
Dutch, all of which are endangered or vulnerable, we evaluated whether QbE-STD
performance on these languages could be improved by leveraging representations
extracted from the pre-trained English wav2vec 2.0 model. Compared to the use
of Mel-frequency cepstral coefficients and bottleneck features, we find that
representations from the middle layers of the wav2vec 2.0 Transformer offer
large gains in task performance (between 56% and 86%). While features extracted
using the pre-trained English model yielded improved detection on all the
evaluation languages, better detection performance was associated with the
evaluation language's phonological similarity to English.
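The detection step of the QbE-STD pipeline described in the abstract can be illustrated with a minimal sketch: subsequence dynamic time warping (DTW) over frame-level features, with cosine distance as the local cost. This is an illustrative toy, not the paper's actual codebase; the random arrays below stand in for real MFCC or wav2vec 2.0 layer features, and all function names are assumptions.

```python
import numpy as np

def cosine_dist_matrix(query, corpus):
    """Pairwise cosine distances between query frames (m, d) and corpus frames (n, d)."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return 1.0 - q @ c.T  # (m, n); 0 means identical direction

def qbe_std_score(query_feats, corpus_feats):
    """Subsequence DTW: lowest normalised cost of aligning the whole query
    against any region of the corpus. Lower score = stronger detection."""
    dist = cosine_dist_matrix(query_feats, corpus_feats)
    m, n = dist.shape
    acc = np.full((m, n), np.inf)
    acc[0, :] = dist[0, :]                 # the query may start at any corpus frame
    for i in range(1, m):
        acc[i, 0] = dist[i, 0] + acc[i - 1, 0]
        for j in range(1, n):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # insertion
                                         acc[i, j - 1],      # deletion
                                         acc[i - 1, j - 1])  # match
    return acc[-1, :].min() / m            # normalise by query length

# Toy demo: a corpus containing an exact copy of the query scores far better
# (lower) than a corpus of unrelated frames.
rng = np.random.default_rng(0)
query = rng.normal(size=(10, 8))
corpus_match = np.vstack([rng.normal(size=(20, 8)), query, rng.normal(size=(20, 8))])
corpus_nomatch = rng.normal(size=(50, 8))
score_match = qbe_std_score(query, corpus_match)
score_nomatch = qbe_std_score(query, corpus_nomatch)
```

In the paper's setting, swapping MFCCs for middle-layer wav2vec 2.0 features changes only what `query_feats` and `corpus_feats` contain; the scoring machinery stays the same, which is what makes the representation comparison clean.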
Related papers
- XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z)
- Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens [31.83988006684616]
We show that supplementing the target language with data from a similar, higher-resource 'donor' language can help.
For example, continued pre-training on only 10 hours of low-resource Punjabi supplemented with 60 hours of donor Hindi is almost as good as continued pretraining on 70 hours of Punjabi.
arXiv Detail & Related papers (2024-02-03T23:54:03Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, finetuning self-supervised representations is the better-performing and more viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggested that systems incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
- Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition [15.136348385992047]
This study explores the usefulness of using Wav2Vec self-supervised speech representations as features for training an ASR system for dysarthric speech.
We train an acoustic model with features extracted from Wav2Vec, Hubert, and the cross-lingual XLSR model.
Results suggest that speech representations pretrained on large unlabelled data can improve word error rate (WER) performance.
arXiv Detail & Related papers (2022-04-04T17:36:01Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge the digital divide.
Current cross-lingual algorithms have shown success in text-based and speech-related tasks for some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- Is Attention always needed? A Case Study on Language Identification from Speech [1.162918464251504]
The present study introduces convolutional recurrent neural network (CRNN) based LID.
CRNN based LID is designed to operate on the Mel-frequency Cepstral Coefficient (MFCC) characteristics of audio samples.
The LID model achieves high performance, ranging from 97% to 100%, for languages that are linguistically similar.
arXiv Detail & Related papers (2021-10-05T16:38:57Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages [16.001329145018687]
In the speech domain, wav2vec2.0 has begun to show its powerful representation ability, making ultra-low-resource speech recognition feasible on the Librispeech corpus.
However, wav2vec2.0 has not been examined on real spoken scenarios or on languages other than English.
We apply pre-trained models to solve low-resource speech recognition tasks in various spoken languages.
arXiv Detail & Related papers (2020-12-22T15:59:44Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
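The contrastive task over masked latent representations mentioned in the XLSR entry can be sketched as an InfoNCE-style loss: for each masked time step, the Transformer's context output must identify the true quantized latent among a set of distractors. The snippet below is a simplified, single-time-step illustration with random vectors standing in for model outputs; the temperature value and function name are assumptions, not the fairseq implementation.

```python
import numpy as np

def contrastive_loss(context, true_latent, distractors, temperature=0.1):
    """InfoNCE-style loss for one masked time step: the context vector
    should be most cosine-similar to the true quantized latent, which
    is placed at index 0 among the candidates."""
    cands = np.vstack([true_latent[None, :], distractors])          # (k + 1, d)
    sims = cands @ context / (
        np.linalg.norm(cands, axis=1) * np.linalg.norm(context))    # cosine sims
    logits = sims / temperature
    logits -= logits.max()                                          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())               # log-softmax
    return -log_probs[0]                                            # true target at index 0

# Toy demo: the loss is small when the true latent aligns with the context
# vector, and large when the true latent is just another random vector.
rng = np.random.default_rng(1)
d, k = 16, 10
context = rng.normal(size=d)
distractors = rng.normal(size=(k, d))
loss_aligned = contrastive_loss(context, context.copy(), distractors)
loss_random = contrastive_loss(context, rng.normal(size=d), distractors)
```

Pretraining on multiple languages, as XLSR does, changes only the data fed through this objective; the loss itself is language-agnostic, which is why the learned representations transfer across languages.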
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.