Brazilian Portuguese Speech Recognition Using Wav2vec 2.0
- URL: http://arxiv.org/abs/2107.11414v1
- Date: Fri, 23 Jul 2021 18:54:39 GMT
- Title: Brazilian Portuguese Speech Recognition Using Wav2vec 2.0
- Authors: Lucas Rafael Stefanel Gris, Edresson Casanova, Frederico Santos de
Oliveira, Anderson da Silva Soares, Arnaldo Candido Junior
- Abstract summary: This work presents the development of a public Automatic Speech Recognition system using only openly available audio data.
The final model presents a Word Error Rate of 11.95% (Common Voice dataset).
This corresponds to 13% less than the best open Automatic Speech Recognition model for Brazilian Portuguese available, to the best of our knowledge.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning techniques have been shown to be efficient in various
tasks, especially in the development of speech recognition systems, that is,
systems that aim to transcribe an audio sentence into a sequence of words.
Despite the progress in the area, speech recognition can still be considered
difficult, especially for languages lacking available data, such as Brazilian
Portuguese. In this sense, this work presents the development of a public
Automatic Speech Recognition system using only openly available audio data, by
fine-tuning the Wav2vec 2.0 XLSR-53 model, pre-trained on many languages, on
Brazilian Portuguese data. The final model achieves a Word Error Rate of
11.95% (Common Voice dataset). This corresponds to 13% less than the best open
Automatic Speech Recognition model for Brazilian Portuguese available, to the
best of our knowledge, which is a promising result for the language. In
general, this work validates the use of self-supervised learning techniques,
in particular the Wav2vec 2.0 architecture, in the development of robust
systems, even for languages with little available data.
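To make the fine-tuning setup concrete, the sketch below loads a wav2vec 2.0 CTC model with the HuggingFace transformers library and transcribes a Brazilian Portuguese clip via greedy CTC decoding. The checkpoint name and audio path are hypothetical placeholders, since this listing does not name the released model; any Wav2Vec2 checkpoint fine-tuned for Brazilian Portuguese (i.e., one derived from XLSR-53 as the paper describes) would be substituted.

```python
# Minimal inference sketch, assuming the HuggingFace `transformers`,
# `torch`, and `librosa` packages. MODEL_ID and the audio file are
# hypothetical placeholders, not the paper's released artifacts.
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Hypothetical name for an XLSR-53 model fine-tuned on Brazilian Portuguese.
MODEL_ID = "someuser/wav2vec2-xlsr-53-pt-br"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.eval()

# wav2vec 2.0 expects 16 kHz mono input.
speech, _ = librosa.load("exemplo.wav", sr=16_000, mono=True)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: per-frame argmax; the tokenizer collapses repeated
# tokens and removes CTC blanks during decoding.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```

A Word Error Rate figure like the reported 11.95% on Common Voice could then be checked with a standard scorer, e.g. jiwer.wer(references, hypotheses) from the jiwer package.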
Related papers
- XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z) - Multilingual self-supervised speech representations improve the speech
recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, finetuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredients are a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech
Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
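A minimal sketch of this concatenation augmentation, under the assumption that artificial code-switched training pairs are built simply by joining two monolingual utterances and their transcripts (file names and transcripts below are illustrative placeholders):

```python
# Concatenation-based code-switching augmentation: join utterances from
# two source languages into one artificial code-switched training pair.
import numpy as np
import librosa

def make_cs_example(path_a, text_a, path_b, text_b, sr=16_000):
    """Concatenate two monolingual utterances into one training pair."""
    audio_a, _ = librosa.load(path_a, sr=sr, mono=True)
    audio_b, _ = librosa.load(path_b, sr=sr, mono=True)
    audio = np.concatenate([audio_a, audio_b])   # joined waveform
    text = f"{text_a} {text_b}"                  # joined label sequence
    return audio, text

# Illustrative usage with placeholder files and transcripts.
audio, text = make_cs_example(
    "english_utt.wav", "good morning",
    "german_utt.wav", "wie geht es dir",
)
```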
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - Large vocabulary speech recognition for languages of Africa:
multilingual modeling and self-supervised learning [11.408563104045285]
Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems.
We have experimented with two techniques which may provide pathways to large vocabulary speech recognition for African languages.
arXiv Detail & Related papers (2022-08-05T09:54:19Z) - Adaptive multilingual speech recognition with pretrained models [24.01587237432548]
We investigate the effectiveness of two pretrained models for two modalities: wav2vec 2.0 for audio and MBART50 for text.
Overall, we noticed a 44% improvement over purely supervised learning.
arXiv Detail & Related papers (2022-05-24T18:29:07Z) - Code Switched and Code Mixed Speech Recognition for Indic languages [0.0]
Training multilingual automatic speech recognition (ASR) systems is challenging because acoustic and lexical information is typically language specific.
We compare the performance of an end-to-end multilingual speech recognition system to the performance of monolingual models conditioned on language identification (LID).
We also propose a similar technique to solve the Code Switched problem and achieve a WER of 21.77 and 28.27 over Hindi-English and Bengali-English, respectively.
arXiv Detail & Related papers (2022-03-30T18:09:28Z) - Improved Language Identification Through Cross-Lingual Self-Supervised
Learning [37.32193095549614]
We extend previous self-supervised work on language identification by experimenting with pre-trained models.
Results on a 25-language setup show that with only 10 minutes of labeled data per language, a cross-lingually pre-trained model can achieve over 93% accuracy.
arXiv Detail & Related papers (2021-07-08T19:37:06Z) - Applying Wav2vec2.0 to Speech Recognition in Various Low-resource
Languages [16.001329145018687]
In the speech domain, wav2vec2.0 is starting to show its powerful representation ability and the feasibility of ultra-low-resource speech recognition on the Librispeech corpus.
However, wav2vec2.0 has not been examined in real spoken scenarios and languages other than English.
We apply pre-trained models to solve low-resource speech recognition tasks in various spoken languages.
arXiv Detail & Related papers (2020-12-22T15:59:44Z) - Universal Phone Recognition with a Multilingual Allophone System [135.2254086165086]
We propose a joint model of language-independent phone and language-dependent phoneme distributions.
In multilingual ASR experiments over 11 languages, we find that this model improves testing performance by 2% phoneme error rate absolute.
Our recognizer achieves phone accuracy improvements of more than 17%, moving a step closer to speech recognition for all languages in the world.
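A rough sketch of the idea, assuming the allophone layer maps universal (language-independent) phone scores to language-dependent phoneme scores by max-pooling over each phoneme's allophones; the matrix and dimensions below are illustrative, not taken from the paper:

```python
# Toy allophone layer: phoneme score = max over the scores of the
# universal phones that realize it. All shapes/values are illustrative.
import numpy as np

n_phones, n_phonemes = 5, 3
phone_logits = np.random.randn(10, n_phones)    # (frames, universal phones)

# allophone[i, j] = 1 if universal phone j is an allophone of phoneme i.
allophone = np.zeros((n_phonemes, n_phones))
allophone[0, [0, 1]] = 1    # phoneme 0 realized as phone 0 or 1
allophone[1, [2]] = 1
allophone[2, [3, 4]] = 1

# Mask out non-allophones, then max-pool per phoneme, per frame.
masked = np.where(allophone[None, :, :] > 0,
                  phone_logits[:, None, :], -np.inf)
phoneme_logits = masked.max(axis=-1)            # (frames, phonemes)
print(phoneme_logits.shape)
```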
arXiv Detail & Related papers (2020-02-26T21:28:57Z)