Domain Specific Wav2vec 2.0 Fine-tuning For The SE&R 2022 Challenge
- URL: http://arxiv.org/abs/2207.14418v1
- Date: Fri, 29 Jul 2022 00:48:40 GMT
- Title: Domain Specific Wav2vec 2.0 Fine-tuning For The SE&R 2022 Challenge
- Authors: Alef Iury Siqueira Ferreira and Gustavo dos Reis Oliveira
- Abstract summary: This paper presents our efforts to build a robust ASR model for the shared task Automatic Speech Recognition for spontaneous and prepared speech & Speech Emotion Recognition in Portuguese (SE&R 2022).
The goal of the challenge is to advance the ASR research for the Portuguese language, considering prepared and spontaneous speech in different dialects.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents our efforts to build a robust ASR model for the shared
task Automatic Speech Recognition for spontaneous and prepared speech & Speech
Emotion Recognition in Portuguese (SE&R 2022). The goal of the challenge is to
advance the ASR research for the Portuguese language, considering prepared and
spontaneous speech in different dialects. Our method consists of fine-tuning an
ASR model in a domain-specific approach, applying gain normalization and
selective noise insertion. The proposed method improved over the strong
baseline provided on the test set in 3 of the 4 tracks available.
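
The abstract only names the preprocessing steps, so the snippet below is a minimal sketch of how gain normalization and selective noise insertion could be applied to waveforms before fine-tuning; the target level, insertion probability, and SNR values are illustrative assumptions, not numbers taken from the paper.

```python
import numpy as np

def gain_normalize(wave, target_dbfs=-27.0):
    """Scale a mono waveform so its RMS level sits at a target dBFS (assumed value)."""
    rms = np.sqrt(np.mean(wave ** 2)) + 1e-9
    gain = 10.0 ** ((target_dbfs - 20.0 * np.log10(rms)) / 20.0)
    return np.clip(wave * gain, -1.0, 1.0)

def selective_noise_insertion(wave, noise, insert_prob=0.5, snr_db=15.0, rng=None):
    """Mix background noise into only a fraction of utterances, at a chosen SNR (illustrative parameters)."""
    rng = rng or np.random.default_rng()
    if rng.random() > insert_prob:
        return wave  # leave this utterance clean
    # Tile or truncate the noise clip to match the utterance length.
    reps = int(np.ceil(len(wave) / len(noise)))
    noise = np.tile(noise, reps)[: len(wave)]
    speech_power = np.mean(wave ** 2) + 1e-9
    noise_power = np.mean(noise ** 2) + 1e-9
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return np.clip(wave + scale * noise, -1.0, 1.0)

# Example: preprocess one utterance before it enters wav2vec 2.0 fine-tuning.
utterance = np.random.uniform(-0.1, 0.1, 16000).astype(np.float32)  # stand-in audio
noise_clip = np.random.uniform(-0.05, 0.05, 8000).astype(np.float32)  # stand-in noise
processed = selective_noise_insertion(gain_normalize(utterance), noise_clip)
```

Under this reading, the domain-specific part of the method would amount to fine-tuning a separate wav2vec 2.0 checkpoint per speech style (prepared vs. spontaneous), but the exact training recipe is not detailed in the abstract.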
Related papers
- Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword Recognition [26.693942793501204]
We propose a novel language-universal approach to end-to-end automatic spoken keyword recognition (SKR).
Wav2Vec2.0 is used to generate robust speech representations, followed by a linear output layer to produce attribute sequences.
A non-trainable pronunciation model then maps sequences of attributes into spoken keywords in a multilingual setting.
arXiv Detail & Related papers (2024-06-04T16:59:11Z) - ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana [0.0]
We propose a reliable ASR model for each target language by crawling speech corpora spanning diverse sources.
We show that the number of freeze fine-tuning updates and the dropout rate are more vital hyperparameters than the total number of epochs or the learning rate.
We openly release our best models; for two of the languages, Wa'ikhana and Kotiria, no other ASR model had been reported until now.
arXiv Detail & Related papers (2024-04-12T10:12:38Z) - Convoifilter: A case study of doing cocktail party speech recognition [59.80042864360884]
The model can decrease ASR's word error rate (WER) from 80% to 26.4% through this approach.
We openly share our pre-trained model to foster further research: hf.co/nguyenvulebinh/voice-filter.
arXiv Detail & Related papers (2023-08-22T12:09:30Z) - Transsion TSUP's speech recognition system for ASRU 2023 MADASR
Challenge [11.263392524468625]
The system focuses on adapting ASR models for low-resource Indian languages.
The proposed method achieved word error rates (WER) of 24.17%, 24.43%, 15.97%, and 15.97% for Bengali language in the four tracks, and WER of 19.61%, 19.54%, 15.48%, and 15.48% for Bhojpuri language in the four tracks.
arXiv Detail & Related papers (2023-07-20T00:55:01Z) - VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge [95.6159736804855]
The VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22) was held in conjunction with INTERSPEECH 2022.
The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild".
arXiv Detail & Related papers (2023-02-20T19:27:14Z) - From English to More Languages: Parameter-Efficient Model Reprogramming
for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - DUAL: Textless Spoken Question Answering with Speech Discrete Unit
Adaptive Learning [66.71308154398176]
Spoken Question Answering (SQA) has gained research attention and made remarkable progress in recent years.
Existing SQA methods rely on Automatic Speech Recognition (ASR) transcripts, which are time- and cost-prohibitive to collect.
This work proposes an ASR transcript-free SQA framework named Discrete Unit Adaptive Learning (DUAL), which leverages unlabeled data for pre-training and is fine-tuned by the SQA downstream task.
arXiv Detail & Related papers (2022-03-09T17:46:22Z) - Sentiment-Aware Automatic Speech Recognition pre-training for enhanced
Speech Emotion Recognition [11.760166084942908]
We propose a novel multi-task pre-training method for Speech Emotion Recognition (SER).
We pre-train the SER model simultaneously on Automatic Speech Recognition (ASR) and sentiment classification tasks.
We generate targets for the sentiment classification using a text-to-sentiment model trained on publicly available data.
arXiv Detail & Related papers (2022-01-27T22:20:28Z) - On Prosody Modeling for ASR+TTS based Voice Conversion [82.65378387724641]
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents.
Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.
We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP).
arXiv Detail & Related papers (2021-07-20T13:30:23Z) - The Sequence-to-Sequence Baseline for the Voice Conversion Challenge
2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
arXiv Detail & Related papers (2020-10-06T02:27:38Z)