BCN2BRNO: ASR System Fusion for Albayzin 2020 Speech to Text Challenge
- URL: http://arxiv.org/abs/2101.12729v1
- Date: Fri, 29 Jan 2021 18:40:54 GMT
- Title: BCN2BRNO: ASR System Fusion for Albayzin 2020 Speech to Text Challenge
- Authors: Martin Kocour, Guillermo Cámbara, Jordi Luque, David Bonet, Mireia
Farrús, Martin Karafiát, Karel Veselý and Jan "Honza" Černocký
- Abstract summary: This paper describes the joint effort of BUT and Telefónica Research on the development of Automatic Speech Recognition systems.
We compare approaches based on either hybrid or end-to-end models.
A fusion of our best systems achieved 23.33% WER in official Albayzin 2020 evaluations.
- Score: 2.675158177232256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes the joint effort of BUT and Telefónica Research
on the development of Automatic Speech Recognition systems for the Albayzin 2020
Challenge. We compare approaches based on either hybrid or end-to-end models.
In hybrid modelling, we explore the impact of SpecAugment layer on performance.
For end-to-end modelling, we used a convolutional neural network with gated
linear units (GLUs). The performance of this model is also evaluated with an
additional n-gram language model to improve word error rates. We further
inspect source separation methods to extract speech from noisy environments
(i.e., TV shows). More precisely, we assess the effect of using a neural
music separator named Demucs. A fusion of our best systems achieved 23.33% WER
in official Albayzin 2020 evaluations. Aside from techniques used in our final
submitted systems, we also describe our efforts in retrieving high quality
transcripts for training.
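The abstract mentions exploring a SpecAugment layer in hybrid modelling. As a minimal sketch of the general SpecAugment idea (the mask widths, the zero fill value, and the single-mask setup here are illustrative assumptions, not the authors' configuration):

```python
import numpy as np

def spec_augment(spec, max_freq_mask=8, max_time_mask=10, rng=None):
    """Apply one frequency mask and one time mask to a (freq, time) spectrogram.

    Mask widths are drawn uniformly from [0, max_*_mask]; masked bins are
    zeroed. A copy is returned so the input spectrogram is left untouched.
    """
    rng = np.random.default_rng(rng)
    out = spec.copy()
    n_freq, n_time = out.shape

    f = int(rng.integers(0, max_freq_mask + 1))   # frequency-mask width
    f0 = int(rng.integers(0, n_freq - f + 1))     # frequency-mask start
    out[f0:f0 + f, :] = 0.0

    t = int(rng.integers(0, max_time_mask + 1))   # time-mask width
    t0 = int(rng.integers(0, n_time - t + 1))     # time-mask start
    out[:, t0:t0 + t] = 0.0
    return out

# Toy 80-mel x 200-frame "spectrogram" of ones, for illustration only.
spec = np.ones((80, 200))
masked = spec_augment(spec, rng=0)
```

In practice the masking is applied on the fly during training, with fresh random masks per utterance, so the model never sees the same corruption twice.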
Related papers
- From Modular to End-to-End Speaker Diarization [3.079020586262228]
We describe a system based on a Bayesian hidden Markov model used to cluster x-vectors (speaker embeddings obtained with a neural network), known as VBx.
We describe an approach for generating synthetic data which resembles real conversations in terms of speaker turns and overlaps.
We show how this method of generating "simulated conversations" allows for better performance than a previously proposed method for creating "simulated mixtures" when training the popular EEND.
arXiv Detail & Related papers (2024-06-27T15:09:39Z) - Learning from Models and Data for Visual Grounding [55.21937116752679]
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models.
We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention objective.
The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
arXiv Detail & Related papers (2024-03-20T17:59:43Z) - Improved Contextual Recognition In Automatic Speech Recognition Systems
By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - Dialogue-Contextualized Re-ranking for Medical History-Taking [5.039849340960835]
We present a two-stage re-ranking approach that helps close the training-inference gap by re-ranking the first-stage question candidates.
We find that relative to the expert system, the best performance is achieved by our proposed global re-ranker with a transformer backbone.
arXiv Detail & Related papers (2023-04-04T17:31:32Z) - Conformer-based Hybrid ASR System for Switchboard Dataset [99.88988282353206]
We present and evaluate a competitive conformer-based hybrid model training recipe.
We study different training aspects and methods to improve word-error-rate as well as to increase training speed.
We conduct experiments on Switchboard 300h dataset and our conformer-based hybrid model achieves competitive results.
arXiv Detail & Related papers (2021-11-05T12:03:18Z) - Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z) - The USYD-JD Speech Translation System for IWSLT 2021 [85.64797317290349]
This paper describes the University of Sydney and JD's joint submission to the IWSLT 2021 low-resource speech translation task.
We trained our models with the officially provided ASR and MT datasets.
To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking and transductive finetuning.
arXiv Detail & Related papers (2021-07-24T09:53:34Z) - You Do Not Need More Data: Improving End-To-End Speech Recognition by
Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z) - Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner
Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on RNN-Transducer together with improved beam search, reaches quality only 3.8% WER (abs.) worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
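Several entries above, like the main paper's n-gram rescoring and the factorized neural Transducer, rest on the same mechanism: combining an acoustic-model score with a separately trained language-model score, so that a better (or in-domain) LM can be swapped in without retraining the recognizer. A toy log-linear (shallow fusion) sketch follows; the hypotheses, probabilities, and weight are invented for illustration and come from none of the papers listed:

```python
import math

def rescore(hypotheses, lm_logprob, lm_weight=0.5):
    """Pick the hypothesis maximizing AM log-prob + lm_weight * LM log-prob.

    hypotheses: list of (text, acoustic_logprob) pairs.
    lm_logprob: dict mapping text -> language-model log-probability.
    """
    return max(hypotheses, key=lambda h: h[1] + lm_weight * lm_logprob[h[0]])

# Two acoustically similar hypotheses; the LM strongly prefers the first.
hyps = [("recognize speech", math.log(0.50)),
        ("wreck a nice beach", math.log(0.45))]
lm = {"recognize speech": math.log(0.2),
      "wreck a nice beach": math.log(0.001)}

best = rescore(hyps, lm)  # -> ("recognize speech", ...)
```

The `lm_weight` knob is exactly what gets tuned on a development set; setting it to zero recovers pure acoustic decoding.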
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.