Related papers: DNN-Based Semantic Model for Rescoring N-best Speech Recognition List

DNN-Based Semantic Model for Rescoring N-best Speech Recognition List

URL: http://arxiv.org/abs/2011.00975v1
Date: Mon, 2 Nov 2020 13:50:59 GMT
Title: DNN-Based Semantic Model for Rescoring N-best Speech Recognition List
Authors: Dominique Fohr, Irina Illina
Abstract summary: The word error rate (WER) of an automatic speech recognition (ASR) system increases when a mismatch occurs between the training and the testing conditions due to the noise, etc. This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features.
Score: 8.934497552812012
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The word error rate (WER) of an automatic speech recognition (ASR) system increases when a mismatch occurs between the training and the testing conditions due to the noise, etc. In this case, the acoustic information can be less reliable. This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features. We propose to perform this through rescoring of the ASR N-best hypotheses list. To achieve this, we train a deep neural network (DNN). Our DNN rescoring model is aimed at selecting hypotheses that have better semantic consistency and therefore lower WER. We investigate two types of representations as part of input features to our DNN model: static word embeddings (from word2vec) and dynamic contextual embeddings (from BERT). Acoustic and linguistic features are also included. We perform experiments on the publicly available dataset TED-LIUM mixed with real noise. The proposed rescoring approaches give significant improvement of the WER over the ASR system without rescoring models in two noisy conditions and with n-gram and RNNLM.

Related papers

HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses. LLMs with reasonable prompt and its generative capability can even correct those tokens that are missing in N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
Streaming Speech-to-Confusion Network Speech Recognition [19.720334657478475]
We present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency. We show that 1-best results of our model are on par with a comparable RNN-T system. We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.
arXiv Detail & Related papers (2023-06-02T20:28:14Z)
Prediction of speech intelligibility with DNN-based performance measures [9.883633991083789]
This paper presents a speech intelligibility model based on automatic speech recognition (ASR) It combines phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these probabilities. The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.
arXiv Detail & Related papers (2022-03-17T08:05:38Z)
Mitigating Closed-model Adversarial Examples with Bayesian Neural Modeling for Enhanced End-to-End Speech Recognition [18.83748866242237]
We focus on a rigorous and empirical "closed-model adversarial robustness" setting. We propose an advanced Bayesian neural network (BNN) based adversarial detector. We improve detection rate by +2.77 to +5.42% (relative +3.03 to +6.26%) and reduce the word error rate by 5.02 to 7.47% on LibriSpeech datasets.
arXiv Detail & Related papers (2022-02-17T09:17:58Z)
CS-Rep: Making Speaker Verification Networks Embracing Re-parameterization [27.38202134344989]
This study proposes cross-sequential re- parameterization (CS-Rep) to increase the inference speed and verification accuracy of models. Rep-TDNN increases the actual inference speed by about 50% and reduces the EER by 10%.
arXiv Detail & Related papers (2021-10-26T08:00:03Z)
Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning. To make full use of the training data, we propose a full data learning method for speech enhancement.
arXiv Detail & Related papers (2020-11-11T06:32:37Z)
Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_At$, to enhance adversarial speech signals. We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model. The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses. A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions. Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks. This paper presents approaches aimed to achieve two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise, reverberation and b) reduce the system qualitydegradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU) We show that the error rates of off the shelf ASR and following LU systems can be reduced significantly by 14% relative with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.