Bayesian Neural Network Language Modeling for Speech Recognition
- URL: http://arxiv.org/abs/2208.13259v1
- Date: Sun, 28 Aug 2022 17:50:19 GMT
- Title: Bayesian Neural Network Language Modeling for Speech Recognition
- Authors: Boyang Xue, Shoukang Hu, Junhao Xu, Mengzhe Geng, Xunying Liu, Helen Meng
- Abstract summary: State-of-the-art neural network language models (NNLMs) represented by long short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
- Score: 59.681758762712754
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: State-of-the-art neural network language models (NNLMs) represented by long
short term memory recurrent neural networks (LSTM-RNNs) and Transformers are
becoming highly complex. They are prone to overfitting and poor generalization
when given limited training data. To this end, an overarching full Bayesian
learning framework encompassing three methods is proposed in this paper to
account for the underlying uncertainty in LSTM-RNN and Transformer LMs. The
uncertainty over their model parameters, choice of neural activations and
hidden output representations are modeled using Bayesian, Gaussian Process and
variational LSTM-RNN or Transformer LMs respectively. Efficient inference
approaches were used to automatically select the optimal network internal
components to be Bayesian learned using neural architecture search. A minimal
number of Monte Carlo parameter samples as low as one was also used. These
allow the computational costs incurred in Bayesian NNLM training and evaluation
to be minimized. Experiments are conducted on two tasks: AMI meeting
transcription and Oxford-BBC LipReading Sentences 2 (LRS2) overlapped speech
recognition using state-of-the-art LF-MMI trained factored TDNN systems
featuring data augmentation, speaker adaptation and audio-visual multi-channel
beamforming for overlapped speech. Consistent performance improvements over the
baseline LSTM-RNN and Transformer LMs with point estimated model parameters and
drop-out regularization were obtained across both tasks in terms of perplexity
and word error rate (WER). In particular, on the LRS2 data, statistically
significant WER reductions up to 1.3% and 1.2% absolute (12.1% and 11.3%
relative) were obtained over the baseline LSTM-RNN and Transformer LMs
respectively after model combination between Bayesian NNLMs and their
respective baselines.
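To illustrate what "Bayesian learned" parameters with a single Monte Carlo sample can look like, here is a minimal sketch of a variational Gaussian weight posterior for one linear unit. It is not the authors' implementation: the function names, the standard normal prior, the toy inputs and the use of plain Python lists are all assumptions made for illustration. Each weight is sampled as w = mu + sigma * eps with eps drawn from N(0, 1), and a KL term penalizes the posterior's divergence from the prior.

```python
import math
import random


def sample_weights(mu, log_sigma, rng):
    """Draw one weight sample w = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]


def kl_to_standard_normal(mu, log_sigma):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over independent weights."""
    return sum(0.5 * (math.exp(2.0 * s) + m * m - 1.0) - s
               for m, s in zip(mu, log_sigma))


def bayesian_linear(x, mu, log_sigma, rng, num_samples=1):
    """Average a single-output linear unit over sampled weight vectors."""
    total = 0.0
    for _ in range(num_samples):
        w = sample_weights(mu, log_sigma, rng)
        total += sum(wi * xi for wi, xi in zip(w, x))
    return total / num_samples


# Hypothetical toy values, chosen only to exercise the functions.
rng = random.Random(0)
mu = [0.5, -0.2, 0.1]
log_sigma = [-3.0, -3.0, -3.0]
x = [1.0, 2.0, 3.0]

y = bayesian_linear(x, mu, log_sigma, rng, num_samples=1)  # one MC sample
kl = kl_to_standard_normal(mu, log_sigma)  # regularizer added to the loss
```

With `num_samples=1` the forward pass costs about the same as a point-estimate layer, which is the kind of saving the abstract refers to; raising `num_samples` trades compute for a lower-variance estimate.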
Related papers
- Parameter estimation for WMTI-Watson model of white matter using encoder-decoder recurrent neural network [0.0]
In this study, we evaluate the performance of NLLS, the RNN-based method and a multilayer perceptron (MLP) on rat and human brain datasets.
We showed that the proposed RNN-based fitting approach had the advantage of highly reduced computation time over NLLS.
arXiv Detail & Related papers (2022-03-01T16:33:15Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
- Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers [67.688697838109]
This paper presents a novel method to train quantized RNNLMs from scratch using alternating direction methods of multipliers (ADMM).
Experiments on two tasks suggest the proposed ADMM quantization achieved a model size compression factor of up to 31 times over the full precision baseline RNNLMs.
arXiv Detail & Related papers (2021-11-29T09:30:06Z)
- Neural Calibration for Scalable Beamforming in FDD Massive MIMO with Implicit Channel Estimation [10.775558382613077]
Channel estimation and beamforming play critical roles in frequency-division duplexing (FDD) massive multiple-input multiple-output (MIMO) systems.
We propose a deep learning-based approach that directly optimizes the beamformers at the base station according to the received uplink pilots.
A neural calibration method is proposed to improve the scalability of the end-to-end design.
arXiv Detail & Related papers (2021-08-03T14:26:14Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Compressing LSTM Networks by Matrix Product Operators [7.395226141345625]
Long Short Term Memory (LSTM) models are the building blocks of many state-of-the-art natural language processing (NLP) and speech enhancement (SE) algorithms.
Here we introduce the MPO decomposition, which describes the local correlation of quantum states in quantum many-body physics.
We propose a matrix product operator (MPO) based neural network architecture to replace the LSTM model.
arXiv Detail & Related papers (2020-12-22T11:50:06Z)
- Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
arXiv Detail & Related papers (2020-07-17T08:32:11Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.