Improving the fusion of acoustic and text representations in RNN-T
- URL: http://arxiv.org/abs/2201.10240v1
- Date: Tue, 25 Jan 2022 11:20:50 GMT
- Title: Improving the fusion of acoustic and text representations in RNN-T
- Authors: Chao Zhang, Bo Li, Zhiyun Lu, Tara N. Sainath and Shuo-yiin Chang
- Abstract summary: We propose to use gating, bilinear pooling, and a combination of them in the joint network to produce more expressive representations.
We show that the joint use of the proposed methods can result in 4%--5% relative word error rate reductions with only a few million extra parameters.
- Score: 35.43599666228086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recurrent neural network transducer (RNN-T) has recently become the
mainstream end-to-end approach for streaming automatic speech recognition
(ASR). To estimate the output distributions over subword units, RNN-T uses a
fully connected layer as the joint network to fuse the acoustic representations
extracted using the acoustic encoder with the text representations obtained
using the prediction network based on the previous subword units. In this
paper, we propose to use gating, bilinear pooling, and a combination of them in
the joint network to produce more expressive representations to feed into the
output layer. A regularisation method is also proposed to enable better
acoustic encoder training by reducing the gradients back-propagated into the
prediction network at the beginning of RNN-T training. Experimental results on
a multilingual ASR setting for voice search over nine languages show that the
joint use of the proposed methods can result in 4%--5% relative word error rate
reductions with only a few million extra parameters.
Related papers
- Improving RNN-Transducers with Acoustic LookAhead [32.19475947986392]
RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-end model for speech to text conversion.
We propose LookAhead that makes text representations more acoustically grounded by looking ahead into the future.
arXiv Detail & Related papers (2023-07-11T03:57:00Z) - Return of the RNN: Residual Recurrent Networks for Invertible Sentence
Embeddings [0.0]
This study presents a novel model for invertible sentence embeddings using a residual recurrent network trained on an unsupervised encoding task.
Rather than the probabilistic outputs common to neural machine translation models, our approach employs a regression-based output layer to reconstruct the input sequence's word vectors.
The model achieves high accuracy and fast training with the ADAM, a significant finding given that RNNs typically require memory units, such as LSTMs, or second-order optimization methods.
arXiv Detail & Related papers (2023-03-23T15:59:06Z) - Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique is able to reduce both the WER and the average last token emission latency by more than 6% and 40ms relative.
arXiv Detail & Related papers (2022-10-27T08:10:44Z) - VQ-T: RNN Transducers using Vector-Quantized Prediction Network States [52.48566999668521]
We propose to use vector-quantized long short-term memory units in the prediction network of RNN transducers.
By training the discrete representation jointly with the ASR network, hypotheses can be actively merged for lattice generation.
Our experiments on the Switchboard corpus show that the proposed VQ RNN transducers improve ASR performance over transducers with regular prediction networks.
arXiv Detail & Related papers (2022-08-03T02:45:52Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - On Addressing Practical Challenges for RNN-Transduce [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time stamping method can get less than 50ms word timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z) - Deep Networks for Direction-of-Arrival Estimation in Low SNR [89.45026632977456]
We introduce a Convolutional Neural Network (CNN) that is trained from mutli-channel data of the true array manifold matrix.
We train a CNN in the low-SNR regime to predict DoAs across all SNRs.
Our robust solution can be applied in several fields, ranging from wireless array sensors to acoustic microphones or sonars.
arXiv Detail & Related papers (2020-11-17T12:52:18Z) - Multitask Learning and Joint Optimization for Transformer-RNN-Transducer
Speech Recognition [13.198689566654107]
This paper explores multitask learning, joint optimization, and joint decoding methods for transformer-RNN-transducer systems.
We show that the proposed methods can reduce word error rate (WER) by 16.6 % and 13.3 % for test-clean and test-other datasets, respectively.
arXiv Detail & Related papers (2020-11-02T06:38:06Z) - Exploring Pre-training with Alignments for RNN Transducer based
End-to-End Speech Recognition [39.497407288772386]
recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research.
In this work, we leverage external alignments to seed the RNN-T model.
Two different pre-training solutions are explored, referred to as encoder pre-training, and whole-network pre-training respectively.
arXiv Detail & Related papers (2020-05-01T19:00:57Z) - End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
arXiv Detail & Related papers (2020-02-10T16:29:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.