Generalizing RNN-Transducer to Out-Domain Audio via Sparse
Self-Attention Layers
- URL: http://arxiv.org/abs/2108.10752v1
- Date: Sun, 22 Aug 2021 08:06:15 GMT
- Title: Generalizing RNN-Transducer to Out-Domain Audio via Sparse
Self-Attention Layers
- Authors: Juntae Kim, Jeehye Lee, Yoonhan Lee
- Abstract summary: Recurrent neural network transducers (RNN-T) are a promising end-to-end speech recognition framework.
The Conformer can effectively model local and global context information via its convolution and self-attention layers.
The domain mismatch problem for Conformer RNN-T has not yet been intensively investigated.
- Score: 7.025709586759655
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recurrent neural network transducers (RNN-T) are a promising end-to-end
speech recognition framework that transduces input acoustic frames into a
character sequence. The state-of-the-art encoder network for RNN-T is the
Conformer, which can effectively model local and global context information via
its convolution and self-attention layers. Although Conformer RNN-T has shown
outstanding performance (typically measured by word error rate, WER), most
studies have evaluated it only in settings where the training and test data are
drawn from the same domain. The domain mismatch problem for Conformer RNN-T has
not yet been intensively investigated, even though it is an important issue for
product-level speech recognition systems. In this study, we identified that
fully connected self-attention layers in the Conformer cause high deletion
errors, specifically on long-form out-domain utterances. To address this
problem, we introduce sparse self-attention layers for Conformer-based encoder
networks, which exploit local and generalized global information by pruning
most of the in-domain-fitted global connections. Furthermore, we propose a
state reset method that generalizes the prediction network to cope with
long-form utterances. Applying the proposed methods to an out-domain test set,
we obtained 24.6% and 6.5% relative character error rate (CER) reductions
compared to Conformers with fully connected and local self-attention layers,
respectively.
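A minimal sketch of the kind of sparse attention pattern the abstract describes: each encoder frame attends to a local window plus a small subset of global positions. The window size, the stride-based choice of which global keys to keep, and the function name below are illustrative assumptions; the paper's actual criterion for pruning in-domain-fitted global connections is not given in the abstract.

```python
import torch
import torch.nn.functional as F

def sparse_self_attention(q, k, v, local_window=16, global_stride=64):
    """Single-head attention restricted to a local band plus strided global keys.

    q, k, v: (batch, time, dim) tensors.
    """
    t = q.size(1)
    idx = torch.arange(t, device=q.device)
    # Local connections: each query attends to keys within +/- local_window frames.
    local = (idx[:, None] - idx[None, :]).abs() <= local_window
    # Sparse global connections: only every global_stride-th key stays reachable,
    # standing in for the "generalized global" links kept after pruning
    # (illustrative choice; the paper's pruning rule is not specified here).
    global_keep = (idx % global_stride == 0)[None, :]
    mask = local | global_keep                       # (time, time) boolean mask

    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: a 3-second chunk of 80-dim encoder states at a 10 ms frame rate.
x = torch.randn(1, 300, 80)
out = sparse_self_attention(x, x, x)                 # -> (1, 300, 80)
```

In a full Conformer encoder such a mask would be applied inside each multi-head self-attention module; the state reset method for the prediction network is a separate mechanism and is not sketched here.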
Related papers
- Locality-Aware Generalizable Implicit Neural Representation [54.93702310461174]
Generalizable implicit neural representation (INR) enables a single continuous function to represent multiple data instances.
We propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder.
Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks.
arXiv Detail & Related papers (2023-10-09T11:26:58Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - DS-TDNN: Dual-stream Time-delay Neural Network with Global-aware Filter
for Speaker Verification [3.0831477850153224]
We introduce a novel module called Global-aware Filter layer (GF layer) in this work.
We present a dual-stream TDNN architecture called DS-TDNN for automatic speaker verification (ASV).
Experiments on the VoxCeleb and SITW databases demonstrate that DS-TDNN achieves a relative improvement of 10% together with a relative reduction of 20% in computational cost.
arXiv Detail & Related papers (2023-03-20T10:58:12Z) - DeepSeer: Interactive RNN Explanation and Debugging via State
Abstraction [10.110976560799612]
Recurrent Neural Networks (RNNs) have been widely used in Natural Language Processing (NLP) tasks.
DeepSeer is an interactive system that provides both global and local explanations of RNN behavior.
arXiv Detail & Related papers (2023-03-02T21:08:17Z) - Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with a CTC-like lattice achieves better results than standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z) - CS-Rep: Making Speaker Verification Networks Embracing
Re-parameterization [27.38202134344989]
This study proposes cross-sequential re-parameterization (CS-Rep) to increase the inference speed and verification accuracy of models.
Rep-TDNN increases the actual inference speed by about 50% and reduces the EER by 10%.
arXiv Detail & Related papers (2021-10-26T08:00:03Z) - Adaptive Anomaly Detection for Internet of Things in Hierarchical Edge
Computing: A Contextual-Bandit Approach [81.5261621619557]
We propose an adaptive anomaly detection scheme with hierarchical edge computing (HEC).
We first construct multiple anomaly detection DNN models with increasing complexity, and associate each of them to a corresponding HEC layer.
Then, we design an adaptive model selection scheme that is formulated as a contextual-bandit problem and solved by using a reinforcement learning policy network.
arXiv Detail & Related papers (2021-08-09T08:45:47Z) - Towards Adversarially Robust and Domain Generalizable Stereo Matching by
Rethinking DNN Feature Backbones [14.569829985753346]
This paper shows that a type of weak white-box attack can cause state-of-the-art methods to fail.
The proposed method is tested on the SceneFlow dataset and the KITTI2015 benchmark.
It significantly improves the adversarial robustness, while retaining accuracy performance comparable to state-of-the-art methods.
arXiv Detail & Related papers (2021-07-31T22:44:18Z) - On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time stamping method achieves less than 50 ms word timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z) - Provable Generalization of SGD-trained Neural Networks of Any Width in
the Presence of Adversarial Label Noise [85.59576523297568]
We consider a one-hidden-layer leaky ReLU network of arbitrary width trained by gradient descent.
We prove that SGD produces neural networks that have classification accuracy competitive with that of the best halfspace over the distribution.
arXiv Detail & Related papers (2021-01-04T18:32:49Z)