Multistream CNN for Robust Acoustic Modeling
- URL: http://arxiv.org/abs/2005.10470v2
- Date: Sun, 25 Apr 2021 05:47:28 GMT
- Title: Multistream CNN for Robust Acoustic Modeling
- Authors: Kyu J. Han, Jing Pan, Venkata Krishna Naveen Tadala, Tao Ma and Dan
Povey
- Abstract summary: Multistream CNN is a novel neural network architecture for robust acoustic modeling in speech recognition tasks.
We show consistent improvements against Kaldi's best TDNN-F model across various data sets.
In terms of real-time factor, multistream CNN outperforms the baseline TDNN-F by 15%.
- Score: 17.155489701060542
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes multistream CNN, a novel neural network architecture for
robust acoustic modeling in speech recognition tasks. The proposed architecture
processes input speech with diverse temporal resolutions by applying different
dilation rates to convolutional neural networks across multiple streams to
achieve robustness. The dilation rates are selected from multiples of the
3-frame sub-sampling rate. Each stream stacks TDNN-F layers (a variant of
1D CNN), and the output embedding vectors from the streams are concatenated and then
projected to the final layer. We validate the effectiveness of the proposed
multistream CNN architecture by showing consistent improvements against Kaldi's
best TDNN-F model across various data sets. Multistream CNN improves the WER of
the test-other set in the LibriSpeech corpus by 12% (relative). On custom data
from ASAPP's production ASR system for a contact center, it records a relative
WER improvement of 11% on customer-channel audio, demonstrating its robustness
to data in the wild. In terms of real-time factor, multistream CNN outperforms the
baseline TDNN-F by 15%, which also suggests its practicality on production
systems. When combined with self-attentive SRU LM rescoring, multistream CNN
helps ASAPP achieve the best WER of 1.75% on test-clean in
LibriSpeech.
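The stream structure described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it shows several parallel 1D dilated convolutions (dilation rates 3, 6, 9, i.e., multiples of the 3-frame sub-sampling rate) whose per-stream embeddings are concatenated and projected. The factored TDNN-F layer structure, layer counts, and all shapes here are illustrative assumptions.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Valid' 1D dilated convolution over time.

    x: (T, D) input frames; w: (K, D, H) kernel with K taps.
    Returns (T - (K-1)*dilation, H) outputs.
    """
    K, D, H = w.shape
    T_out = x.shape[0] - (K - 1) * dilation
    out = np.zeros((T_out, H))
    for t in range(T_out):
        for k in range(K):
            # Taps are spaced `dilation` frames apart.
            out[t] += x[t + k * dilation] @ w[k]
    return out

def multistream_forward(x, stream_weights, proj):
    """Run each stream, concatenate stream embeddings, project.

    stream_weights: list of (kernel, dilation) pairs, one per stream.
    """
    embeddings = []
    for w, d in stream_weights:
        y = dilated_conv1d(x, w, d)
        embeddings.append(y[-1])      # last-frame embedding per stream
    z = np.concatenate(embeddings)    # concatenated stream embeddings
    return z @ proj                   # projection to the final layer

# Toy usage with illustrative shapes.
rng = np.random.default_rng(0)
T, D, H = 100, 40, 8                  # frames, feature dim, stream dim
streams = [(rng.standard_normal((3, D, H)), d) for d in (3, 6, 9)]
proj = rng.standard_normal((3 * H, 10))
x = rng.standard_normal((T, D))
out = multistream_forward(x, streams, proj)
```

With three streams of width 8, the concatenated embedding has 24 dimensions before the final 10-way projection; larger dilations simply widen a stream's temporal receptive field without adding parameters.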
Related papers
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z)
- An Efficient Evolutionary Deep Learning Framework Based on Multi-source Transfer Learning to Evolve Deep Convolutional Neural Networks [8.40112153818812]
Convolutional neural networks (CNNs) have steadily achieved better performance over the years by introducing more complex topologies and enlarging capacity towards deeper and wider CNNs.
The computational cost is still the bottleneck of automatically designing CNNs.
In this paper, inspired by transfer learning, a new evolutionary computation based framework is proposed to efficiently evolve CNNs.
arXiv Detail & Related papers (2022-12-07T20:22:58Z)
- Attention-based Feature Compression for CNN Inference Offloading in Edge Computing [93.67044879636093]
This paper studies the computational offloading of CNN inference in device-edge co-inference systems.
We propose a novel autoencoder-based CNN architecture (AECNN) for effective feature extraction at the end device.
Experiments show that AECNN can compress the intermediate data by more than 256x with only about 4% accuracy loss.
arXiv Detail & Related papers (2022-11-24T18:10:01Z)
- Exploiting Hybrid Models of Tensor-Train Networks for Spoken Command Recognition [9.262289183808035]
This work aims to design a low complexity spoken command recognition (SCR) system.
We exploit a deep hybrid architecture of a tensor-train (TT) network to build an end-to-end SCR pipeline.
Our proposed CNN+(TT-DNN) model attains a competitive accuracy of 96.31% with 4 times fewer model parameters than the CNN model.
arXiv Detail & Related papers (2022-01-11T05:57:38Z)
- The Mind's Eye: Visualizing Class-Agnostic Features of CNNs [92.39082696657874]
We propose an approach to visually interpret CNN features given a set of images by creating corresponding images that depict the most informative features of a specific layer.
Our method uses a dual-objective activation and distance loss, without requiring a generator network nor modifications to the original model.
arXiv Detail & Related papers (2021-01-29T07:46:39Z)
- Deep Networks for Direction-of-Arrival Estimation in Low SNR [89.45026632977456]
We introduce a Convolutional Neural Network (CNN) that is trained from multi-channel data of the true array manifold matrix.
We train a CNN in the low-SNR regime to predict DoAs across all SNRs.
Our robust solution can be applied in several fields, ranging from wireless array sensors to acoustic microphones or sonars.
arXiv Detail & Related papers (2020-11-17T12:52:18Z)
- Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement [53.47564132861866]
We find that a hybrid architecture, namely CNN-TT, is capable of maintaining a good quality performance with a reduced model parameter size.
CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality.
arXiv Detail & Related papers (2020-07-25T22:21:05Z)
- A temporal-to-spatial deep convolutional neural network for classification of hand movements from multichannel electromyography data [0.14502611532302037]
We make the novel contribution of proposing and evaluating a design for the early processing layers in the deep CNN for multichannel sEMG.
We propose a novel temporal-to-spatial (TtS) CNN architecture, where the first layer performs convolution separately on each sEMG channel to extract temporal features.
We find that our novel TtS CNN design achieves 66.6% per-class accuracy on database 1, and 67.8% on database 2.
arXiv Detail & Related papers (2020-07-16T09:11:26Z)
- ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition [21.554020483837096]
We present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures.
In the hybrid ASR framework, the multistream CNN acoustic model processes an input of speech frames in multiple parallel pipelines.
We further improve the performance via N-best rescoring using a 24-layer self-attentive SRU language model, achieving WERs of 1.75% on test-clean and 4.46% on test-other.
arXiv Detail & Related papers (2020-05-21T05:18:34Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test-clean/test-other.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)