End-to-End Multi-speaker Speech Recognition with Transformer
- URL: http://arxiv.org/abs/2002.03921v2
- Date: Thu, 13 Feb 2020 00:50:39 GMT
- Title: End-to-End Multi-speaker Speech Recognition with Transformer
- Authors: Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji
Watanabe
- Abstract summary: We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
- Score: 88.22355110349933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, fully recurrent neural network (RNN) based end-to-end models have
been proven to be effective for multi-speaker speech recognition in both the
single-channel and multi-channel scenarios. In this work, we explore the use of
Transformer models for these tasks by focusing on two aspects. First, we
replace the RNN-based encoder-decoder in the speech recognition model with a
Transformer architecture. Second, in order to use the Transformer in the
masking network of the neural beamformer in the multi-channel case, we modify
the self-attention component to be restricted to a segment rather than the
whole sequence in order to reduce computation. Besides the model architecture
improvements, we also incorporate an external dereverberation preprocessing,
the weighted prediction error (WPE), enabling our model to handle reverberated
signals. Experiments on the spatialized wsj1-2mix corpus show that the
Transformer-based models achieve 40.9% and 25.6% relative WER reduction, down
to 12.1% and 6.4% WER, under the anechoic condition in single-channel and
multi-channel tasks, respectively, while in the reverberant case, our methods
achieve 41.5% and 13.8% relative WER reduction, down to 16.5% and 15.2% WER.
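The abstract only names the segment-restricted self-attention modification, so a minimal PyTorch sketch of that idea is given below. This is not the authors' implementation: the segment length, model sizes, and exact masking scheme are illustrative assumptions.

```python
import torch

def segment_restricted_mask(seq_len, segment_size):
    """Boolean attention mask (True = blocked): each frame may only attend
    to frames inside its own fixed-length segment, not the whole sequence."""
    seg_ids = torch.arange(seq_len) // segment_size        # segment index per frame
    return seg_ids.unsqueeze(0) != seg_ids.unsqueeze(1)    # (seq_len, seq_len)

# Illustrative sizes only; the paper's actual configuration may differ.
seq_len, d_model, n_heads, segment_size = 300, 256, 4, 64
x = torch.randn(1, seq_len, d_model)                       # (batch, time, feature)
attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
mask = segment_restricted_mask(seq_len, segment_size)
out, _ = attn(x, x, x, attn_mask=mask)                     # attention confined to each segment
print(out.shape)                                           # torch.Size([1, 300, 256])
```

With a block-diagonal mask like this, the attention cost grows with the segment length rather than with the square of the full sequence length, which matches the abstract's stated motivation of reducing computation in the masking network of the neural beamformer.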
Related papers
- Improving Transformer-based Networks With Locality For Automatic Speaker Verification [40.06788577864032]
Transformer-based architectures have been explored for speaker embedding extraction.
In this study, we enhance the Transformer with locality modeling in two directions.
We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset.
arXiv Detail & Related papers (2023-02-17T01:04:51Z)
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
- Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on the benchmark dataset AISHELL-2 and two large-scale Mandarin speech corpora of 5,000 hours and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z)
- Container: Context Aggregation Network [83.12004501984043]
Recent findings show that a simple solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [60.97904439526213]
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks.
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
arXiv Detail & Related papers (2021-04-22T17:07:41Z)
- ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition [21.554020483837096]
We present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures.
In the hybrid ASR framework, the multistream CNN acoustic model processes an input of speech frames in multiple parallel pipelines.
We further improve the performance via N-best rescoring using a 24-layer self-attentive SRU language model, achieving WERs of 1.75% on test-clean and 4.46% on test-other.
arXiv Detail & Related papers (2020-05-21T05:18:34Z)
- Simplified Self-Attention for Transformer-based End-to-End Speech Recognition [56.818507476125895]
We propose a simplified self-attention (SSAN) layer which employs an FSMN memory block instead of projection layers to form query and key vectors.
We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1, internal 1000-hour and 20,000-hour large-scale Mandarin tasks.
arXiv Detail & Related papers (2020-05-21T04:55:59Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
- Research on Modeling Units of Transformer Transducer for Mandarin Speech Recognition [13.04590477394637]
We propose a novel Transformer transducer with a combined architecture of self-attention Transformer and RNN.
Experiments are conducted on about 12,000 hours of Mandarin speech with sampling rates of 8 kHz and 16 kHz.
It yields average relative word error rate (WER) reductions of 14.4% and 44.1% compared with models that use syllable initial/final with tone and Chinese characters as modeling units, respectively.
arXiv Detail & Related papers (2020-04-26T05:12:52Z)
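To make the contrast between modeling units in the last entry above concrete, here is a toy Python illustration of Chinese-character units versus syllable initial/final-with-tone units. The example word and the hand-written mini lexicon are assumptions for illustration only, not taken from the paper.

```python
# Toy illustration: two modeling-unit inventories for the Mandarin word "上海".

def char_units(word):
    """Chinese-character modeling units: one token per character."""
    return list(word)

def initial_final_units(word, lexicon):
    """Syllable initial/final-with-tone units, via a pronunciation lexicon."""
    units = []
    for ch in word:
        initial, final_with_tone = lexicon[ch]
        if initial:                    # some syllables have no initial
            units.append(initial)
        units.append(final_with_tone)
    return units

# Hand-written lexicon covering only the example word (hypothetical helper data).
lexicon = {"上": ("sh", "ang4"), "海": ("h", "ai3")}

print(char_units("上海"))                    # ['上', '海']
print(initial_final_units("上海", lexicon))  # ['sh', 'ang4', 'h', 'ai3']
```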
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.