End-to-End Multi-speaker Speech Recognition with Transformer
- URL: http://arxiv.org/abs/2002.03921v2
- Date: Thu, 13 Feb 2020 00:50:39 GMT
- Title: End-to-End Multi-speaker Speech Recognition with Transformer
- Authors: Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji
Watanabe
- Abstract summary: We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
- Score: 88.22355110349933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, fully recurrent neural network (RNN) based end-to-end models have
been proven to be effective for multi-speaker speech recognition in both the
single-channel and multi-channel scenarios. In this work, we explore the use of
Transformer models for these tasks by focusing on two aspects. First, we
replace the RNN-based encoder-decoder in the speech recognition model with a
Transformer architecture. Second, in order to use the Transformer in the
masking network of the neural beamformer in the multi-channel case, we modify
the self-attention component to be restricted to a segment rather than the
whole sequence in order to reduce computation. Besides the model architecture
improvements, we also incorporate an external dereverberation preprocessing,
the weighted prediction error (WPE), enabling our model to handle reverberated
signals. Experiments on the spatialized wsj1-2mix corpus show that the
Transformer-based models achieve 40.9% and 25.6% relative WER reduction, down
to 12.1% and 6.4% WER, under the anechoic condition in single-channel and
multi-channel tasks, respectively, while in the reverberant case, our methods
achieve 41.5% and 13.8% relative WER reduction, down to 16.5% and 15.2% WER.
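The abstract only names the segment-restricted self-attention modification, so a minimal PyTorch sketch of that idea is given below. This is not the authors' implementation: the segment length, model sizes, and exact masking scheme are illustrative assumptions.

```python
import torch

def segment_restricted_mask(seq_len, segment_size):
    """Boolean attention mask (True = blocked): each frame may only attend
    to frames inside its own fixed-length segment, not the whole sequence."""
    seg_ids = torch.arange(seq_len) // segment_size        # segment index per frame
    return seg_ids.unsqueeze(0) != seg_ids.unsqueeze(1)    # (seq_len, seq_len)

# Illustrative sizes only; the paper's actual configuration may differ.
seq_len, d_model, n_heads, segment_size = 300, 256, 4, 64
x = torch.randn(1, seq_len, d_model)                       # (batch, time, feature)
attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
mask = segment_restricted_mask(seq_len, segment_size)
out, _ = attn(x, x, x, attn_mask=mask)                     # attention confined to each segment
print(out.shape)                                           # torch.Size([1, 300, 256])
```

With a block-diagonal mask like this, the attention cost grows with the segment length rather than with the square of the full sequence length, which matches the abstract's stated motivation of reducing computation in the masking network of the neural beamformer.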
Related papers
- Improving Transformer-based Networks With Locality For Automatic Speaker Verification [40.06788577864032]
Transformer-based architectures have been explored for speaker embedding extraction.
In this study, we enhance the Transformer with locality modeling in two directions.
We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset.
arXiv Detail & Related papers (2023-02-17T01:04:51Z)
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
- Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on the benchmark dataset AISHELL-2 and two large-scale Mandarin speech corpora of 5,000 hours and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z)
- Container: Context Aggregation Network [83.12004501984043]
Recent findings show that a simple solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [60.97904439526213]
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks.
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
arXiv Detail & Related papers (2021-04-22T17:07:41Z)
- ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition [21.554020483837096]
We present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures.
In the hybrid ASR framework, the multistream CNN acoustic model processes an input of speech frames in multiple parallel pipelines.
We further improve the performance via N-best rescoring using a 24-layer self-attentive SRU language model, achieving WERs of 1.75% on test-clean and 4.46% on test-other.
arXiv Detail & Related papers (2020-05-21T05:18:34Z)
- Simplified Self-Attention for Transformer-based End-to-End Speech Recognition [56.818507476125895]
We propose a simplified self-attention (SSAN) layer which employs an FSMN memory block instead of projection layers to form query and key vectors.
We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1, internal 1000-hour and 20,000-hour large-scale Mandarin tasks.
arXiv Detail & Related papers (2020-05-21T04:55:59Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
- Research on Modeling Units of Transformer Transducer for Mandarin Speech Recognition [13.04590477394637]
We propose a novel Transformer transducer with a combined architecture of self-attention Transformer and RNN.
Experiments are conducted on about 12,000 hours of Mandarin speech with sampling rates of 8 kHz and 16 kHz.
It yields average relative word error rate (WER) reductions of 14.4% and 44.1% compared with models that use syllable initial/final with tone and Chinese characters as modeling units, respectively.
arXiv Detail & Related papers (2020-04-26T05:12:52Z)
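To make the contrast between modeling units in the last entry above concrete, here is a toy Python illustration of Chinese-character units versus syllable initial/final-with-tone units. The example word and the hand-written mini lexicon are assumptions for illustration only, not taken from the paper.

```python
# Toy illustration: two modeling-unit inventories for the Mandarin word "上海".

def char_units(word):
    """Chinese-character modeling units: one token per character."""
    return list(word)

def initial_final_units(word, lexicon):
    """Syllable initial/final-with-tone units, via a pronunciation lexicon."""
    units = []
    for ch in word:
        initial, final_with_tone = lexicon[ch]
        if initial:                    # some syllables have no initial
            units.append(initial)
        units.append(final_with_tone)
    return units

# Hand-written lexicon covering only the example word (hypothetical helper data).
lexicon = {"上": ("sh", "ang4"), "海": ("h", "ai3")}

print(char_units("上海"))                    # ['上', '海']
print(initial_final_units("上海", lexicon))  # ['sh', 'ang4', 'h', 'ai3']
```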
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.