Multi-Channel Transformer Transducer for Speech Recognition
- URL: http://arxiv.org/abs/2108.12953v1
- Date: Mon, 30 Aug 2021 01:50:51 GMT
- Title: Multi-Channel Transformer Transducer for Speech Recognition
- Authors: Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo
- Abstract summary: We present a novel speech recognition model, the Multi-Channel Transformer Transducer (MCTT).
MCTT features end-to-end multi-channel training, low computation cost, and low latency, making it suitable for streaming decoding in on-device speech recognition.
- Score: 15.268402294151468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-channel inputs offer several advantages over single-channel inputs for improving the robustness of on-device speech recognition systems. Recent work on the multi-channel transformer has proposed a way to incorporate such inputs into end-to-end ASR for improved accuracy. However, that approach has high computational complexity, which prevents it from being deployed in on-device systems. In this paper, we present a novel speech recognition model, the Multi-Channel Transformer Transducer (MCTT), which features end-to-end multi-channel training, low computation cost, and low latency, making it suitable for streaming decoding in on-device speech recognition. On a far-field in-house dataset, our MCTT outperforms stagewise multi-channel models with a transformer-transducer by up to 6.01% relative WER improvement (WERR). In addition, MCTT outperforms the multi-channel transformer by up to 11.62% WERR and is 15.8 times faster in terms of inference speed. We further show that the computational cost of MCTT can be reduced by constraining the future and previous context in attention computations.
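That last point, bounding how far each frame's attention can look into the past and future, is what keeps per-frame cost and latency fixed during streaming. The following is a minimal sketch of such limited-context self-attention, assuming a plain single-head NumPy implementation; the function and parameter names (band_mask, left_context, right_context) are illustrative and not taken from the paper.

```python
# Minimal sketch (not the authors' code) of limited-context self-attention:
# each frame attends only to a fixed number of previous and future frames,
# which bounds per-frame attention cost and latency for streaming.
import numpy as np

def band_mask(num_frames: int, left_context: int, right_context: int) -> np.ndarray:
    """Boolean mask: mask[q, k] is True when key frame k is visible to query frame q."""
    q = np.arange(num_frames)[:, None]
    k = np.arange(num_frames)[None, :]
    return (k >= q - left_context) & (k <= q + right_context)

def limited_context_attention(x: np.ndarray, left_context: int = 20, right_context: int = 0) -> np.ndarray:
    """Single-head scaled dot-product self-attention restricted to a banded context.

    x: (num_frames, dim) encoder features. The projection weights are random
    here purely for illustration; a real model would learn them.
    """
    num_frames, dim = x.shape
    rng = np.random.default_rng(0)
    w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(dim)
    # Frames outside the allowed band get -inf so they receive zero attention weight.
    scores = np.where(band_mask(num_frames, left_context, right_context), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Example: 50 frames of 64-dim features, 20 frames of look-back, no look-ahead.
frames = np.random.default_rng(1).standard_normal((50, 64))
print(limited_context_attention(frames).shape)  # (50, 64)
```

With right_context set to 0 the attention becomes causal and can run frame-synchronously; raising left_context trades extra computation for accuracy.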
Related papers
- Low-Latency Task-Oriented Communications with Multi-Round, Multi-Task Deep Learning [45.622060532244944]
We propose a multi-round, multi-task learning (MRMTL) approach for the dynamic update of channel uses in multi-round transmissions.
We show that MRMTL significantly improves the efficiency of task-oriented communications.
arXiv Detail & Related papers (2024-11-15T17:48:06Z)
- End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis [0.0]
We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame cross-channel attention and a speaker-attributed Transformer-based decoder.
arXiv Detail & Related papers (2023-10-16T06:40:18Z)
- TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding aggregate multi-scale features from an utterance using multi-branch network architectures.
We propose an effective temporal multi-scale (TMS) model in which multi-scale branches can be designed efficiently within a speaker embedding network with almost no increase in computational cost.
arXiv Detail & Related papers (2022-03-17T05:49:35Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset [37.619200507404145]
We develop Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset.
We combine the idea of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model (a toy sketch of this chunk-wise processing appears after this list).
We demonstrate that T-T outperforms the hybrid model, RNN Transducer (RNN-T), and streamable Transformer attention-based encoder-decoder model in the streaming scenario.
arXiv Detail & Related papers (2020-10-22T03:01:21Z)
- End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
arXiv Detail & Related papers (2020-02-10T16:29:26Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
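Several of the streaming entries above (the chunk-wise Transformer Transducer and the time-restricted self-attention encoder) rely on the same basic mechanism: the encoder consumes audio in fixed-size chunks, and each chunk attends to itself plus a bounded cache of previous chunks. Below is a toy sketch under that assumption, in NumPy with learned projections omitted; chunkwise_stream, chunk_size, and cache_chunks are invented names, not APIs from any of these papers.

```python
# Toy sketch (assumed, not from the papers above) of chunk-wise streaming attention:
# each chunk attends to itself plus a fixed-length cache of earlier chunks
# (Transformer-XL style), so latency is bounded by the chunk size, not the utterance.
import numpy as np

def attend(q, k, v):
    """Plain scaled dot-product attention (projections omitted for brevity)."""
    dim = q.shape[-1]
    scores = q @ k.T / np.sqrt(dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def chunkwise_stream(frames: np.ndarray, chunk_size: int = 8, cache_chunks: int = 2) -> np.ndarray:
    """Process frames chunk by chunk; each chunk attends to itself plus a bounded cache."""
    cache: list[np.ndarray] = []   # states from previous chunks (fixed look-back)
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        context = np.concatenate(cache + [chunk]) if cache else chunk
        outputs.append(attend(chunk, context, context))   # queries come from the current chunk only
        cache = (cache + [chunk])[-cache_chunks:]          # drop chunks beyond the look-back window
    return np.concatenate(outputs)

# Example: 64 frames of 32-dim features emitted 8 frames at a time.
feats = np.random.default_rng(0).standard_normal((64, 32))
print(chunkwise_stream(feats).shape)  # (64, 32)
```

Because the cache length is fixed, memory and per-chunk compute stay constant regardless of utterance length, which is what makes first-pass streaming decoding feasible.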
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.