Transformer with Bidirectional Decoder for Speech Recognition
- URL: http://arxiv.org/abs/2008.04481v1
- Date: Tue, 11 Aug 2020 02:12:42 GMT
- Title: Transformer with Bidirectional Decoder for Speech Recognition
- Authors: Xi Chen and Songyang Zhang and Dandan Song and Peng Ouyang and Shouyi Yin
- Abstract summary: We introduce a bidirectional speech transformer to utilize the different directional contexts simultaneously.
Specifically, the outputs of our proposed transformer include a left-to-right target and a right-to-left target.
In the inference stage, we use the introduced bidirectional beam search method, which can generate both left-to-right and right-to-left candidates.
- Score: 32.56014992915183
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention-based models have made tremendous progress on end-to-end automatic speech recognition (ASR) recently. However, conventional transformer-based approaches usually generate the output sequence token by token from left to right, leaving the right-to-left contexts unexploited. In this work, we introduce a bidirectional speech transformer to utilize the different directional contexts simultaneously. Specifically, the outputs of our proposed transformer include a left-to-right target and a right-to-left target. In the inference stage, we use the introduced bidirectional beam search method, which can generate both left-to-right and right-to-left candidates and determine the best hypothesis by its score.
To evaluate our proposed speech transformer with a bidirectional decoder (STBD), we conduct extensive experiments on the AISHELL-1 dataset. The results show that STBD achieves a 3.6% relative CER reduction (CERR) over the unidirectional speech transformer baseline. Moreover, the strongest model in this paper, STBD-Big, achieves 6.64% CER on the test set without language model rescoring or any extra data augmentation strategies.
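The decoding procedure described in the abstract lends itself to a short sketch: run beam search once per direction, reverse the right-to-left hypotheses, and keep the candidate with the best score. The toy vocabulary, the step-function interface, and the dummy scorer below are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of bidirectional beam search with score-based selection.
# VOCAB, the step functions, and the scoring are illustrative assumptions.
import math

VOCAB = ["<eos>", "a", "b", "c"]

def beam_search(step_fn, beam_size=2, max_len=8):
    """Generic beam search; step_fn(prefix) returns log-probs over VOCAB."""
    beams = [([], 0.0)]            # (token list, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = step_fn(prefix)
            for tok, lp in zip(VOCAB, log_probs):
                if tok == "<eos>":
                    finished.append((prefix, score + lp))
                else:
                    candidates.append((prefix + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return finished + beams

def select_bidirectional(step_l2r, step_r2l, beam_size=2):
    """Decode in both directions, reverse R2L outputs, keep the best score."""
    l2r = beam_search(step_l2r, beam_size)
    r2l = [(p[::-1], s) for p, s in beam_search(step_r2l, beam_size)]
    return max(l2r + r2l, key=lambda c: c[1])

def dummy_step(prefix):
    """Stand-in for a decoder step: strongly favors stopping after 2 tokens."""
    p_eos = 0.8 if len(prefix) >= 2 else 0.05
    p_tok = (1.0 - p_eos) / (len(VOCAB) - 1)
    return [math.log(p_eos)] + [math.log(p_tok)] * (len(VOCAB) - 1)

best_tokens, best_score = select_bidirectional(dummy_step, dummy_step)
```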
Related papers
- Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement [17.645026729525462]
We propose a transformer-based end-to-end model to extract a target speaker's speech from a mixed audio signal.
Our experiments show that the use of a dual-path transformer in the separator backbone, along with the proposed training paradigm, improves the CNN baseline by 3.12 dB.
arXiv Detail & Related papers (2024-09-02T16:11:12Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance model performance with subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting a spectrogram in the second pass.
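As a rough illustration of the two-pass structure this summary describes, the sketch below wires a first-pass text decoder into a second-pass unit decoder; every function name here is a hypothetical stand-in, not one of UnitY's actual components.

```python
# Two-pass S2ST shape: text first, then discrete acoustic units.
# `encode`, `decode_text`, and `decode_units` are hypothetical callables.
def translate_speech(encode, decode_text, decode_units, source_audio):
    encoder_states = encode(source_audio)           # speech encoder
    subwords = decode_text(encoder_states)          # first pass: target subwords
    units = decode_units(encoder_states, subwords)  # second pass: discrete units
    return units                                    # a vocoder would render audio
```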
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
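The single-label reformulation mentioned above can be made concrete with power-set encoding, where each frame's set of active speakers maps to one class. This is an illustrative reading of the idea, not necessarily SEND's exact scheme.

```python
# Power-set encoding: a frame with any subset of active speakers gets one label.
from itertools import combinations

def powerset_labels(num_speakers):
    """Map every subset of active speakers to a single class index."""
    labels, idx = {}, 0
    for k in range(num_speakers + 1):
        for subset in combinations(range(num_speakers), k):
            labels[frozenset(subset)] = idx
            idx += 1
    return labels

labels = powerset_labels(2)        # {}, {0}, {1}, {0,1} -> 4 classes
overlap_frame = frozenset({0, 1})  # both speakers active at this frame
print(labels[overlap_frame])       # one label, so plain cross-entropy applies
```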
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Non-autoregressive Transformer with Unified Bidirectional Decoder for Automatic Speech Recognition [20.93536420298548]
We propose a new non-autoregressive transformer with a unified bidirectional decoder (NAT-UBD).
NAT-UBD can achieve character error rates (CERs) of 5.0%/5.5% on the Aishell1 dev/test sets, outperforming all previous NAR transformer models.
arXiv Detail & Related papers (2021-09-14T13:39:39Z)
- BeamTransformer: Microphone Array-based Overlapping Speech Detection [52.11665331754917]
BeamTransformer seeks to optimize the modeling of sequential relationships among signals from different spatial directions.
BeamTransformer excels at learning to identify the relationships among different beam sequences.
BeamTransformer goes one step further, as speech from overlapped speakers has been internally separated into different beams.
arXiv Detail & Related papers (2021-09-09T06:10:48Z)
- Duplex Sequence-to-Sequence Learning for Reversible Machine Translation [53.924941333388155]
Sequence-to-sequence (seq2seq) problems such as machine translation are bidirectional.
We propose a duplex seq2seq neural network, REDER, and apply it to machine translation.
Experiments on widely-used machine translation benchmarks verify that REDER achieves the first success of reversible machine translation.
arXiv Detail & Related papers (2021-05-07T18:21:57Z)
- Transformer Based Deliberation for Two-Pass Speech Recognition [46.86118010771703]
Speech recognition systems must generate words quickly while also producing accurate results.
Two-pass models meet these requirements by employing a first-pass decoder that quickly emits words and a second-pass decoder that requires more context but is more accurate.
Previous work has established that a deliberation network can be an effective second-pass model.
arXiv Detail & Related papers (2021-01-27T18:05:22Z)
- Fast Interleaved Bidirectional Sequence Generation [90.58793284654692]
We introduce a decoder that generates target words from the left-to-right and right-to-left directions simultaneously.
We show that we can easily convert a standard architecture for unidirectional decoding into a bidirectional decoder.
Our interleaved bidirectional decoder (IBDecoder) retains the model simplicity and training efficiency of the standard Transformer.
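The interleaving described above can be illustrated by the target reordering below: a single left-to-right pass then effectively emits tokens from both ends at once. This is a sketch of the core idea; the paper's exact scheme may differ in its details.

```python
# Reorder a target so one L2R decoder generates from both ends alternately.
def interleave(tokens):
    """[y1, y2, ..., yn] -> [y1, yn, y2, yn-1, ...]"""
    out = []
    i, j = 0, len(tokens) - 1
    while i <= j:
        out.append(tokens[i])
        if i != j:
            out.append(tokens[j])
        i, j = i + 1, j - 1
    return out

def deinterleave(tokens):
    """Invert interleave(): even positions run L2R, odd positions run R2L."""
    left, right = tokens[0::2], tokens[1::2]
    return left + right[::-1]

assert deinterleave(interleave(list("abcde"))) == list("abcde")
```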
arXiv Detail & Related papers (2020-10-27T17:38:51Z)
- Open-Domain Dialogue Generation Based on Pre-trained Language Models [23.828348485513043]
Pre-trained language models have been successfully used in response generation for open-domain dialogue.
Three main frameworks have been proposed: (1) Transformer-ED, using the Transformer encoder and decoder separately for source and target sentences; (2) Transformer-Dec, using the Transformer decoder for both source and target sentences; and (3) Transformer-MLM, using the Transformer decoder that applies bidirectional attention on the source side and left-to-right attention on the target side with a masked language model objective.
We compare these frameworks on 3 datasets, and our comparison reveals that the best framework uses bidirectional attention on the source side and does not separate encoder and decoder.
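The attention pattern that distinguishes the winning setup, bidirectional over the source and left-to-right over the target, can be written as a single mask. The helper below is a dependency-free illustration, not code from the paper.

```python
# Hybrid attention mask: full attention over S source positions,
# causal attention over T target positions.
def hybrid_attention_mask(S, T):
    """mask[i][j] = True means position i may attend to position j."""
    n = S + T
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < S:
                mask[i][j] = True    # every position can see the source
            else:
                mask[i][j] = j <= i  # target positions see only the past
    return mask

for row in hybrid_attention_mask(3, 3):
    print(["x" if v else "." for v in row])
```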
arXiv Detail & Related papers (2020-10-24T04:52:28Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
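As a generic illustration of relative position encoding, the sketch below adds a learned bias indexed by the clipped distance i - j to the attention logits. The scalar-bias form and the clipping are simplifying assumptions rather than the paper's exact formulation.

```python
# Relative position bias: attention logits depend on distance, not position.
import random

MAX_DIST = 4  # relative distances are clipped to [-MAX_DIST, MAX_DIST]

# Stand-in for learned parameters: one scalar bias per clipped distance.
bias_table = {d: random.uniform(-0.1, 0.1) for d in range(-MAX_DIST, MAX_DIST + 1)}

def relative_bias(seq_len):
    """bias[i][j] is added to the attention logit of query i and key j."""
    clip = lambda d: max(-MAX_DIST, min(MAX_DIST, d))
    return [[bias_table[clip(i - j)] for j in range(seq_len)]
            for i in range(seq_len)]

B = relative_bias(6)  # add B to the raw attention scores before the softmax
```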
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.