Multi-Encoder Learning and Stream Fusion for Transformer-Based
End-to-End Automatic Speech Recognition
- URL: http://arxiv.org/abs/2104.00120v1
- Date: Wed, 31 Mar 2021 21:07:43 GMT
- Title: Multi-Encoder Learning and Stream Fusion for Transformer-Based
End-to-End Automatic Speech Recognition
- Authors: Timo Lohrenz, Zhengyang Li, Tim Fingscheidt
- Abstract summary: We investigate various fusion techniques for the all-attention-based encoder-decoder architecture known as the transformer.
We introduce a novel multi-encoder learning method that performs a weighted combination of two encoder-decoder multi-head attention outputs only during training.
We achieve state-of-the-art performance for transformer-based models on Wall Street Journal with a significant WER reduction of 19% relative compared to the current benchmark approach.
- Score: 30.941564693248512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stream fusion, also known as system combination, is a common technique in
automatic speech recognition for traditional hybrid hidden Markov model
approaches, yet mostly unexplored for modern deep neural network end-to-end
model architectures. Here, we investigate various fusion techniques for the
all-attention-based encoder-decoder architecture known as the transformer,
striving to achieve optimal fusion by investigating different fusion levels in
an example single-microphone setting with fusion of standard magnitude and
phase features. We introduce a novel multi-encoder learning method that
performs a weighted combination of two encoder-decoder multi-head attention
outputs only during training. Then, employing only the magnitude feature
encoder in inference, we show consistent improvements on Wall Street Journal
(WSJ) with a language model and on LibriSpeech, without any increase in runtime
or parameters. Combining two such multi-encoder-trained models by a simple late
fusion in inference, we achieve state-of-the-art performance for
transformer-based models on WSJ with a significant WER reduction of 19%
relative compared to the current benchmark approach.
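The two mechanisms described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the scalar weight `lam`, and the tensor shapes are assumptions made for clarity.

```python
import numpy as np

def combine_cross_attention(attn_mag, attn_phase, lam=0.5, training=True):
    """Multi-encoder learning: a weighted combination of the encoder-decoder
    multi-head attention outputs of a magnitude-feature encoder and a
    phase-feature encoder, applied only during training.  At inference, only
    the magnitude branch is used, so the deployed model's runtime and
    parameter count do not grow."""
    if training:
        return lam * attn_mag + (1.0 - lam) * attn_phase
    return attn_mag  # inference: magnitude-feature encoder only

def late_fusion(log_probs_a, log_probs_b, weight=0.5):
    """Simple late fusion in inference: a weighted sum of two models'
    output log-probabilities."""
    return weight * log_probs_a + (1.0 - weight) * log_probs_b
```

The training-only combination is what lets inference run on a single encoder; the late-fusion step is the additional system combination applied on top of two such trained models.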
Related papers
- U-MixFormer: UNet-like Transformer with Mix-Attention for Efficient
Semantic Segmentation [0.0]
CNN-based U-Net has seen significant progress in high-resolution medical imaging and remote sensing, alongside the success of transformer architectures in semantic segmentation.
This dual success inspired us to merge the strengths of both, leading to the inception of a U-Net-based vision transformer decoder.
We propose a novel transformer decoder, U-MixFormer, built upon the U-Net structure, designed for efficient semantic segmentation.
arXiv Detail & Related papers (2023-12-11T10:19:42Z)
- Transformer Fusion with Optimal Transport [25.022849817421964]
Fusion is a technique for merging multiple independently-trained neural networks in order to combine their capabilities.
This paper presents a systematic approach for fusing two or more transformer-based networks exploiting Optimal Transport to (soft-)align the various architectural components.
arXiv Detail & Related papers (2023-10-09T13:40:31Z)
- Complexity Matters: Rethinking the Latent Space for Generative Modeling [65.64763873078114]
In generative modeling, numerous successful approaches leverage a low-dimensional latent space, e.g., Stable Diffusion.
In this study, we aim to shed light on this under-explored topic by rethinking the latent space from the perspective of model complexity.
arXiv Detail & Related papers (2023-07-17T07:12:29Z)
- Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization [60.91600465922932]
We present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder.
Our approach provides test-time recall-vs-computational cost trade-offs superior to the current widely-used methods.
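Roughly, replacing a dual-encoder with a factorization of cross-encoder scores can be illustrated as below. This is a simplified NumPy sketch under stated assumptions; the SVD-based factorization and the function names are illustrative, not the paper's exact method.

```python
import numpy as np

def fit_item_embeddings(score_matrix, rank):
    """Low-rank factorization S ~= Q @ E.T of a (queries x items) matrix of
    cross-encoder scores, yielding dense item embeddings E for fast search."""
    U, s, Vt = np.linalg.svd(score_matrix, full_matrices=False)
    return Vt[:rank].T * s[:rank]          # shape: (items, rank)

def embed_query(item_scores, item_embeddings):
    """Embed a new query from its cross-encoder scores on known items by
    solving a small least-squares problem; nearest-neighbor search then
    runs against the item embeddings instead of the cross-encoder."""
    q, *_ = np.linalg.lstsq(item_embeddings, item_scores, rcond=None)
    return q                                # shape: (rank,)
```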
arXiv Detail & Related papers (2022-10-23T00:32:04Z)
- Adaptive re-calibration of channel-wise features for Adversarial Audio Classification [0.0]
We propose a recalibration of features using attention feature fusion for synthetic speech detection.
We compare its performance against different detection methods including End2End models and Resnet-based models.
We also demonstrate that combining Linear Frequency Cepstral Coefficients (LFCC) and Mel-Frequency Cepstral Coefficients (MFCC) using the attentional feature fusion technique creates better input feature representations.
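A sigmoid-gated channel recalibration of this kind can be sketched as follows. This is a simplified NumPy illustration; a trained detector would compute the gate with small learned layers rather than raw channel means, and the shapes are assumptions.

```python
import numpy as np

def attentional_feature_fusion(feat_a, feat_b):
    """Fuse two feature maps of shape (channels, frames), e.g. LFCC and
    MFCC streams, with a per-channel sigmoid gate derived from the summed
    features' channel-wise statistics."""
    context = (feat_a + feat_b).mean(axis=1)        # per-channel descriptor
    gate = 1.0 / (1.0 + np.exp(-context))           # sigmoid gate in (0, 1)
    return gate[:, None] * feat_a + (1.0 - gate)[:, None] * feat_b
```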
arXiv Detail & Related papers (2022-10-21T04:21:56Z)
- Multimodal Image Fusion based on Hybrid CNN-Transformer and Non-local Cross-modal Attention [12.167049432063132]
We present a hybrid model consisting of a convolutional encoder and a Transformer-based decoder to fuse multimodal images.
A branch fusion module is designed to adaptively fuse the features of the two branches.
arXiv Detail & Related papers (2022-10-18T13:30:52Z)
- String-based Molecule Generation via Multi-decoder VAE [56.465033997245776]
We investigate the problem of string-based molecular generation via variational autoencoders (VAEs).
We propose a simple, yet effective idea to improve the performance of VAE for the task.
In our experiments, the proposed VAE model particularly performs well for generating a sample from out-of-domain distribution.
arXiv Detail & Related papers (2022-08-23T03:56:30Z)
- Image Fusion Transformer [75.71025138448287]
In image fusion, images obtained from different sensors are fused to generate a single image with enhanced information.
In recent years, state-of-the-art methods have adopted Convolutional Neural Networks (CNNs) to encode meaningful features for image fusion.
We propose a novel Image Fusion Transformer (IFT) where we develop a transformer-based multi-scale fusion strategy.
arXiv Detail & Related papers (2021-07-19T16:42:49Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation [71.54816893482457]
We introduce the dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST).
Our models are based on the original Transformer architecture but consist of two decoders, each responsible for one task (ASR or ST).
arXiv Detail & Related papers (2020-11-02T04:59:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.