TransfoRNN: Capturing the Sequential Information in Self-Attention
Representations for Language Modeling
- URL: http://arxiv.org/abs/2104.01572v1
- Date: Sun, 4 Apr 2021 09:31:18 GMT
- Title: TransfoRNN: Capturing the Sequential Information in Self-Attention
Representations for Language Modeling
- Authors: Tze Yuang Chong, Xuyang Wang, Lin Yang, Junjie Wang
- Abstract summary: We propose to cascade recurrent neural networks to the Transformers, referred to as the TransfoRNN model, to capture the sequential information.
We found that TransfoRNN models consisting of only a shallow Transformer stack suffice to give comparable, if not better, performance.
- Score: 9.779600950401315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we describe the use of recurrent neural networks to capture
sequential information from the self-attention representations to improve the
Transformers. Although the self-attention mechanism provides a means to exploit
long context, the sequential information, i.e. the arrangement of tokens, is
not explicitly captured. We propose to cascade recurrent neural networks to
the Transformers, referred to as the TransfoRNN model, to capture the
sequential information. We found that TransfoRNN models consisting of only a
shallow Transformer stack suffice to give comparable, if not better,
performance than a deeper Transformer model. Evaluated on the Penn Treebank and
WikiText-2 corpora, the proposed TransfoRNN model showed lower model
perplexities with fewer model parameters. On the Penn Treebank corpus, the
model perplexities were reduced by up to 5.5% with the model size reduced by up
to 10.5%. On the WikiText-2 corpus, the model perplexity was reduced by up to
2.2% with a 27.7% smaller model. The TransfoRNN model was also applied to the
LibriSpeech speech recognition task and showed results comparable to the
Transformer models.
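To make the cascade concrete, here is a minimal sketch in PyTorch: a shallow Transformer encoder followed by an LSTM for next-token prediction. The layer counts, hidden sizes, and the choice of an LSTM as the recurrent component are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TransfoRNNSketch(nn.Module):
    """Minimal sketch of the cascade described in the abstract: a shallow
    Transformer stack produces self-attention representations, and a
    recurrent network (an LSTM here, as an assumption) is cascaded on top
    to re-inject the sequential (token-order) information before the
    output projection."""

    def __init__(self, vocab_size, d_model=512, n_heads=8,
                 n_transformer_layers=2, n_rnn_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, n_transformer_layers)
        self.rnn = nn.LSTM(d_model, d_model, n_rnn_layers, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) token ids
        seq_len = tokens.size(1)
        # causal mask so each position attends only to earlier tokens
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.transformer(self.embed(tokens), mask=mask)
        # the LSTM consumes the self-attention representations in order,
        # explicitly modeling the arrangement of tokens
        h, _ = self.rnn(h)
        return self.out(h)  # next-token logits: (batch, seq_len, vocab_size)

# Example: 4 sequences of 32 tokens over a 10k-word vocabulary
model = TransfoRNNSketch(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (4, 32)))  # -> (4, 32, 10000)
```

Positional encodings are deliberately omitted in this sketch so that the recurrent layer alone carries the token-order information, mirroring the motivation above; in practice they could be kept alongside the RNN.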
Related papers
- Improving Transformer-based Networks With Locality For Automatic Speaker
Verification [40.06788577864032]
Transformer-based architectures have been explored for speaker embedding extraction.
In this study, we enhance the Transformer with locality modeling in two directions.
We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset.
arXiv Detail & Related papers (2023-02-17T01:04:51Z)
- Structured State Space Decoder for Speech Recognition and Synthesis [9.354721572095272]
A structured state space model (S4) has recently been proposed, producing promising results for various long-sequence modeling tasks.
In this study, we applied S4 as a decoder for ASR and text-to-speech tasks by comparing it with the Transformer decoder.
For the ASR task, our experimental results demonstrate that the proposed model achieves a competitive word error rate (WER) of 1.88%/4.25%.
arXiv Detail & Related papers (2022-10-31T06:54:23Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
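As a rough illustration of this clustering idea (not the authors' implementation), the sketch below runs a few k-means steps over the keys, averages keys and values within each cluster, and lets the queries attend over the reduced set of cluster tokens; the cluster count and the use of plain k-means are assumptions.

```python
import torch
import torch.nn.functional as F

def clustered_attention(q, k, v, num_clusters=16, kmeans_iters=5):
    """Toy content-based sparse attention: cluster the keys, average keys
    and values within each cluster, then let the queries attend over the
    num_clusters aggregated tokens instead of all n original tokens."""
    n, d = k.shape
    centroids = k[torch.randperm(n)[:num_clusters]].clone()
    for _ in range(kmeans_iters):
        assign = torch.cdist(k, centroids).argmin(dim=1)  # nearest centroid per key
        for c in range(num_clusters):
            members = assign == c
            if members.any():
                centroids[c] = k[members].mean(dim=0)
    # aggregate keys/values per cluster (empty clusters fall back to centroid / zeros)
    k_c = torch.stack([k[assign == c].mean(0) if (assign == c).any() else centroids[c]
                       for c in range(num_clusters)])
    v_c = torch.stack([v[assign == c].mean(0) if (assign == c).any() else torch.zeros(d)
                       for c in range(num_clusters)])
    weights = F.softmax(q @ k_c.t() / d ** 0.5, dim=-1)   # (m, num_clusters)
    return weights @ v_c                                  # (m, d)

# Example: 8 queries attending over 256 key/value tokens via 16 cluster tokens
q, k, v = torch.randn(8, 64), torch.randn(256, 64), torch.randn(256, 64)
out = clustered_attention(q, k, v)  # -> (8, 64)
```

The attention product then scales with the number of clusters rather than the full token count, which is the source of the lower computational cost.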
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
arXiv Detail & Related papers (2020-12-08T00:16:13Z)
- Attention is All You Need in Speech Separation [12.57578429586883]
We propose a novel RNN-free Transformer-based neural network for speech separation.
The proposed model achieves state-of-the-art (SOTA) performance on the standard WSJ0-2/3mix datasets.
arXiv Detail & Related papers (2020-10-25T16:28:54Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/testother.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
- End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
arXiv Detail & Related papers (2020-02-10T16:29:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.