TRANS-BLSTM: Transformer with Bidirectional LSTM for Language
Understanding
- URL: http://arxiv.org/abs/2003.07000v1
- Date: Mon, 16 Mar 2020 03:38:51 GMT
- Title: TRANS-BLSTM: Transformer with Bidirectional LSTM for Language
Understanding
- Authors: Zhiheng Huang, Peng Xu, Davis Liang, Ajay Mishra, Bing Xiang
- Abstract summary: Bidirectional Encoder Representations from Transformers (BERT) has recently achieved state-of-the-art performance on a broad range of NLP tasks.
We propose a new architecture, denoted Transformer with BLSTM (TRANS-BLSTM), which has a BLSTM layer integrated into each transformer block.
We show that TRANS-BLSTM models consistently lead to improvements in accuracy compared to BERT baselines in GLUE and SQuAD 1.1 experiments.
- Score: 18.526060699574142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bidirectional Encoder Representations from Transformers (BERT) has recently
achieved state-of-the-art performance on a broad range of NLP tasks including
sentence classification, machine translation, and question answering. The BERT
model architecture is derived primarily from the transformer. Prior to the
transformer era, bidirectional Long Short-Term Memory (BLSTM) has been the
dominant modeling architecture for neural machine translation and question
answering. In this paper, we investigate how these two modeling techniques can
be combined to create a more powerful model architecture. We propose a new
architecture denoted as Transformer with BLSTM (TRANS-BLSTM) which has a BLSTM
layer integrated into each transformer block, leading to a joint modeling
framework for transformer and BLSTM. We show that TRANS-BLSTM models
consistently lead to improvements in accuracy compared to BERT baselines in
GLUE and SQuAD 1.1 experiments. Our TRANS-BLSTM model obtains an F1 score of
94.01% on the SQuAD 1.1 development dataset, which is comparable to the
state-of-the-art result.
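The abstract specifies only that a BLSTM layer is integrated into each transformer block. The minimal PyTorch sketch below shows one plausible integration (a residual BLSTM applied after the standard self-attention and feed-forward sub-layers); the placement, dimensions, and class name are assumptions for illustration, not the authors' reference implementation.

```python
# Minimal sketch, assuming one plausible placement of the BLSTM; NOT the
# paper's reference code. A standard post-norm transformer encoder block is
# followed by a residual bidirectional LSTM whose concatenated forward/backward
# states match the model width.
import torch
import torch.nn as nn

class TransBLSTMBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        # Bidirectional LSTM: hidden size d_model // 2 per direction, so the
        # concatenated output is d_model-wide and can be added residually.
        self.blstm = nn.LSTM(d_model, d_model // 2, batch_first=True,
                             bidirectional=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))      # self-attention sub-layer
        x = self.norm2(x + self.dropout(self.ffn(x)))   # feed-forward sub-layer
        blstm_out, _ = self.blstm(x)                    # assumed BLSTM placement
        return self.norm3(x + self.dropout(blstm_out))

if __name__ == "__main__":
    block = TransBLSTMBlock()
    tokens = torch.randn(2, 16, 768)   # (batch, sequence, hidden)
    print(block(tokens).shape)         # torch.Size([2, 16, 768])
```

Using a per-direction hidden size of d_model // 2 keeps the BLSTM output width equal to the transformer hidden size, so no extra projection is needed for the residual connection.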
Related papers
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs)
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z) - xLSTM: Extended Long Short-Term Memory [26.607656211983155]
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM).
We introduce exponential gating with appropriate normalization and stabilization techniques.
We modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule.
arXiv Detail & Related papers (2024-05-07T17:50:21Z) - Transformers versus LSTMs for electronic trading [0.0]
This study investigates whether Transformer-based models can be applied to financial time series prediction and outperform LSTMs.
A new LSTM-based model called DLSTM is built, and a new architecture for the Transformer-based model is designed to adapt it to financial prediction.
The experimental results show that the Transformer-based model has only a limited advantage in absolute price sequence prediction.
arXiv Detail & Related papers (2023-09-20T15:25:43Z) - Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - Efficient GPT Model Pre-training using Tensor Train Matrix
Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure.
The resulting GPT-based model stores up to 40% fewer parameters, showing perplexity comparable to the original model.
arXiv Detail & Related papers (2023-06-05T08:38:25Z) - Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z) - Learning Bounded Context-Free-Grammar via LSTM and the
Transformer:Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z) - Rewiring the Transformer with Depth-Wise LSTMs [55.50278212605607]
We present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers.
Experiments with the 6-layer Transformer show significant BLEU improvements in both WMT 14 English-German / French tasks and the OPUS-100 many-to-many multilingual NMT task.
arXiv Detail & Related papers (2020-07-13T09:19:34Z) - Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR
in Transfer Learning [37.55706646713447]
We propose a hybrid Transformer-LSTM based architecture to improve low-resource end-to-end ASR.
We conduct experiments on our in-house Malay corpus which contains limited labeled data and a large amount of extra text.
Overall, our best model outperforms the vanilla Transformer ASR by 11.9% relative WER.
arXiv Detail & Related papers (2020-05-21T00:56:42Z) - Finnish Language Modeling with Deep Transformer Models [10.321630075961465]
We investigate the performance of the Transformer-BERT and Transformer-XL for the language modeling task.
BERT achieves a pseudo-perplexity score of 14.5, which, as far as we know, is the first such measure reported; a sketch of pseudo-perplexity scoring appears after this list.
Transformer-XL improves the perplexity to 73.58, which is 27% better than the LSTM model.
arXiv Detail & Related papers (2020-03-14T15:12:03Z)
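As referenced in the Finnish language-modeling entry above, pseudo-perplexity is the masked-LM analogue of perplexity: each token is masked in turn, its probability under the model is recorded, and the exponentiated negative mean log-probability is reported. The sketch below uses the Hugging Face transformers API; the checkpoint name and example sentence are placeholders, not the models or data evaluated in that paper.

```python
# Hedged sketch of pseudo-perplexity (PPPL) scoring for a masked language model.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def pseudo_perplexity(model, tokenizer, sentence: str) -> float:
    """Mask each token in turn and accumulate its log-probability under the MLM."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    total_log_prob, n_scored = 0.0, 0
    for i in range(1, input_ids.size(0) - 1):        # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total_log_prob += log_probs[input_ids[i]].item()
        n_scored += 1
    # Pseudo-perplexity: exp of the negative mean pseudo-log-likelihood.
    return math.exp(-total_log_prob / max(n_scored, 1))

if __name__ == "__main__":
    name = "bert-base-multilingual-cased"            # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name).eval()
    print(pseudo_perplexity(model, tokenizer, "Helsinki on Suomen pääkaupunki."))
```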