Hierarchical Attention Transformer Architecture For Syntactic Spell Correction
- URL: http://arxiv.org/abs/2005.04876v1
- Date: Mon, 11 May 2020 06:19:01 GMT
- Title: Hierarchical Attention Transformer Architecture For Syntactic Spell Correction
- Authors: Abhishek Niranjan, M Ali Basha Shaik, Kushal Verma
- Abstract summary: We propose a multi-encoder, single-decoder variation of the conventional transformer.
We report significant improvements of 0.11%, 0.32% and 0.69% in character (CER), word (WER) and sentence (SER) error rates.
Our architecture also trains about 7.8 times faster and is only about one third the size of the next most accurate model.
- Score: 1.0312968200748118
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Attention mechanisms are playing a central role in recent advances on
sequence-to-sequence problems. The Transformer architecture achieved new
state-of-the-art results in machine translation, and its variants have since been
introduced in several other sequence-to-sequence problems. Problems that involve a
shared vocabulary can benefit from the similar semantic and syntactic structure of the
source and target sentences. With the motivation of building a reliable and fast
post-processing textual module to assist text-related use cases on mobile phones, we
take on the popular spell correction problem. In this paper, we propose a
multi-encoder, single-decoder variation of the conventional transformer. Outputs from
three encoders, fed with character-level 1-gram, 2-gram and 3-gram inputs, are attended
to in a hierarchical fashion in the decoder. The context vectors from the encoders,
combined with self-attention, amplify the n-gram properties at the character level and
help in accurate decoding. We demonstrate our model on a spell correction dataset from
Samsung Research, and report significant improvements of 0.11%, 0.32% and 0.69% in
character (CER), word (WER) and sentence (SER) error rates over existing
state-of-the-art machine-translation architectures. Our architecture also trains ~7.8
times faster and is only about one third the size of the next most accurate model.
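As a concrete illustration of the hierarchical multi-encoder attention described above, here is a minimal PyTorch sketch, assuming one plausible reading of the abstract: three character-level encoders (1-gram, 2-gram and 3-gram inputs) whose outputs are cross-attended one after another inside each decoder layer. Class names, dimensions and the exact ordering of the attention sub-layers are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed reading, not the authors' code): a decoder layer that
# attends over three encoder memories (char 1-gram, 2-gram and 3-gram encoders)
# sequentially, on top of the usual masked self-attention.
import torch
import torch.nn as nn

def char_ngrams(text, n):
    # Assumed input preparation: e.g. char_ngrams("cats", 2) -> ["ca", "at", "ts"]
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class HierarchicalDecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                               batch_first=True)
        # One cross-attention block per encoder (1-gram, 2-gram, 3-gram).
        self.cross_attns = nn.ModuleList([
            nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
            for _ in range(3)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(5)])
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Dropout(dropout), nn.Linear(d_ff, d_model))

    def forward(self, tgt, memories, tgt_mask=None):
        # tgt: [batch, tgt_len, d_model]; memories: list of three encoder outputs.
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)
        x = self.norms[0](tgt + x)
        # Hierarchical (sequential) cross-attention over the n-gram encoder outputs.
        for i, (attn, mem) in enumerate(zip(self.cross_attns, memories)):
            y, _ = attn(x, mem, mem)
            x = self.norms[1 + i](x + y)
        return self.norms[4](x + self.ffn(x))
```

A full model would stack several such layers, with encoder i fed the character i-gram sequence of the noisy input; how the paper actually combines the three context vectors may differ from this sequential residual scheme.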
Related papers
- Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims to generate the target sequence based on a given input source sequence.
Traditionally, most seq2seq tasks are solved with an encoder that encodes the source sequence and a decoder that generates the target text.
Recently, a number of new approaches have emerged that apply decoder-only language models directly to the seq2seq task.
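For readers unfamiliar with the distinction, here is a small hypothetical sketch of how the two formulations consume a seq2seq example; the tokens and the <sep> marker are placeholders, not any particular model's vocabulary.

```python
# Hypothetical illustration of the two seq2seq formulations discussed above.
source = ["corect", "speling", "pleese"]      # noisy input tokens
target = ["correct", "spelling", "please"]    # desired output tokens

# Encoder-decoder: the encoder reads the source; the decoder generates the target
# while cross-attending to the encoder states.
encoder_input = source
decoder_input = ["<bos>"] + target
decoder_labels = target + ["<eos>"]

# Decoder-only: source and target are concatenated into a single sequence; a causal
# LM reads it left to right and the loss is applied only to the target positions
# (next-token shifting omitted for brevity).
lm_input = source + ["<sep>"] + target + ["<eos>"]
loss_mask = [0] * (len(source) + 1) + [1] * (len(target) + 1)

print(lm_input)   # ['corect', 'speling', 'pleese', '<sep>', 'correct', 'spelling', 'please', '<eos>']
print(loss_mask)  # [0, 0, 0, 0, 1, 1, 1, 1]
```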
arXiv Detail & Related papers (2023-04-08T15:44:29Z)
- TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
arXiv Detail & Related papers (2021-11-16T09:10:39Z)
- Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems [38.672160430296536]
Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization.
Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder.
This work focuses on the transformer's encoder-decoder attention mechanism.
arXiv Detail & Related papers (2021-09-08T19:32:42Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
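A rough sketch, under my own assumptions, of the recipe this summary describes: a frozen pretrained encoder, a small trainable bottleneck that pools its states into one sentence vector, and a single-layer transformer decoder trained to reconstruct the clean sentence from a corrupted input. The frozen encoder below is a stand-in module, not the specific pretrained LM used in the paper, and the sizes are placeholders.

```python
# Sketch only: frozen encoder + trainable sentence bottleneck + 1-layer decoder,
# trained with a denoising reconstruction objective.
import torch
import torch.nn as nn

class SentenceBottleneckAE(nn.Module):
    def __init__(self, frozen_encoder, d_model=768, vocab_size=32000):
        super().__init__()
        self.encoder = frozen_encoder                   # pretrained LM, kept frozen
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.pool = nn.Linear(d_model, d_model)         # trainable sentence bottleneck
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, noisy_embeds, tgt_embeds, tgt_mask=None):
        # noisy_embeds: embeddings of the corrupted sentence, [batch, src_len, d_model].
        with torch.no_grad():
            states = self.encoder(noisy_embeds)
        sent = torch.tanh(self.pool(states.mean(dim=1, keepdim=True)))  # [batch, 1, d]
        out = self.decoder(tgt_embeds, memory=sent, tgt_mask=tgt_mask)
        return self.lm_head(out)   # logits for reconstructing the clean sentence
```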
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- Contextual Transformer Networks for Visual Recognition [103.79062359677452]
We design a novel Transformer-style module, i.e., Contextual Transformer (CoT) block, for visual recognition.
Such a design fully capitalizes on the contextual information among input keys to guide the learning of the dynamic attention matrix.
Our CoT block is appealing in that it can readily replace each $3\times3$ convolution in ResNet architectures.
arXiv Detail & Related papers (2021-07-26T16:00:21Z)
- Reinforcement Learning for on-line Sequence Transformation [0.0]
We introduce an architecture that learns, via reinforcement, to decide whether to read an input token or to write an output token.
In an experimental study we compare it with state-of-the-art methods for neural machine translation.
arXiv Detail & Related papers (2021-05-28T20:31:25Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
- Efficient Wait-k Models for Simultaneous Machine Translation [46.01342928010307]
Simultaneous machine translation consists of starting output generation before the entire input sequence is available.
Wait-k decoders offer a simple but efficient approach to this problem.
We investigate the behavior of wait-k decoding in low-resource settings for spoken corpora using IWSLT datasets.
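Wait-k decoding follows a simple schedule: read the first k source tokens, then alternate between writing one target token and reading one more source token, and once the source is exhausted, write the remaining target tokens. A small sketch of that schedule (function name and action labels are mine):

```python
def wait_k_schedule(k, src_len, tgt_len):
    """Yield a READ/WRITE action sequence for wait-k simultaneous decoding.

    The decoder first READs k source tokens, then alternates WRITE/READ;
    once the source is exhausted, it WRITEs the remaining target tokens.
    """
    read, written = 0, 0
    while written < tgt_len:
        # Keep the source k tokens ahead of the target while source remains.
        while read < min(written + k, src_len):
            yield "READ"
            read += 1
        yield "WRITE"
        written += 1

# Example: k=2 with 5 source tokens and 5 target tokens.
print(list(wait_k_schedule(2, src_len=5, tgt_len=5)))
# ['READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE', 'WRITE']
```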
arXiv Detail & Related papers (2020-05-18T11:14:23Z)
- On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take the Transformer as the testbed and introduce a layer of gates between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing $L_0$ penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks.
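The summary describes a gate layer between encoder and decoder whose expected L0 norm is penalized. A simplified sketch of that idea, using a plain sigmoid/concrete relaxation; the gate parameterization, hyper-parameters and names are my assumptions, not the paper's exact formulation:

```python
# Simplified sketch: per-position gates on encoder outputs, trained with an
# expected-L0-style penalty so that many source positions are shut off.
import torch
import torch.nn as nn

class EncoderOutputGate(nn.Module):
    def __init__(self, d_model=512, temperature=0.5):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)   # per-position gate logit
        self.temperature = temperature

    def forward(self, enc_out):
        # enc_out: [batch, src_len, d_model]
        logits = self.scorer(enc_out).squeeze(-1)            # [batch, src_len]
        if self.training:
            # Gumbel-sigmoid relaxation keeps the gates differentiable.
            u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log(1 - u)
            gate = torch.sigmoid((logits + noise) / self.temperature)
        else:
            gate = (logits > 0).float()                       # hard gates at test time
        # Expected number of open gates, used as the sparsity penalty.
        expected_l0 = torch.sigmoid(logits).sum(dim=-1).mean()
        return enc_out * gate.unsqueeze(-1), expected_l0

# Training loss would combine the usual NLL with lambda * expected_l0.
```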
arXiv Detail & Related papers (2020-04-24T16:57:52Z)
- Consistent Multiple Sequence Decoding [36.46573114422263]
We introduce a consistent multiple sequence decoding architecture.
This architecture allows for consistent and simultaneous decoding of an arbitrary number of sequences.
We show the efficacy of our consistent multiple sequence decoder on the task of dense relational image captioning.
arXiv Detail & Related papers (2020-04-02T00:43:54Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
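Fixed attention patterns in this line of work are simple positional heuristics, such as attending to the current, previous or next token. A hypothetical sketch of how one such non-learnable head can be built; the particular offsets here are illustrative and not necessarily the patterns used in the paper:

```python
# Sketch: a fixed, non-learnable attention "head" defined by a positional offset,
# e.g. offset=-1 attends to the previous token, offset=0 to the token itself.
import torch

def fixed_offset_attention(values, offset):
    """values: [batch, seq_len, d]; returns, per position i, the value at i+offset."""
    batch, seq_len, _ = values.shape
    pattern = torch.zeros(seq_len, seq_len)
    for i in range(seq_len):
        j = min(max(i + offset, 0), seq_len - 1)   # clamp at sentence boundaries
        pattern[i, j] = 1.0                        # one-hot attention row, no parameters
    return pattern @ values                        # broadcast matmul -> [batch, seq_len, d]

x = torch.randn(2, 6, 8)
prev_tok = fixed_offset_attention(x, offset=-1)    # each position copies its left neighbor
next_tok = fixed_offset_attention(x, offset=+1)    # ... or its right neighbor
```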