Hard Non-Monotonic Attention for Character-Level Transduction
- URL: http://arxiv.org/abs/1808.10024v3
- Date: Tue, 20 Feb 2024 15:36:05 GMT
- Title: Hard Non-Monotonic Attention for Character-Level Transduction
- Authors: Shijie Wu, Pamela Shapiro, Ryan Cotterell
- Abstract summary: We introduce an exact, polynomial-time algorithm for marginalizing over the exponential number of non-monotonic alignments between two strings.
We compare soft and hard non-monotonic attention experimentally and find that the exact algorithm significantly improves performance over the stochastic approximation and outperforms soft attention.
- Score: 65.17388794270694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Character-level string-to-string transduction is an important component of
various NLP tasks. The goal is to map an input string to an output string,
where the strings may be of different lengths and have characters taken from
different alphabets. Recent approaches have used sequence-to-sequence models
with an attention mechanism to learn which parts of the input string the model
should focus on during the generation of the output string. Both soft attention
and hard monotonic attention have been used, but hard non-monotonic attention
has only been used in other sequence modeling tasks such as image captioning
(Xu et al., 2015), and has required a stochastic approximation to compute the
gradient. In this work, we introduce an exact, polynomial-time algorithm for
marginalizing over the exponential number of non-monotonic alignments between
two strings, showing that hard attention models can be viewed as neural
reparameterizations of the classical IBM Model 1. We compare soft and hard
non-monotonic attention experimentally and find that the exact algorithm
significantly improves performance over the stochastic approximation and
outperforms soft attention. Code is available at
https://github.com/shijie-wu/neural-transducer.
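The computational point behind the abstract is that, under an IBM Model 1-style assumption that each output position aligns to an input position independently (given the output prefix), the sum over the exponentially many alignments factorizes into a per-position sum, so the marginal likelihood and its exact gradient can be computed in polynomial time. The following is a minimal sketch of that factorized marginalization, not the authors' implementation (see the linked repository for that); the function name, tensor names, and shapes are illustrative assumptions.

```python
# Minimal sketch of exact marginalization over non-monotonic hard alignments,
# assuming each target position aligns to a source position independently
# (the IBM Model 1 factorization). Names and shapes are illustrative.
import torch

def exact_hard_attention_nll(align_logits, emission_log_probs, target):
    """
    align_logits:       (T, S)    unnormalized alignment scores for each
                                  target step t over source positions s
    emission_log_probs: (T, S, V) log p(y_t | alignment to source position s)
    target:             (T,)      gold output character ids
    Returns the exact negative log-likelihood. The sum over all S^T
    alignments factorizes per target step, so the cost is polynomial,
    not exponential.
    """
    # log p(a_t = s): alignment (hard attention) distribution per target step
    log_align = torch.log_softmax(align_logits, dim=-1)            # (T, S)
    # log p(y_t | a_t = s) for the gold character at each target step
    idx = target.view(-1, 1, 1).expand(-1, emission_log_probs.size(1), 1)
    log_emit = emission_log_probs.gather(-1, idx).squeeze(-1)      # (T, S)
    # log sum_s p(a_t = s) p(y_t | a_t = s), then sum over target steps
    log_marginal = torch.logsumexp(log_align + log_emit, dim=-1)   # (T,)
    return -log_marginal.sum()

# Toy usage with random scores (T = target length, S = source length, V = alphabet)
T, S, V = 5, 7, 30
nll = exact_hard_attention_nll(
    torch.randn(T, S),
    torch.log_softmax(torch.randn(T, S, V), dim=-1),
    torch.randint(V, (T,)))
```

Because the marginal log-likelihood has this closed form, backpropagating through it yields exact gradients, which is what removes the need for the stochastic (REINFORCE-style) approximation used in earlier hard-attention work such as Xu et al. (2015).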
Related papers
- Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning [20.51822826798248]
We propose to scale up the attention field by tensorizing long input sequences into compact tensor representations followed by attention on each transformed dimension.
We show that the proposed attention tensorization encodes token dependencies as a multi-hop attention process, and is equivalent to Kronecker decomposition of full attention.
arXiv Detail & Related papers (2024-10-28T11:08:57Z)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
- Toeplitz Neural Network for Sequence Modeling [46.04964190407727]
We show that a Toeplitz matrix-vector product trick can reduce the space-time complexity of sequence modeling to log-linear.
A lightweight sub-network called relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters.
Despite being trained on 512-token sequences, our model can extrapolate to input sequences of up to 14K tokens at inference with consistent performance.
arXiv Detail & Related papers (2023-05-08T14:49:01Z)
- ChordMixer: A Scalable Neural Attention Model for Sequences with Different Lengths [9.205331586765613]
We propose a simple neural network building block called ChordMixer which can model the attention for long sequences with variable lengths.
Repeatedly applying such blocks forms an effective network backbone that mixes the input signals towards the learning targets.
arXiv Detail & Related papers (2022-06-12T22:39:41Z)
- cosFormer: Rethinking Softmax in Attention [60.557869510885205]
Kernel methods are often adopted to reduce complexity by approximating the softmax operator.
Due to approximation errors, their performance varies across tasks and corpora and can drop substantially.
We propose a linear transformer called cosFormer that can achieve comparable or better accuracy to the vanilla transformer.
arXiv Detail & Related papers (2022-02-17T17:53:48Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
- Sparse Sinkhorn Attention [93.88158993722716]
We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend.
We introduce a meta sorting network that learns to generate latent permutations over sequences.
Given sorted sequences, we are then able to compute quasi-global attention with only local windows.
arXiv Detail & Related papers (2020-02-26T04:18:01Z)
- Exact Hard Monotonic Attention for Character-Level Transduction [76.66797368985453]
We show that neural sequence-to-sequence models that use non-monotonic soft attention often outperform popular monotonic models.
We develop a hard attention sequence-to-sequence model that enforces strict monotonicity and learns a latent alignment jointly while learning to transduce.
arXiv Detail & Related papers (2019-05-15T17:51:09Z)