Hard Non-Monotonic Attention for Character-Level Transduction
- URL: http://arxiv.org/abs/1808.10024v3
- Date: Tue, 20 Feb 2024 15:36:05 GMT
- Title: Hard Non-Monotonic Attention for Character-Level Transduction
- Authors: Shijie Wu, Pamela Shapiro, Ryan Cotterell
- Abstract summary: We introduce an exact, polynomial-time algorithm for marginalizing over the exponential number of non-monotonic alignments between two strings.
We compare soft and hard non-monotonic attention experimentally and find that the exact algorithm significantly improves performance over the stochastic approximation and outperforms soft attention.
- Score: 65.17388794270694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Character-level string-to-string transduction is an important component of
various NLP tasks. The goal is to map an input string to an output string,
where the strings may be of different lengths and have characters taken from
different alphabets. Recent approaches have used sequence-to-sequence models
with an attention mechanism to learn which parts of the input string the model
should focus on during the generation of the output string. Both soft attention
and hard monotonic attention have been used, but hard non-monotonic attention
has only been used in other sequence modeling tasks such as image captioning
(Xu et al., 2015), and has required a stochastic approximation to compute the
gradient. In this work, we introduce an exact, polynomial-time algorithm for
marginalizing over the exponential number of non-monotonic alignments between
two strings, showing that hard attention models can be viewed as neural
reparameterizations of the classical IBM Model 1. We compare soft and hard
non-monotonic attention experimentally and find that the exact algorithm
significantly improves performance over the stochastic approximation and
outperforms soft attention. Code is available at
https://github.com/shijie-wu/neural-transducer.
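The computational point behind the abstract is that, under an IBM Model 1-style assumption that each output position aligns to an input position independently (given the output prefix), the sum over the exponentially many alignments factorizes into a per-position sum, so the marginal likelihood and its exact gradient can be computed in polynomial time. The following is a minimal sketch of that factorized marginalization, not the authors' implementation (see the linked repository for that); the function name, tensor names, and shapes are illustrative assumptions.

```python
# Minimal sketch of exact marginalization over non-monotonic hard alignments,
# assuming each target position aligns to a source position independently
# (the IBM Model 1 factorization). Names and shapes are illustrative.
import torch

def exact_hard_attention_nll(align_logits, emission_log_probs, target):
    """
    align_logits:       (T, S)    unnormalized alignment scores for each
                                  target step t over source positions s
    emission_log_probs: (T, S, V) log p(y_t | alignment to source position s)
    target:             (T,)      gold output character ids
    Returns the exact negative log-likelihood. The sum over all S^T
    alignments factorizes per target step, so the cost is polynomial,
    not exponential.
    """
    # log p(a_t = s): alignment (hard attention) distribution per target step
    log_align = torch.log_softmax(align_logits, dim=-1)            # (T, S)
    # log p(y_t | a_t = s) for the gold character at each target step
    idx = target.view(-1, 1, 1).expand(-1, emission_log_probs.size(1), 1)
    log_emit = emission_log_probs.gather(-1, idx).squeeze(-1)      # (T, S)
    # log sum_s p(a_t = s) p(y_t | a_t = s), then sum over target steps
    log_marginal = torch.logsumexp(log_align + log_emit, dim=-1)   # (T,)
    return -log_marginal.sum()

# Toy usage with random scores (T = target length, S = source length, V = alphabet)
T, S, V = 5, 7, 30
nll = exact_hard_attention_nll(
    torch.randn(T, S),
    torch.log_softmax(torch.randn(T, S, V), dim=-1),
    torch.randint(V, (T,)))
```

Because the marginal log-likelihood has this closed form, backpropagating through it yields exact gradients, which is what removes the need for the stochastic (REINFORCE-style) approximation used in earlier hard-attention work such as Xu et al. (2015).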
Related papers
- Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning [20.51822826798248]
We propose to scale up the attention field by tensorizing long input sequences into compact tensor representations followed by attention on each transformed dimension.
We show that the proposed attention tensorization encodes token dependencies as a multi-hop attention process, and is equivalent to Kronecker decomposition of full attention.
arXiv Detail & Related papers (2024-10-28T11:08:57Z)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
- Toeplitz Neural Network for Sequence Modeling [46.04964190407727]
We show that a Toeplitz matrix-vector product trick can reduce the space-time complexity of sequence modeling to log-linear.
A lightweight sub-network called relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters.
Despite being trained on 512-token sequences, our model can extrapolate to input sequences of up to 14K tokens at inference with consistent performance.
arXiv Detail & Related papers (2023-05-08T14:49:01Z)
- ChordMixer: A Scalable Neural Attention Model for Sequences with Different Lengths [9.205331586765613]
We propose a simple neural network building block called ChordMixer which can model the attention for long sequences with variable lengths.
Repeatedly applying such blocks forms an effective network backbone that mixes the input signals towards the learning targets.
arXiv Detail & Related papers (2022-06-12T22:39:41Z)
- cosFormer: Rethinking Softmax in Attention [60.557869510885205]
Kernel methods are often adopted to reduce complexity by approximating the softmax operator.
Due to approximation errors, their performance varies across tasks and corpora and can drop substantially.
We propose a linear transformer called cosFormer that can achieve comparable or better accuracy to the vanilla transformer.
arXiv Detail & Related papers (2022-02-17T17:53:48Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
- Sparse Sinkhorn Attention [93.88158993722716]
We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend.
We introduce a meta sorting network that learns to generate latent permutations over sequences.
Given sorted sequences, we are then able to compute quasi-global attention with only local windows.
arXiv Detail & Related papers (2020-02-26T04:18:01Z)
- Exact Hard Monotonic Attention for Character-Level Transduction [76.66797368985453]
We show that neural sequence-to-sequence models that use non-monotonic soft attention often outperform popular monotonic models.
We develop a hard attention sequence-to-sequence model that enforces strict monotonicity and learns a latent alignment jointly while learning to transduce.
arXiv Detail & Related papers (2019-05-15T17:51:09Z)