On Biasing Transformer Attention Towards Monotonicity
- URL: http://arxiv.org/abs/2104.03945v1
- Date: Thu, 8 Apr 2021 17:42:05 GMT
- Title: On Biasing Transformer Attention Towards Monotonicity
- Authors: Annette Rios, Chantal Amrhein, Noëmi Aepli, Rico Sennrich
- Abstract summary: We introduce a monotonicity loss function that is compatible with standard attention mechanisms and test it on several sequence-to-sequence tasks.
Experiments show that we can achieve largely monotonic behavior.
General monotonicity does not benefit transformer multihead attention; however, we see isolated improvements when only a subset of heads is biased towards monotonic behavior.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many sequence-to-sequence tasks in natural language processing are roughly
monotonic in the alignment between source and target sequence, and previous
work has facilitated or enforced learning of monotonic attention behavior via
specialized attention functions or pretraining. In this work, we introduce a
monotonicity loss function that is compatible with standard attention
mechanisms and test it on several sequence-to-sequence tasks:
grapheme-to-phoneme conversion, morphological inflection, transliteration, and
dialect normalization. Experiments show that we can achieve largely monotonic
behavior. Performance is mixed, with larger gains on top of RNN baselines.
General monotonicity does not benefit transformer multihead attention;
however, we see isolated improvements when only a subset of heads is biased
towards monotonic behavior.
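The abstract does not spell out the form of the loss, so the following is only a minimal sketch of how such a bias could be added on top of standard attention: it penalizes backward jumps in the expected source position attended to at consecutive target steps, and can be restricted to a subset of heads, as the abstract suggests. The name `monotonicity_loss`, the tensor layout, and the weighting factor `lambda_mono` are illustrative assumptions, not the authors' code.

```python
import torch

def monotonicity_loss(attn: torch.Tensor, heads=None) -> torch.Tensor:
    """Penalize non-monotonic attention (illustrative sketch, not the
    paper's exact formulation).

    attn: softmax-normalized attention weights with shape
          (batch, n_heads, tgt_len, src_len).
    heads: optional list of head indices to bias; all heads if None.
    """
    if heads is not None:
        attn = attn[:, heads]  # bias only a subset of heads
    src_len = attn.size(-1)
    positions = torch.arange(src_len, dtype=attn.dtype, device=attn.device)
    # Expected source position attended to at each target step.
    expected = (attn * positions).sum(dim=-1)  # (batch, heads, tgt_len)
    # Steps between consecutive target positions; negative = backward jump.
    steps = expected[..., 1:] - expected[..., :-1]
    return torch.relu(-steps).mean()

# Hypothetical usage: add the penalty to the usual task loss.
# loss = cross_entropy + lambda_mono * monotonicity_loss(attn, heads=[0, 2])
```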
Related papers
- How to address monotonicity for model risk management?
This paper studies transparent neural networks in the presence of three types of monotonicity: individual monotonicity, weak pairwise monotonicity, and strong pairwise monotonicity.
As a means of achieving monotonicity while maintaining transparency, we propose the monotonic groves of neural additive models.
arXiv Detail & Related papers (2023-04-28T04:21:02Z)
- Constrained Monotonic Neural Networks
Wider adoption of neural networks in many critical domains such as finance and healthcare is being hindered by the need to explain their predictions.
Monotonicity constraint is one of the most requested properties in real-world scenarios.
We show it can approximate any continuous monotone function on a compact subset of $\mathbb{R}^n$.
arXiv Detail & Related papers (2022-05-24T04:26:10Z)
- Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens, based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z)
- Differentiable Subset Pruning of Transformer Heads
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
arXiv Detail & Related papers (2021-08-10T13:08:34Z)
- A study of latent monotonic attention variants
End-to-end models reach state-of-the-art performance for speech recognition, but global soft attention is not monotonic.
We present a mathematically clean way to introduce monotonicity via a new latent variable.
We show that our monotonic models perform as well as the global soft attention model.
arXiv Detail & Related papers (2021-03-30T22:35:56Z)
- Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
This work proposes a new way to understand self-attention networks.
We show that their output can be decomposed into a sum of smaller terms.
We prove that self-attention possesses a strong inductive bias towards "token uniformity".
arXiv Detail & Related papers (2021-03-05T00:39:05Z)
- Counterexample-Guided Learning of Monotonic Neural Networks
We focus on monotonicity constraints, which are common and require that the function's output increases with increasing values of specific input features.
We develop a counterexample-guided technique to provably enforce monotonicity constraints at prediction time.
We also propose a technique to use monotonicity as an inductive bias for deep learning.
arXiv Detail & Related papers (2020-06-16T01:04:26Z)
- Quantum monotone metrics induced from trace non-increasing maps and additive noise
We introduce another extension of quantum monotone metrics which have monotonicity under completely positive, trace non-increasing (CPTNI) maps and additive noise.
We show that our monotone metrics have some natural properties such as additivity of direct sum, convexity and monotonicity with respect to positive operators.
arXiv Detail & Related papers (2020-06-10T09:09:50Z)
- Exact Hard Monotonic Attention for Character-Level Transduction
We show that neural sequence-to-sequence models that use non-monotonic soft attention often outperform popular monotonic models.
We develop a hard attention sequence-to-sequence model that enforces strict monotonicity and learns a latent alignment jointly while learning to transduce; a toy sketch of the strict-monotonicity constraint follows this list.
arXiv Detail & Related papers (2019-05-15T17:51:09Z)
- Hard Non-Monotonic Attention for Character-Level Transduction
We introduce an exact, exponential-time algorithm for marginalizing over the exponential number of non-monotonic alignments between two strings.
We compare soft and hard non-monotonic attention experimentally and find that the exact algorithm significantly improves performance over the approximation and outperforms soft attention.
arXiv Detail & Related papers (2018-08-29T20:00:20Z)
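Several entries above contrast soft attention with hard monotonic attention. As a toy illustration of the strict-monotonicity constraint itself (not the exact algorithm of either character-level transduction paper, which learns or marginalizes over alignments rather than decoding greedily), the following sketch only ever lets the attended source position move forward; `hard_monotonic_alignment` and `scores` are hypothetical names.

```python
import torch

def hard_monotonic_alignment(scores: torch.Tensor) -> list:
    """Greedy hard alignment under a strict monotonicity constraint
    (toy sketch; the cited papers learn or marginalize the alignment).

    scores: (tgt_len, src_len) attention scores for one sequence pair.
    Returns one source index per target step, never moving backwards.
    """
    tgt_len, src_len = scores.shape
    prev = 0
    alignment = []
    for t in range(tgt_len):
        # Only positions at or after the previous one are admissible.
        j = int(scores[t, prev:].argmax().item()) + prev
        alignment.append(j)
        prev = j
    return alignment

# Hypothetical usage with random scores:
# print(hard_monotonic_alignment(torch.randn(5, 8)))
```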
This list is automatically generated from the titles and abstracts of the papers listed on this site.