Syntax-guided Localized Self-attention by Constituency Syntactic
Distance
- URL: http://arxiv.org/abs/2210.11759v1
- Date: Fri, 21 Oct 2022 06:37:25 GMT
- Title: Syntax-guided Localized Self-attention by Constituency Syntactic
Distance
- Authors: Shengyuan Hou, Jushi Kai, Haotian Xue, Bingyu Zhu, Bo Yuan, Longtao
Huang, Xinbing Wang and Zhouhan Lin
- Abstract summary: We propose a syntax-guided localized self-attention for Transformer.
It allows directly incorporating grammar structures from an external constituency parser.
Experimental results show that our model could consistently improve translation performance.
- Score: 26.141356981833862
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent works have revealed that Transformers implicitly learn
syntactic information in their lower layers from data, although this learning is
highly dependent on the quality and scale of the training data. However,
learning syntactic information from data is not necessary if we can leverage an
external syntactic parser, which provides better parsing quality with
well-defined syntactic structures. This could potentially improve Transformer's
performance and sample efficiency. In this work, we propose a syntax-guided
localized self-attention for Transformer that allows directly incorporating
grammar structures from an external constituency parser. It prevents the
attention mechanism from overweighting grammatically distant tokens over close
ones. Experimental results show that our model consistently improves
translation performance on a variety of machine translation datasets, ranging
from small to large dataset sizes, and with different source languages.
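Below is a minimal sketch of one way such syntax-guided localization could look in practice, assuming pairwise constituency-tree distances from an external parser are used to mask the attention logits; the function name, tensor shapes, and the max_distance threshold are illustrative assumptions rather than the paper's released implementation.
```python
# Minimal sketch (not the authors' code): mask attention logits with
# pairwise constituency-tree distances so that grammatically distant
# tokens cannot be overweighted. `max_distance` is an assumed threshold.
import torch
import torch.nn.functional as F

def syntax_guided_attention(q, k, v, syntactic_distance, max_distance=3):
    """
    q, k, v: (batch, heads, seq_len, d_head) projections of the input.
    syntactic_distance: (batch, seq_len, seq_len) pairwise distances between
        tokens in the constituency tree, produced by an external parser.
    max_distance: token pairs farther apart than this in the tree are masked.
    """
    d_head = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_head ** 0.5

    # Locality mask derived from the parse tree, broadcast over heads:
    # True where attention is permitted.
    allowed = (syntactic_distance <= max_distance).unsqueeze(1)
    scores = scores.masked_fill(~allowed, float("-inf"))

    weights = F.softmax(scores, dim=-1)  # each token can always attend to itself (distance 0)
    return torch.matmul(weights, v)
```
This sketch uses a hard mask; a soft, distance-dependent penalty added to the logits would be an equally plausible reading of "prevents the attention mechanism from overweighting grammatically distant tokens".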
Related papers
- Structural Biases for Improving Transformers on Translation into
Morphologically Rich Languages [120.74406230847904]
TP-Transformer augments the traditional Transformer architecture to include an additional component to represent structure.
The second method imbues structure at the data level by segmenting the data with morphological tokenization.
We find that each of these two approaches allows the network to achieve better performance, but this improvement is dependent on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale [31.293175512404172]
We introduce Transformer Grammars -- a class of Transformer language models that combine the expressive power, scalability, and strong performance of Transformers with syntactic inductive biases.
We find that Transformer Grammars outperform various strong baselines on multiple syntax-sensitive language modeling evaluation metrics.
arXiv Detail & Related papers (2022-03-01T17:22:31Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
- Adapting Pretrained Transformer to Lattices for Spoken Language Understanding [39.50831917042577]
It has been shown that encoding lattices, as opposed to the 1-best results generated by an automatic speech recognizer (ASR), boosts the performance of spoken language understanding (SLU).
This paper aims at adapting pretrained transformers to lattice inputs in order to perform understanding tasks specifically for spoken language.
arXiv Detail & Related papers (2020-11-02T07:14:34Z)
- Learning Source Phrase Representations for Neural Machine Translation [65.94387047871648]
We propose an attentive phrase representation generation mechanism which is able to generate phrase representations from corresponding token representations.
In our experiments, we obtain significant improvements on the WMT 14 English-German and English-French tasks on top of the strong Transformer baseline.
arXiv Detail & Related papers (2020-06-25T13:43:11Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
- Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set a sentence-specific probability for word selection by considering each word's role in the sentence.
Our proposed method is evaluated on WMT14 English-to-German dataset and IWSLT14 German-to-English dataset.
arXiv Detail & Related papers (2020-04-29T13:45:30Z)
- Stacked DeBERT: All Attention in Incomplete Data for Text Classification [8.900866276512364]
We propose Stacked DeBERT, short for Stacked Denoising Bidirectional Representations from Transformers.
Our model shows improved F1-scores and better robustness on informal/incorrect text from tweets and on text with Speech-to-Text errors, in sentiment and intent classification tasks.
arXiv Detail & Related papers (2020-01-01T04:49:23Z)