Pre-Training a Graph Recurrent Network for Language Representation
- URL: http://arxiv.org/abs/2209.03834v1
- Date: Thu, 8 Sep 2022 14:12:15 GMT
- Title: Pre-Training a Graph Recurrent Network for Language Representation
- Authors: Yile Wang, Linyi Yang, Zhiyang Teng, Ming Zhou, Yue Zhang
- Abstract summary: We consider a graph recurrent network for language model pre-training, which builds a graph structure for each sequence with local token-level communications.
We find that our model can generate more diverse outputs with less contextualized feature redundancy than existing attention-based models.
- Score: 34.4554387894105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based pre-trained models have gained much advance in recent
years, becoming one of the most important backbones in natural language
processing. Recent work shows that the attention mechanism inside Transformer
may not be necessary, both convolutional neural networks and multi-layer
perceptron based models have also been investigated as Transformer
alternatives. In this paper, we consider a graph recurrent network for language
model pre-training, which builds a graph structure for each sequence with local
token-level communications, together with a sentence-level representation
decoupled from other tokens. The original model performs well in
domain-specific text classification under supervised training, however, its
potential in learning transfer knowledge by self-supervised way has not been
fully exploited. We fill this gap by optimizing the architecture and verifying
its effectiveness in more general language understanding tasks, for both
English and Chinese languages. As for model efficiency, instead of the
quadratic complexity in Transformer-based models, our model has linear
complexity and performs more efficiently during inference. Moreover, we find
that our model can generate more diverse outputs with less contextualized
feature redundancy than existing attention-based models.
Related papers
- Linear Transformers with Learnable Kernel Functions are Better In-Context Models [3.3865605512957453]
We present an elegant alteration to the Based kernel that amplifies its In-Context Learning abilities.
In our work, we present a singular, elegant alteration to the Based kernel that amplifies its In-Context Learning abilities evaluated with the Multi-Query Associative Recall task.
arXiv Detail & Related papers (2024-02-16T12:44:15Z) - In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study ICL through the lens of a new family of model problems we term in context language learning (ICLL)
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z) - N-Grammer: Augmenting Transformers with latent n-grams [35.39961549040385]
We propose a simple yet effective modification to the Transformer architecture inspired by the literature in statistical language modeling, by augmenting the model with n-grams that are constructed from a discrete latent representation of the text sequence.
We evaluate our model, the N-Grammer on language modeling on the C4 data-set as well as text classification on the SuperGLUE data-set, and find that it outperforms several strong baselines such as the Transformer and the Primer.
arXiv Detail & Related papers (2022-07-13T17:18:02Z) - Dependency-based Mixture Language Models [53.152011258252315]
We introduce the Dependency-based Mixture Language Models.
In detail, we first train neural language models with a novel dependency modeling objective.
We then formulate the next-token probability by mixing the previous dependency modeling probability distributions with self-attention.
arXiv Detail & Related papers (2022-03-19T06:28:30Z) - GroupBERT: Enhanced Transformer Architecture with Efficient Grouped
Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z) - Revisiting Simple Neural Probabilistic Language Models [27.957834093475686]
This paper revisits the neural probabilistic language model (NPLM) ofcitetBengio2003ANP.
When scaled up to modern hardware, this model performs much better than expected on word-level language model benchmarks.
Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM's local concatenation layer.
arXiv Detail & Related papers (2021-04-08T02:18:47Z) - GTAE: Graph-Transformer based Auto-Encoders for Linguistic-Constrained
Text Style Transfer [119.70961704127157]
Non-parallel text style transfer has attracted increasing research interests in recent years.
Current approaches still lack the ability to preserve the content and even logic of original sentences.
We propose a method called Graph Transformer based Auto-GTAE, which models a sentence as a linguistic graph and performs feature extraction and style transfer at the graph level.
arXiv Detail & Related papers (2021-02-01T11:08:45Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Learning Source Phrase Representations for Neural Machine Translation [65.94387047871648]
We propose an attentive phrase representation generation mechanism which is able to generate phrase representations from corresponding token representations.
In our experiments, we obtain significant improvements on the WMT 14 English-German and English-French tasks on top of the strong Transformer baseline.
arXiv Detail & Related papers (2020-06-25T13:43:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.