Improving BERT Pretraining with Syntactic Supervision
- URL: http://arxiv.org/abs/2104.10516v1
- Date: Wed, 21 Apr 2021 13:15:58 GMT
- Title: Improving BERT Pretraining with Syntactic Supervision
- Authors: Giorgos Tziafas, Konstantinos Kogkalidis, Gijs Wijnholds, Michael Moortgat
- Abstract summary: Bidirectional masked Transformers have become the core theme in the current NLP landscape.
We apply our methodology on Lassy Large, an automatically annotated corpus of written Dutch.
Our experiments suggest that our syntax-aware model performs on par with established baselines.
- Score: 2.4087148947930634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bidirectional masked Transformers have become the core theme in the current
NLP landscape. Despite their impressive benchmarks, a recurring theme in recent
research has been to question such models' capacity for syntactic
generalization. In this work, we seek to address this question by adding a
supervised, token-level supertagging objective to standard unsupervised
pretraining, enabling the explicit incorporation of syntactic biases into the
network's training dynamics. Our approach is straightforward to implement,
induces a marginal computational overhead and is general enough to adapt to a
variety of settings. We apply our methodology on Lassy Large, an automatically
annotated corpus of written Dutch. Our experiments suggest that our
syntax-aware model performs on par with established baselines, despite Lassy
Large being one order of magnitude smaller than commonly used corpora.
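The supervised signal the abstract refers to can be pictured as a second, token-level classification head sitting on top of a BERT-style encoder, whose supertagging loss is added to the usual masked-language-modelling loss. Below is a minimal PyTorch sketch of that idea; the class and function names, the hyperparameters, and the 0.5 tag-loss weight are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch: masked language modelling plus a supervised supertagging head.
# All names, sizes and the tag-loss weight are assumptions for illustration only.
import torch
import torch.nn as nn


class SyntaxAwareEncoder(nn.Module):
    def __init__(self, vocab_size, num_supertags, d_model=256, n_heads=4,
                 n_layers=4, max_len=512):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)      # recovers masked tokens
        self.tag_head = nn.Linear(d_model, num_supertags)   # one supertag per token

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        hidden = self.encoder(self.tok_embed(token_ids) + self.pos_embed(positions))
        return self.mlm_head(hidden), self.tag_head(hidden)


def joint_loss(model, token_ids, mlm_labels, supertag_labels, tag_weight=0.5):
    """Cross-entropy on masked positions plus a weighted supertagging term.
    Positions labelled -100 are ignored by both losses."""
    mlm_logits, tag_logits = model(token_ids)
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    loss_mlm = ce(mlm_logits.transpose(1, 2), mlm_labels)
    loss_tag = ce(tag_logits.transpose(1, 2), supertag_labels)
    return loss_mlm + tag_weight * loss_tag
```

In this sketch, a masked batch together with its automatically derived supertag labels (Lassy Large supplies the syntactic annotation in the paper's setting) would be passed to joint_loss and optimized exactly as in ordinary BERT pretraining, which is why the overhead over plain masked language modelling stays small.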
Related papers
- Learning and Transferring Sparse Contextual Bigrams with Linear Transformers [47.37256334633102]
We introduce the Sparse Contextual Bigram (SCB) model, where the next token's generation depends on a sparse set of earlier positions determined by the last token.
We analyze the training dynamics and sample complexity of learning SCB using a one-layer linear transformer with a gradient-based algorithm.
We prove that, provided a nontrivial correlation between the downstream and pretraining tasks, finetuning from a pretrained model allows us to bypass the initial sample-intensive stage.
arXiv Detail & Related papers (2024-10-30T20:29:10Z)
- Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes.
Inspired by recently successful prompting techniques, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts.
We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z)
- Accurate Neural Network Pruning Requires Rethinking Sparse Optimization [87.90654868505518]
We show the impact of high sparsity on model training using the standard computer vision and natural language processing sparsity benchmarks.
We provide new approaches for mitigating this issue for both sparse pre-training of vision models and sparse fine-tuning of language models.
arXiv Detail & Related papers (2023-08-03T21:49:14Z)
- Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
- Position Prediction as an Effective Pretraining Strategy [20.925906203643883]
We propose a novel but surprisingly simple alternative to content reconstruction: predicting locations from content, without providing positional information for it.
Our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods.
arXiv Detail & Related papers (2022-07-15T17:10:48Z)
- Compositional generalization in semantic parsing with pretrained transformers [13.198689566654108]
We show that language models pretrained exclusively with non-English corpora, or even with programming language corpora, significantly improve out-of-distribution generalization.
We also show that larger models are harder to train from scratch and their generalization accuracy is lower when trained up to convergence.
arXiv Detail & Related papers (2021-09-30T13:06:29Z)
- Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training a sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.