Syntactic Structure Distillation Pretraining For Bidirectional Encoders
- URL: http://arxiv.org/abs/2005.13482v1
- Date: Wed, 27 May 2020 16:44:01 GMT
- Title: Syntactic Structure Distillation Pretraining For Bidirectional Encoders
- Authors: Adhiguna Kuncoro, Lingpeng Kong, Daniel Fried, Dani Yogatama, Laura
Rimell, Chris Dyer, Phil Blunsom
- Abstract summary: We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
- Score: 49.483357228441434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Textual representation learners trained on large amounts of data have
achieved notable success on downstream tasks; intriguingly, they have also
performed well on challenging tests of syntactic competence. Given this
success, it remains an open question whether scalable learners like BERT can
become fully proficient in the syntax of natural language by virtue of data
scale alone, or whether they still benefit from more explicit syntactic biases.
To answer this question, we introduce a knowledge distillation strategy for
injecting syntactic biases into BERT pretraining, by distilling the
syntactically informative predictions of a hierarchical (albeit harder to
scale) syntactic language model. Since BERT models masked words in
bidirectional context, we propose to distill the approximate marginal
distribution over words in context from the syntactic LM. Our approach reduces
relative error by 2-21% on a diverse set of structured prediction tasks,
although we obtain mixed results on the GLUE benchmark. Our findings
demonstrate the benefits of syntactic biases, even in representation learners
that exploit large amounts of data, and contribute to a better understanding of
where syntactic biases are most helpful in benchmarks of natural language
understanding.
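As a rough illustration of the objective described above, the sketch below combines the standard masked-LM cross-entropy with a soft cross-entropy (KL) term against the syntactic LM's approximate marginal over each masked position. This is a minimal PyTorch-style sketch under stated assumptions: the teacher marginals are treated as precomputed tensors, and the mixing weight `alpha` and the exact combination of the two terms are illustrative placeholders rather than the authors' published recipe.
```python
import torch
import torch.nn.functional as F


def distill_mlm_loss(student_logits, teacher_probs, gold_ids, alpha=0.5):
    """Interpolate the usual MLM loss with a word-level distillation term.

    student_logits: (num_masked, vocab) raw scores from the BERT student.
    teacher_probs:  (num_masked, vocab) approximate marginal distributions
                    over each masked word, precomputed from the syntactic LM.
    gold_ids:       (num_masked,) vocabulary indices of the masked-out words.
    alpha:          illustrative mixing weight, not the authors' setting.
    """
    log_probs = F.log_softmax(student_logits, dim=-1)

    # Standard masked-LM term: negative log-likelihood of the gold word.
    mlm_loss = F.nll_loss(log_probs, gold_ids)

    # Distillation term: KL(teacher || student) at each masked position,
    # equivalent (up to the teacher's entropy) to a soft cross-entropy
    # against the teacher's distribution.
    kd_loss = F.kl_div(log_probs, teacher_probs, reduction="batchmean")

    return alpha * mlm_loss + (1.0 - alpha) * kd_loss


# Example shapes: 8 masked positions, vocabulary of 30k words.
# student_logits = torch.randn(8, 30000)
# teacher_probs = torch.softmax(torch.randn(8, 30000), dim=-1)
# gold_ids = torch.randint(0, 30000, (8,))
# loss = distill_mlm_loss(student_logits, teacher_probs, gold_ids)
```
In the paper's setting, the teacher distributions come from a hierarchical syntactic LM that is harder to scale, which is why they are distilled into the more scalable BERT student rather than used directly.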
Related papers
- Contrastive Learning of Sentence Embeddings from Scratch [26.002876719243464]
We present SynCSE, a contrastive learning framework that trains sentence embeddings with synthesized data.
Specifically, we explore utilizing large language models to synthesize the required data samples for contrastive learning.
Experimental results on sentence similarity and reranking tasks indicate that both SynCSE-partial and SynCSE-scratch greatly outperform unsupervised baselines (a generic contrastive objective of this kind is sketched after this list).
arXiv Detail & Related papers (2023-05-24T11:56:21Z)
- Deep Semi-supervised Learning with Double-Contrast of Features and Semantics [2.2230089845369094]
This paper proposes an end-to-end deep semi-supervised learning framework built on a double contrast of semantics and features.
We leverage information theory to explain the rationale behind this double contrast of semantics and features.
arXiv Detail & Related papers (2022-11-28T09:08:19Z)
- Does BERT really agree ? Fine-grained Analysis of Lexical Dependence on a Syntactic Task [70.29624135819884]
We study the extent to which BERT is able to perform lexically-independent subject-verb number agreement (NA) on targeted syntactic templates.
Our results on nonce sentences suggest that the model generalizes well for simple templates, but fails to perform lexically-independent syntactic generalization when as little as one attractor is present.
arXiv Detail & Related papers (2022-04-14T11:33:15Z)
- Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation [98.83925811122795]
We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data.
We compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data.
Experimental results on large-scale WMT English⇒German and English⇒Chinese datasets demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2021-06-02T05:01:36Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pretraining is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning [97.10875695679499]
We propose ERICA, a novel contrastive learning framework applied during pre-training to obtain a deeper understanding of the entities and their relations in text.
Experimental results demonstrate that our proposed ERICA framework achieves consistent improvements on several document-level language understanding tasks.
arXiv Detail & Related papers (2020-12-30T03:35:22Z)
- On the Sentence Embeddings from Pre-trained Language Models [78.45172445684126]
In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited.
We find that BERT always induces a non-smooth anisotropic semantic space of sentences, which harms its performance of semantic similarity.
We propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective.
arXiv Detail & Related papers (2020-11-02T13:14:57Z)
- GiBERT: Introducing Linguistic Knowledge into BERT through a Lightweight Gated Injection Method [29.352569563032056]
We propose a novel method to explicitly inject linguistic knowledge in the form of word embeddings into a pre-trained BERT.
Our performance improvements on multiple semantic similarity datasets when injecting dependency-based and counter-fitted embeddings indicate that such information is beneficial and currently missing from the original model.
arXiv Detail & Related papers (2020-10-23T17:00:26Z)
- Analysis and Evaluation of Language Models for Word Sense Disambiguation [18.001457030065712]
Transformer-based language models have taken many fields in NLP by storm.
BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense.
BERT and its derivatives dominate most of the existing evaluation benchmarks.
arXiv Detail & Related papers (2020-08-26T15:07:07Z)
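Several of the related papers above (e.g. SynCSE and ERICA) build on contrastive learning over learned representations. As a point of reference only, here is a minimal sketch of a generic in-batch contrastive (InfoNCE) objective over sentence embeddings; it is an assumed, illustrative formulation rather than the specific loss of any paper listed here, and the `temperature` value is a placeholder.
```python
import torch
import torch.nn.functional as F


def info_nce(anchors, positives, temperature=0.05):
    """Generic in-batch contrastive (InfoNCE) loss over sentence embeddings.

    anchors, positives: (batch, dim) embeddings of paired sentences/views;
    row i of `positives` is the positive for row i of `anchors`, and every
    other row in the batch serves as an in-batch negative.
    """
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)

    # Cosine similarities between every anchor and every candidate.
    sim = anchors @ positives.t() / temperature      # (batch, batch)
    labels = torch.arange(sim.size(0), device=sim.device)

    # Cross-entropy pulls each anchor toward its own positive and pushes
    # it away from the other candidates in the batch.
    return F.cross_entropy(sim, labels)
```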
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.