Syntactic Structure Distillation Pretraining For Bidirectional Encoders
- URL: http://arxiv.org/abs/2005.13482v1
- Date: Wed, 27 May 2020 16:44:01 GMT
- Title: Syntactic Structure Distillation Pretraining For Bidirectional Encoders
- Authors: Adhiguna Kuncoro, Lingpeng Kong, Daniel Fried, Dani Yogatama, Laura
Rimell, Chris Dyer, Phil Blunsom
- Abstract summary: We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
- Score: 49.483357228441434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Textual representation learners trained on large amounts of data have
achieved notable success on downstream tasks; intriguingly, they have also
performed well on challenging tests of syntactic competence. Given this
success, it remains an open question whether scalable learners like BERT can
become fully proficient in the syntax of natural language by virtue of data
scale alone, or whether they still benefit from more explicit syntactic biases.
To answer this question, we introduce a knowledge distillation strategy for
injecting syntactic biases into BERT pretraining, by distilling the
syntactically informative predictions of a hierarchical (albeit harder to
scale) syntactic language model. Since BERT models masked words in
bidirectional context, we propose to distill the approximate marginal
distribution over words in context from the syntactic LM. Our approach reduces
relative error by 2-21% on a diverse set of structured prediction tasks,
although we obtain mixed results on the GLUE benchmark. Our findings
demonstrate the benefits of syntactic biases, even in representation learners
that exploit large amounts of data, and contribute to a better understanding of
where syntactic biases are most helpful in benchmarks of natural language
understanding.
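As a rough illustration of the objective described above, the sketch below combines the standard masked-LM cross-entropy with a soft cross-entropy (KL) term against the syntactic LM's approximate marginal over each masked position. This is a minimal PyTorch-style sketch under stated assumptions: the teacher marginals are treated as precomputed tensors, and the mixing weight `alpha` and the exact combination of the two terms are illustrative placeholders rather than the authors' published recipe.
```python
import torch
import torch.nn.functional as F


def distill_mlm_loss(student_logits, teacher_probs, gold_ids, alpha=0.5):
    """Interpolate the usual MLM loss with a word-level distillation term.

    student_logits: (num_masked, vocab) raw scores from the BERT student.
    teacher_probs:  (num_masked, vocab) approximate marginal distributions
                    over each masked word, precomputed from the syntactic LM.
    gold_ids:       (num_masked,) vocabulary indices of the masked-out words.
    alpha:          illustrative mixing weight, not the authors' setting.
    """
    log_probs = F.log_softmax(student_logits, dim=-1)

    # Standard masked-LM term: negative log-likelihood of the gold word.
    mlm_loss = F.nll_loss(log_probs, gold_ids)

    # Distillation term: KL(teacher || student) at each masked position,
    # equivalent (up to the teacher's entropy) to a soft cross-entropy
    # against the teacher's distribution.
    kd_loss = F.kl_div(log_probs, teacher_probs, reduction="batchmean")

    return alpha * mlm_loss + (1.0 - alpha) * kd_loss


# Example shapes: 8 masked positions, vocabulary of 30k words.
# student_logits = torch.randn(8, 30000)
# teacher_probs = torch.softmax(torch.randn(8, 30000), dim=-1)
# gold_ids = torch.randint(0, 30000, (8,))
# loss = distill_mlm_loss(student_logits, teacher_probs, gold_ids)
```
In the paper's setting, the teacher distributions come from a hierarchical syntactic LM that is harder to scale, which is why they are distilled into the more scalable BERT student rather than used directly.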
Related papers
- Contrastive Learning of Sentence Embeddings from Scratch [26.002876719243464]
We present SynCSE, a contrastive learning framework that trains sentence embeddings with synthesized data.
Specifically, we explore utilizing large language models to synthesize the required data samples for contrastive learning.
Experimental results on sentence similarity and reranking tasks indicate that both SynCSE-partial and SynCSE-scratch greatly outperform unsupervised baselines (a generic contrastive objective of this kind is sketched after this list).
arXiv Detail & Related papers (2023-05-24T11:56:21Z)
- Deep Semi-supervised Learning with Double-Contrast of Features and Semantics [2.2230089845369094]
This paper proposes an end-to-end deep semi-supervised learning framework built on a double contrast of semantics and features.
We leverage information theory to explain the rationale behind this double contrast of semantics and features.
arXiv Detail & Related papers (2022-11-28T09:08:19Z)
- Does BERT really agree ? Fine-grained Analysis of Lexical Dependence on a Syntactic Task [70.29624135819884]
We study the extent to which BERT is able to perform lexically-independent subject-verb number agreement (NA) on targeted syntactic templates.
Our results on nonce sentences suggest that the model generalizes well for simple templates, but fails to perform lexically-independent syntactic generalization when as little as one attractor is present.
arXiv Detail & Related papers (2022-04-14T11:33:15Z)
- Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation [98.83925811122795]
We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data.
We compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data.
Experimental results on large-scale WMT English⇒German and English⇒Chinese datasets demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2021-06-02T05:01:36Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pretraining is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning [97.10875695679499]
We propose ERICA, a novel contrastive learning framework applied during pre-training to obtain a deeper understanding of the entities and their relations in text.
Experimental results demonstrate that our proposed ERICA framework achieves consistent improvements on several document-level language understanding tasks.
arXiv Detail & Related papers (2020-12-30T03:35:22Z)
- On the Sentence Embeddings from Pre-trained Language Models [78.45172445684126]
In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited.
We find that BERT always induces a non-smooth anisotropic semantic space of sentences, which harms its performance of semantic similarity.
We propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective.
arXiv Detail & Related papers (2020-11-02T13:14:57Z)
- GiBERT: Introducing Linguistic Knowledge into BERT through a Lightweight Gated Injection Method [29.352569563032056]
We propose a novel method to explicitly inject linguistic knowledge in the form of word embeddings into a pre-trained BERT.
Our performance improvements on multiple semantic similarity datasets when injecting dependency-based and counter-fitted embeddings indicate that such information is beneficial and currently missing from the original model.
arXiv Detail & Related papers (2020-10-23T17:00:26Z)
- Analysis and Evaluation of Language Models for Word Sense Disambiguation [18.001457030065712]
Transformer-based language models have taken many fields in NLP by storm.
BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense.
BERT and its derivatives dominate most of the existing evaluation benchmarks.
arXiv Detail & Related papers (2020-08-26T15:07:07Z)
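Several of the related papers above (e.g. SynCSE and ERICA) build on contrastive learning over learned representations. As a point of reference only, here is a minimal sketch of a generic in-batch contrastive (InfoNCE) objective over sentence embeddings; it is an assumed, illustrative formulation rather than the specific loss of any paper listed here, and the `temperature` value is a placeholder.
```python
import torch
import torch.nn.functional as F


def info_nce(anchors, positives, temperature=0.05):
    """Generic in-batch contrastive (InfoNCE) loss over sentence embeddings.

    anchors, positives: (batch, dim) embeddings of paired sentences/views;
    row i of `positives` is the positive for row i of `anchors`, and every
    other row in the batch serves as an in-batch negative.
    """
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)

    # Cosine similarities between every anchor and every candidate.
    sim = anchors @ positives.t() / temperature      # (batch, batch)
    labels = torch.arange(sim.size(0), device=sim.device)

    # Cross-entropy pulls each anchor toward its own positive and pushes
    # it away from the other candidates in the batch.
    return F.cross_entropy(sim, labels)
```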
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.