Do Transformers Parse while Predicting the Masked Word?
- URL: http://arxiv.org/abs/2303.08117v2
- Date: Mon, 16 Oct 2023 03:27:53 GMT
- Title: Do Transformers Parse while Predicting the Masked Word?
- Authors: Haoyu Zhao, Abhishek Panigrahi, Rong Ge, Sanjeev Arora
- Abstract summary: Some doubts have been raised about whether pre-trained language models are actually doing parsing.
This paper takes a step toward answering these questions in the context of generative modeling with PCFGs.
- Score: 48.65553369481289
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models have been shown to encode linguistic structures,
e.g. dependency and constituency parse trees, in their embeddings while being
trained on unsupervised loss functions like masked language modeling. Some
doubts have been raised about whether the models are actually doing parsing or
only some computation weakly correlated with it. We study two questions: (a) Is it
possible to explicitly describe transformers with realistic embedding
dimension, number of heads, etc. that are capable of doing parsing -- or even
approximate parsing? (b) Why do pre-trained models capture parsing structure?
This paper takes a step toward answering these questions in the context of
generative modeling with PCFGs. We show that masked language models like BERT
or RoBERTa of moderate sizes can approximately execute the Inside-Outside
algorithm for the English PCFG [Marcus et al., 1993]. We also show that the
Inside-Outside algorithm is optimal for masked language modeling loss on the
PCFG-generated data. We also give a construction of transformers with $50$
layers, $15$ attention heads, and $1275$-dimensional embeddings on average such
that, using their embeddings, it is possible to do constituency parsing with
$>70\%$ F1 score on the PTB dataset. We conduct probing experiments on models
pre-trained on PCFG-generated data to show that this pre-training not only allows
recovery of approximate parse trees, but also recovers the marginal span probabilities
computed by the Inside-Outside algorithm, which suggests an implicit bias of
masked language modeling towards this algorithm.
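To make the algorithmic claim concrete, below is a minimal sketch of the Inside-Outside dynamic program on a toy PCFG in Chomsky normal form; the grammar, sentence, and names are illustrative stand-ins for the treebank-estimated English PCFG used in the paper. The span marginals it returns are the quantities the probing experiments recover from pre-trained embeddings.

```python
from collections import defaultdict

# Toy CNF grammar (invented for illustration; the paper uses a PCFG estimated
# from the Penn Treebank). Binary rules P(A -> B C) and lexical rules P(A -> word).
BINARY = {("S", "NP", "VP"): 1.0,
          ("NP", "Det", "N"): 1.0,
          ("VP", "V", "NP"): 1.0}
LEXICAL = {("Det", "the"): 1.0,
           ("N", "dog"): 0.5, ("N", "cat"): 0.5,
           ("V", "saw"): 1.0}
START = "S"
NONTERMS = {a for (a, _, _) in BINARY} | {a for (a, _) in LEXICAL}

def span_marginals(words):
    """Inside-Outside: marginal probability that words[i..j] is a constituent."""
    n = len(words)
    inside = defaultdict(float)   # (i, j, A) -> P(A =>* words[i..j])
    outside = defaultdict(float)

    # Inside pass (CKY-style dynamic program over span widths).
    for i, w in enumerate(words):
        for (A, word), p in LEXICAL.items():
            if word == w:
                inside[i, i, A] += p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width - 1
            for k in range(i, j):
                for (A, B, C), p in BINARY.items():
                    inside[i, j, A] += p * inside[i, k, B] * inside[k + 1, j, C]

    # Outside pass, from the widest span down to single words.
    outside[0, n - 1, START] = 1.0
    for width in range(n, 1, -1):
        for i in range(n - width + 1):
            j = i + width - 1
            for k in range(i, j):
                for (A, B, C), p in BINARY.items():
                    parent = outside[i, j, A]
                    if parent:
                        outside[i, k, B] += p * parent * inside[k + 1, j, C]
                        outside[k + 1, j, C] += p * parent * inside[i, k, B]

    Z = inside[0, n - 1, START]  # total sentence probability
    return {(i, j): sum(inside[i, j, A] * outside[i, j, A] for A in NONTERMS) / Z
            for i in range(n) for j in range(i, n)}

print(span_marginals("the dog saw the cat".split()))
```

With this deterministic toy grammar every true constituent span gets marginal 1 and every other span gets 0; under an ambiguous grammar such as the English PCFG the marginals become graded, and those graded values are what the probes recover from the pre-trained embeddings.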
Related papers
- Contextual Distortion Reveals Constituency: Masked Language Models are
Implicit Parsers [7.558415495951758]
We propose a novel method for extracting parse trees from masked language models (LMs).
Our method computes a score for each span based on the distortion of contextual representations resulting from linguistic perturbations.
Our method consistently outperforms previous state-of-the-art methods on English with masked LMs, and also demonstrates superior performance in a multilingual setting.
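A rough sketch of that idea follows, under the assumption that a span can be scored by how much its contextual representations move when the sentence is perturbed outside the span; the model choice, the word-substitution perturbation, and the Euclidean distance are illustrative stand-ins rather than the paper's exact recipe.

```python
# Sketch: score a span by how much its contextual representations are distorted
# when one word outside the span is perturbed. Illustrative only.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def word_reps(words):
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]       # (num_tokens, dim)
    # One vector per word: the representation of its first sub-word piece.
    first_piece = [enc.word_ids().index(i) for i in range(len(words))]
    return hidden[first_piece]

def span_distortion(words, i, j, placeholder="something"):
    """Average representation change inside span [i, j] when one word outside
    the span is swapped for a generic placeholder (an illustrative perturbation)."""
    base = word_reps(words)
    distortions = []
    for k in list(range(i)) + list(range(j + 1, len(words))):
        perturbed = words[:k] + [placeholder] + words[k + 1:]
        reps = word_reps(perturbed)
        distortions.append((reps[i:j + 1] - base[i:j + 1]).norm(dim=-1).mean())
    return torch.stack(distortions).mean().item() if distortions else 0.0

words = "the dog saw the cat".split()
# Span scores like these can then drive a chart-style tree extraction.
print({(i, j): round(span_distortion(words, i, j), 3)
       for (i, j) in [(0, 1), (1, 2), (3, 4)]})
```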
arXiv Detail & Related papers (2023-06-01T13:10:48Z) - Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z) - Nonparametric Masked Language Modeling [113.71921977520864]
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary.
We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus.
NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval.
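As a toy illustration of the nonparametric prediction step, the sketch below retrieves a filler for a masked position from a tiny reference corpus instead of scoring a fixed vocabulary; an off-the-shelf BERT encoder and single-token retrieval stand in for NPM's contrastively trained phrase encoder and phrase-level datastore.

```python
# Toy nonparametric masked prediction: the masked position retrieves its nearest
# token occurrence from a small reference corpus rather than using a vocabulary
# softmax. Encoder and corpus are illustrative stand-ins.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased").eval()

def hidden_states(text):
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        return batch, enc(**batch).last_hidden_state[0]

# Build a token-level datastore from the reference corpus.
corpus = ["the dog chased the cat", "she parsed the sentence with a grammar"]
keys, values = [], []
for sent in corpus:
    batch, hidden = hidden_states(sent)
    for pos, tid in enumerate(batch["input_ids"][0].tolist()):
        if tid not in tok.all_special_ids:
            keys.append(hidden[pos])
            values.append(tok.decode([tid]))
keys = torch.stack(keys)

# Retrieve a filler for the masked position from the datastore.
query_batch, query_hidden = hidden_states(f"the {tok.mask_token} chased a ball")
mask_pos = (query_batch["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
sims = torch.nn.functional.cosine_similarity(query_hidden[mask_pos], keys)
print("retrieved filler:", values[sims.argmax().item()])
```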
arXiv Detail & Related papers (2022-12-02T18:10:42Z) - Characterizing Intrinsic Compositionality in Transformers with Tree
Projections [72.45375959893218]
Neural models like transformers can route information arbitrarily between different parts of their input.
We show that transformers for three different tasks become more treelike over the course of training.
These trees are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.
arXiv Detail & Related papers (2022-11-02T17:10:07Z) - Unsupervised and Few-shot Parsing from Pretrained Language Models [56.33247845224995]
We propose an Unsupervised constituent Parsing model that calculates an Out Association score solely based on the self-attention weight matrix learned in a pretrained language model.
We extend the unsupervised models to few-shot parsing models that use a few annotated trees to learn better linear projection matrices for parsing.
Our few-shot parsing model FPIO, trained with only 20 annotated trees, outperforms a previous few-shot parsing method trained with 50 annotated trees.
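The sketch below illustrates the general attention-to-parse pipeline with a generic inside-versus-outside attention-mass score and a greedy top-down split; it is not the paper's Out Association score or its learned projection matrices.

```python
# Sketch of attention-based unsupervised parsing: score each span by how much of
# its attention mass stays inside the span, then build a binary tree by greedily
# choosing the best split. A generic heuristic, not the paper's exact score.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

def word_attention(words):
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # Average over layers and heads, then keep one row/column per word
    # (the first sub-word piece), dropping special tokens.
    att = torch.stack(out.attentions).mean(dim=(0, 1, 2))    # (seq, seq)
    first = [enc.word_ids().index(i) for i in range(len(words))]
    return att[first][:, first]

def span_score(att, i, j):
    inside = att[i:j + 1, i:j + 1].sum()
    total = att[i:j + 1, :].sum()
    return (inside / total).item()

def parse(att, i, j):
    if i == j:
        return i
    best_k = max(range(i, j),
                 key=lambda k: span_score(att, i, k) + span_score(att, k + 1, j))
    return (parse(att, i, best_k), parse(att, best_k + 1, j))

words = "the dog saw the cat".split()
print(parse(word_attention(words), 0, len(words) - 1))
```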
arXiv Detail & Related papers (2022-06-10T10:29:15Z) - The Limitations of Limited Context for Constituency Parsing [27.271792317099045]
The Parsing-Reading-Predict architecture of Shen et al. (2018a) was the first to perform unsupervised syntactic parsing.
What kind of syntactic structure can current neural approaches to syntax represent?
We ground this question in the sandbox of probabilistic context-free grammars (PCFGs).
We identify a key aspect of the representational power of these approaches: the amount and directionality of context that the predictor has access to.
arXiv Detail & Related papers (2021-06-03T03:58:35Z) - Heads-up! Unsupervised Constituency Parsing via Self-Attention Heads [27.578115452635625]
We propose a novel fully unsupervised parsing approach that extracts constituency trees from PLM attention heads.
We rank transformer attention heads based on their inherent properties, and create an ensemble of high-ranking heads to produce the final tree.
Our experiments can also be used as a tool to analyze the grammars PLMs learn implicitly.
arXiv Detail & Related papers (2020-10-19T13:51:40Z) - Latent Tree Learning with Ordered Neurons: What Parses Does It Produce? [2.025491206574996]
Latent tree learning models can learn constituency parsing without exposure to human-annotated tree structures.
ON-LSTM is trained on language modelling and has near-state-of-the-art performance on unsupervised parsing.
We replicate the model with different restarts and examine their parses.
arXiv Detail & Related papers (2020-10-10T07:12:48Z) - Parameter Space Factorization for Zero-Shot Learning across Tasks and
Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields results comparable to or better than state-of-the-art zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)