Do Transformers Parse while Predicting the Masked Word?
- URL: http://arxiv.org/abs/2303.08117v2
- Date: Mon, 16 Oct 2023 03:27:53 GMT
- Title: Do Transformers Parse while Predicting the Masked Word?
- Authors: Haoyu Zhao, Abhishek Panigrahi, Rong Ge, Sanjeev Arora
- Abstract summary: Some doubts have been raised about whether pre-trained language models are actually doing parsing.
This paper takes a step toward answering these questions in the context of generative modeling with PCFGs.
- Score: 48.65553369481289
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models have been shown to encode linguistic structures,
e.g. dependency and constituency parse trees, in their embeddings while being
trained on unsupervised loss functions like masked language modeling. Some
doubts have been raised about whether the models are actually doing parsing or
only some computation weakly correlated with it. We study two questions: (a) Is it
possible to explicitly describe transformers with realistic embedding
dimension, number of heads, etc. that are capable of doing parsing -- or even
approximate parsing? (b) Why do pre-trained models capture parsing structure?
This paper takes a step toward answering these questions in the context of
generative modeling with PCFGs. We show that masked language models like BERT
or RoBERTa of moderate sizes can approximately execute the Inside-Outside
algorithm for the English PCFG [Marcus et al., 1993]. We also show that the
Inside-Outside algorithm is optimal for masked language modeling loss on the
PCFG-generated data. We also give a construction of transformers with $50$
layers, $15$ attention heads, and $1275$-dimensional embeddings on average such
that, using their embeddings, it is possible to do constituency parsing with
$>70\%$ F1 score on the PTB dataset. We conduct probing experiments on models
pre-trained on PCFG-generated data to show that this pre-training not only allows
recovery of approximate parse trees, but also recovers the marginal span probabilities
computed by the Inside-Outside algorithm, which suggests an implicit bias of
masked language modeling towards this algorithm.
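To make the algorithmic claim concrete, below is a minimal sketch of the Inside-Outside dynamic program on a toy PCFG in Chomsky normal form; the grammar, sentence, and names are illustrative stand-ins for the treebank-estimated English PCFG used in the paper. The span marginals it returns are the quantities the probing experiments recover from pre-trained embeddings.

```python
from collections import defaultdict

# Toy CNF grammar (invented for illustration; the paper uses a PCFG estimated
# from the Penn Treebank). Binary rules P(A -> B C) and lexical rules P(A -> word).
BINARY = {("S", "NP", "VP"): 1.0,
          ("NP", "Det", "N"): 1.0,
          ("VP", "V", "NP"): 1.0}
LEXICAL = {("Det", "the"): 1.0,
           ("N", "dog"): 0.5, ("N", "cat"): 0.5,
           ("V", "saw"): 1.0}
START = "S"
NONTERMS = {a for (a, _, _) in BINARY} | {a for (a, _) in LEXICAL}

def span_marginals(words):
    """Inside-Outside: marginal probability that words[i..j] is a constituent."""
    n = len(words)
    inside = defaultdict(float)   # (i, j, A) -> P(A =>* words[i..j])
    outside = defaultdict(float)

    # Inside pass (CKY-style dynamic program over span widths).
    for i, w in enumerate(words):
        for (A, word), p in LEXICAL.items():
            if word == w:
                inside[i, i, A] += p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width - 1
            for k in range(i, j):
                for (A, B, C), p in BINARY.items():
                    inside[i, j, A] += p * inside[i, k, B] * inside[k + 1, j, C]

    # Outside pass, from the widest span down to single words.
    outside[0, n - 1, START] = 1.0
    for width in range(n, 1, -1):
        for i in range(n - width + 1):
            j = i + width - 1
            for k in range(i, j):
                for (A, B, C), p in BINARY.items():
                    parent = outside[i, j, A]
                    if parent:
                        outside[i, k, B] += p * parent * inside[k + 1, j, C]
                        outside[k + 1, j, C] += p * parent * inside[i, k, B]

    Z = inside[0, n - 1, START]  # total sentence probability
    return {(i, j): sum(inside[i, j, A] * outside[i, j, A] for A in NONTERMS) / Z
            for i in range(n) for j in range(i, n)}

print(span_marginals("the dog saw the cat".split()))
```

With this deterministic toy grammar every true constituent span gets marginal 1 and every other span gets 0; under an ambiguous grammar such as the English PCFG the marginals become graded, and those graded values are what the probes recover from the pre-trained embeddings.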
Related papers
- Contextual Distortion Reveals Constituency: Masked Language Models are
Implicit Parsers [7.558415495951758]
We propose a novel method for extracting parse trees from masked language models (LMs).
Our method computes a score for each span based on the distortion of contextual representations resulting from linguistic perturbations.
Our method consistently outperforms previous state-of-the-art methods on English with masked LMs, and also demonstrates superior performance in a multilingual setting.
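A rough sketch of that idea follows, under the assumption that a span can be scored by how much its contextual representations move when the sentence is perturbed outside the span; the model choice, the word-substitution perturbation, and the Euclidean distance are illustrative stand-ins rather than the paper's exact recipe.

```python
# Sketch: score a span by how much its contextual representations are distorted
# when one word outside the span is perturbed. Illustrative only.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def word_reps(words):
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]       # (num_tokens, dim)
    # One vector per word: the representation of its first sub-word piece.
    first_piece = [enc.word_ids().index(i) for i in range(len(words))]
    return hidden[first_piece]

def span_distortion(words, i, j, placeholder="something"):
    """Average representation change inside span [i, j] when one word outside
    the span is swapped for a generic placeholder (an illustrative perturbation)."""
    base = word_reps(words)
    distortions = []
    for k in list(range(i)) + list(range(j + 1, len(words))):
        perturbed = words[:k] + [placeholder] + words[k + 1:]
        reps = word_reps(perturbed)
        distortions.append((reps[i:j + 1] - base[i:j + 1]).norm(dim=-1).mean())
    return torch.stack(distortions).mean().item() if distortions else 0.0

words = "the dog saw the cat".split()
# Span scores like these can then drive a chart-style tree extraction.
print({(i, j): round(span_distortion(words, i, j), 3)
       for (i, j) in [(0, 1), (1, 2), (3, 4)]})
```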
arXiv Detail & Related papers (2023-06-01T13:10:48Z) - Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z) - Nonparametric Masked Language Modeling [113.71921977520864]
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary.
We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus.
NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval.
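As a toy illustration of the nonparametric prediction step, the sketch below retrieves a filler for a masked position from a tiny reference corpus instead of scoring a fixed vocabulary; an off-the-shelf BERT encoder and single-token retrieval stand in for NPM's contrastively trained phrase encoder and phrase-level datastore.

```python
# Toy nonparametric masked prediction: the masked position retrieves its nearest
# token occurrence from a small reference corpus rather than using a vocabulary
# softmax. Encoder and corpus are illustrative stand-ins.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased").eval()

def hidden_states(text):
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        return batch, enc(**batch).last_hidden_state[0]

# Build a token-level datastore from the reference corpus.
corpus = ["the dog chased the cat", "she parsed the sentence with a grammar"]
keys, values = [], []
for sent in corpus:
    batch, hidden = hidden_states(sent)
    for pos, tid in enumerate(batch["input_ids"][0].tolist()):
        if tid not in tok.all_special_ids:
            keys.append(hidden[pos])
            values.append(tok.decode([tid]))
keys = torch.stack(keys)

# Retrieve a filler for the masked position from the datastore.
query_batch, query_hidden = hidden_states(f"the {tok.mask_token} chased a ball")
mask_pos = (query_batch["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
sims = torch.nn.functional.cosine_similarity(query_hidden[mask_pos], keys)
print("retrieved filler:", values[sims.argmax().item()])
```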
arXiv Detail & Related papers (2022-12-02T18:10:42Z) - Characterizing Intrinsic Compositionality in Transformers with Tree
Projections [72.45375959893218]
Neural models like transformers can route information arbitrarily between different parts of their input.
We show that transformers for three different tasks become more treelike over the course of training.
These trees are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.
arXiv Detail & Related papers (2022-11-02T17:10:07Z) - Unsupervised and Few-shot Parsing from Pretrained Language Models [56.33247845224995]
We propose an Unsupervised constituent Parsing model that calculates an Out Association score solely based on the self-attention weight matrix learned in a pretrained language model.
We extend the unsupervised models to few-shot parsing models that use a few annotated trees to learn better linear projection matrices for parsing.
Our few-shot parsing model FPIO, trained with only 20 annotated trees, outperforms a previous few-shot parsing method trained with 50 annotated trees.
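The sketch below illustrates the general attention-to-parse pipeline with a generic inside-versus-outside attention-mass score and a greedy top-down split; it is not the paper's Out Association score or its learned projection matrices.

```python
# Sketch of attention-based unsupervised parsing: score each span by how much of
# its attention mass stays inside the span, then build a binary tree by greedily
# choosing the best split. A generic heuristic, not the paper's exact score.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

def word_attention(words):
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # Average over layers and heads, then keep one row/column per word
    # (the first sub-word piece), dropping special tokens.
    att = torch.stack(out.attentions).mean(dim=(0, 1, 2))    # (seq, seq)
    first = [enc.word_ids().index(i) for i in range(len(words))]
    return att[first][:, first]

def span_score(att, i, j):
    inside = att[i:j + 1, i:j + 1].sum()
    total = att[i:j + 1, :].sum()
    return (inside / total).item()

def parse(att, i, j):
    if i == j:
        return i
    best_k = max(range(i, j),
                 key=lambda k: span_score(att, i, k) + span_score(att, k + 1, j))
    return (parse(att, i, best_k), parse(att, best_k + 1, j))

words = "the dog saw the cat".split()
print(parse(word_attention(words), 0, len(words) - 1))
```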
arXiv Detail & Related papers (2022-06-10T10:29:15Z) - The Limitations of Limited Context for Constituency Parsing [27.271792317099045]
The Parsing-Reading-Predict architecture of Shen et al. (2018a) was the first to perform unsupervised syntactic parsing.
What kind of syntactic structure can current neural approaches to syntax represent?
We ground this question in the sandbox of probabilistic context-free grammars (PCFGs).
We identify a key aspect of the representational power of these approaches: the amount and directionality of context that the predictor has access to.
arXiv Detail & Related papers (2021-06-03T03:58:35Z) - Heads-up! Unsupervised Constituency Parsing via Self-Attention Heads [27.578115452635625]
We propose a novel fully unsupervised parsing approach that extracts constituency trees from PLM attention heads.
We rank transformer attention heads based on their inherent properties, and create an ensemble of high-ranking heads to produce the final tree.
Our experiments can also be used as a tool to analyze the grammars PLMs learn implicitly.
arXiv Detail & Related papers (2020-10-19T13:51:40Z) - Latent Tree Learning with Ordered Neurons: What Parses Does It Produce? [2.025491206574996]
Latent tree learning models can learn constituency parsing without exposure to human-annotated tree structures.
ON-LSTM is trained on language modelling and has near-state-of-the-art performance on unsupervised parsing.
We replicate the model with different restarts and examine their parses.
arXiv Detail & Related papers (2020-10-10T07:12:48Z) - Parameter Space Factorization for Zero-Shot Learning across Tasks and
Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields results comparable to or better than state-of-the-art zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)