On the Inductive Bias of Masked Language Modeling: From Statistical to Syntactic Dependencies
- URL: http://arxiv.org/abs/2104.05694v1
- Date: Mon, 12 Apr 2021 17:55:27 GMT
- Title: On the Inductive Bias of Masked Language Modeling: From Statistical to Syntactic Dependencies
- Authors: Tianyi Zhang and Tatsunori Hashimoto
- Abstract summary: Masking and predicting tokens in an unsupervised fashion can give rise to linguistic structures and downstream performance gains.
Recent theories have suggested that pretrained language models acquire useful inductive biases through masks that implicitly act as cloze reductions.
We show that the success of the random masking strategy used in practice cannot be explained by such cloze-like masks alone.
- Score: 8.370942516424817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study how masking and predicting tokens in an unsupervised fashion can
give rise to linguistic structures and downstream performance gains. Recent
theories have suggested that pretrained language models acquire useful
inductive biases through masks that implicitly act as cloze reductions for
downstream tasks. While appealing, we show that the success of the random
masking strategy used in practice cannot be explained by such cloze-like masks
alone. We construct cloze-like masks using task-specific lexicons for three
different classification datasets and show that the majority of pretrained
performance gains come from generic masks that are not associated with the
lexicon. To explain the empirical success of these generic masks, we
demonstrate a correspondence between the Masked Language Model (MLM) objective
and existing methods for learning statistical dependencies in graphical models.
Using this, we derive a method for extracting these learned statistical
dependencies in MLMs and show that these dependencies encode useful inductive
biases in the form of syntactic structures. In an unsupervised parsing
evaluation, simply forming a minimum spanning tree on the implied statistical
dependence structure outperforms a classic method for unsupervised parsing
(58.74 vs. 55.91 UUAS).
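The parsing recipe in the abstract can be sketched concretely. The snippet below is a minimal illustration rather than the authors' exact estimator: the model name, the KL-divergence proxy for pairwise dependence, and the greedy tree builder are all assumptions. It masks pairs of wordpiece positions with an off-the-shelf MLM, scores how much hiding position j shifts the prediction at position i, and reads an unsupervised parse off the maximum spanning tree of the symmetrized scores.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def dependence_scores(sentence):
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    pos = list(range(1, len(ids) - 1))                 # skip [CLS] / [SEP]
    scores = torch.zeros(len(ids), len(ids))
    for i in pos:
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        p = F.log_softmax(mlm(input_ids=masked[None]).logits[0, i], dim=-1)
        for j in pos:
            if j == i:
                continue
            both = masked.clone()
            both[j] = tok.mask_token_id
            q = F.log_softmax(mlm(input_ids=both[None]).logits[0, i], dim=-1)
            # KL(p || q): how much does hiding position j change the guess at i?
            scores[i, j] = F.kl_div(q, p, reduction="sum", log_target=True)
    return 0.5 * (scores + scores.T), pos              # symmetrize

def max_spanning_tree(scores, pos):
    """Greedy (Prim-style) maximum spanning tree over wordpiece positions."""
    in_tree, edges = {pos[0]}, []
    while len(in_tree) < len(pos):
        u, v = max(((a, b) for a in in_tree for b in pos if b not in in_tree),
                   key=lambda e: scores[e].item())
        edges.append((u, v))
        in_tree.add(v)
    return edges

S, pos = dependence_scores("the cat sat on the mat")
print(max_spanning_tree(S, pos))                       # edges over wordpiece indices
```

In a real UUAS evaluation the edges would be mapped from wordpieces back to words before comparison with gold dependencies.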
Related papers
- Robust Infidelity: When Faithfulness Measures on Masked Language Models Are Misleading [5.124348720450654]
We show that iterative masking produces large variation in faithfulness scores between otherwise comparable Transformer encoder text classifiers.
We explore task-specific considerations that undermine principled comparison of interpretability using iterative masking.
arXiv Detail & Related papers (2023-08-13T15:44:39Z)
- Contextual Distortion Reveals Constituency: Masked Language Models are Implicit Parsers [7.558415495951758]
We propose a novel method for extracting parse trees from masked language models (LMs).
Our method computes a score for each span based on the distortion of contextual representations resulting from linguistic perturbations.
Our method consistently outperforms previous state-of-the-art methods on English with masked LMs, and also demonstrates superior performance in a multilingual setting.
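A minimal sketch of the span-scoring idea described above, under assumptions that may differ from the paper: masking is used as the linguistic perturbation, cosine distance as the distortion measure, and bert-base-uncased as the encoder.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def span_distortion(sentence, start, end):
    """Distortion caused by masking wordpiece positions [start, end)."""
    ids = tok(sentence, return_tensors="pt")["input_ids"]
    base = enc(input_ids=ids).last_hidden_state[0]           # (seq, hidden)
    perturbed = ids.clone()
    perturbed[0, start:end] = tok.mask_token_id
    pert = enc(input_ids=perturbed).last_hidden_state[0]
    outside = [i for i in range(ids.size(1)) if not start <= i < end]
    cos = torch.nn.functional.cosine_similarity(base[outside], pert[outside], dim=-1)
    return (1.0 - cos).mean().item()                         # higher = more distortion

# e.g. compare candidate spans of the same sentence
print(span_distortion("the quick brown fox jumps", start=2, end=4))
```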
arXiv Detail & Related papers (2023-06-01T13:10:48Z)
- Self-Evolution Learning for Discriminative Language Model Pretraining [103.57103957631067]
Self-Evolution learning (SE) is a simple and effective token masking and learning method.
SE focuses on learning the informative yet under-explored tokens and adaptively regularizes the training by introducing a novel Token-specific Label Smoothing approach.
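One plausible reading of the Token-specific Label Smoothing component, sketched under loud assumptions: mixing the one-hot label with the model's own reference prediction, and the per-token smoothing weights, are illustrative choices rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def token_specific_ls_loss(logits, labels, ref_probs, alphas):
    """
    logits:    (num_masked, vocab) current predictions at masked positions
    labels:    (num_masked,)       gold token ids
    ref_probs: (num_masked, vocab) reference distribution (assumption: the
                                   model's prediction from an earlier pass)
    alphas:    (num_masked,)       per-token smoothing weight in [0, 1]
    """
    vocab = logits.size(-1)
    one_hot = F.one_hot(labels, vocab).float()
    target = (1 - alphas[:, None]) * one_hot + alphas[:, None] * ref_probs
    return -(target * F.log_softmax(logits, dim=-1)).sum(-1).mean()

# toy usage: larger alpha could be assigned to harder, more informative tokens
logits = torch.randn(4, 30522)
labels = torch.randint(0, 30522, (4,))
ref = F.softmax(torch.randn(4, 30522), dim=-1)
alphas = torch.tensor([0.10, 0.30, 0.20, 0.05])
print(token_specific_ls_loss(logits, labels, ref, alphas))
```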
arXiv Detail & Related papers (2023-05-24T16:00:54Z)
- Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations [86.47908754383198]
Open-Vocabulary (OV) methods leverage large-scale image-caption pairs and vision-language models to learn novel categories.
Our method generates pseudo-mask annotations by leveraging the localization ability of a pre-trained vision-language model for objects present in image-caption pairs.
Our method trained with just pseudo-masks significantly improves the mAP scores on the MS-COCO dataset and OpenImages dataset.
arXiv Detail & Related papers (2023-03-29T17:58:39Z)
- Extreme Masking for Learning Instance and Distributed Visual Representations [50.152264456036114]
The paper presents a scalable approach for learning distributed representations over individual tokens and a holistic instance representation simultaneously.
We use self-attention blocks to represent distributed tokens, followed by cross-attention blocks to aggregate the holistic instance.
Our model, named ExtreMA, follows the plain BYOL approach where the instance representation from the unmasked subset is trained to predict that from the intact input.
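A compact sketch of the BYOL-style objective described above; the toy encoders, predictor, masking ratio, and mean pooling are placeholder assumptions rather than the ExtreMA architecture.

```python
import torch
import torch.nn.functional as F
from torch import nn

dim = 256
online = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
target = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
target.load_state_dict(online.state_dict())
predictor = nn.Linear(dim, dim)

def byol_masked_step(tokens, keep_ratio=0.2):
    """tokens: (batch, seq, dim) patch/token embeddings."""
    b, s, _ = tokens.shape
    keep = torch.rand(b, s).argsort(dim=1)[:, : int(s * keep_ratio)]   # random visible subset
    visible = torch.gather(tokens, 1, keep[..., None].expand(-1, -1, dim))
    online_instance = predictor(online(visible).mean(dim=1))           # masked view
    with torch.no_grad():
        target_instance = target(tokens).mean(dim=1)                   # intact input
    return -F.cosine_similarity(online_instance, target_instance, dim=-1).mean()

loss = byol_masked_step(torch.randn(8, 196, dim))
loss.backward()   # only the online branch and predictor receive gradients
# (in practice the target weights are updated as an EMA of the online weights)
```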
arXiv Detail & Related papers (2022-06-09T17:59:43Z)
- Comparing Text Representations: A Theory-Driven Approach [2.893558866535708]
We adapt general tools from computational learning theory to fit the specific characteristics of text datasets.
We present a method to evaluate the compatibility between representations and tasks.
This method provides a calibrated, quantitative measure of the difficulty of a classification-based NLP task.
arXiv Detail & Related papers (2021-09-15T17:48:19Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trained MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model [57.77981008219654]
The Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training.
We propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments and the tokens masked for a given training instance are drawn from a single segment.
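A small sketch of that masking scheme; the segment boundaries, masking rate, and sampling details here are chosen for illustration rather than taken from the paper.

```python
import random

def fully_explored_masks(tokens, num_segments, mask_id, rate=0.15, seed=0):
    """Return one masked view per segment; masks never overlap across views."""
    rng = random.Random(seed)
    n = len(tokens)
    bounds = [round(k * n / num_segments) for k in range(num_segments + 1)]
    views = []
    for k in range(num_segments):
        segment = list(range(bounds[k], bounds[k + 1]))
        picked = rng.sample(segment, max(1, int(rate * len(segment))))
        view = list(tokens)
        for i in picked:                     # masked positions stay inside segment k
            view[i] = mask_id
        views.append(view)
    return views

tokens = list(range(100, 112))               # stand-in token ids
for view in fully_explored_masks(tokens, num_segments=3, mask_id=0):
    print(view)
```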
arXiv Detail & Related papers (2020-10-12T21:28:14Z)
- Masking as an Efficient Alternative to Finetuning for Pretrained Language Models [49.64561153284428]
We learn selective binary masks for pretrained weights in lieu of modifying them through finetuning.
In intrinsic evaluations, we show that representations computed by masked language models encode information necessary for solving downstream tasks.
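A hedged sketch of learning a binary mask over a frozen pretrained weight matrix; the straight-through threshold parameterization and the score initialization below are assumptions, not necessarily the paper's exact scheme.

```python
import torch
from torch import nn

class MaskedLinear(nn.Module):
    def __init__(self, pretrained: nn.Linear, threshold: float = 0.5):
        super().__init__()
        # pretrained weights are frozen; only the real-valued scores are trained
        self.weight = nn.Parameter(pretrained.weight.detach(), requires_grad=False)
        self.bias = nn.Parameter(pretrained.bias.detach(), requires_grad=False)
        self.scores = nn.Parameter(torch.full_like(self.weight, 0.6))  # keep all weights at init
        self.threshold = threshold

    def forward(self, x):
        hard = (self.scores > self.threshold).float()
        # straight-through estimator: forward uses the binary mask,
        # backward passes gradients to the real-valued scores
        mask = hard + self.scores - self.scores.detach()
        return torch.nn.functional.linear(x, self.weight * mask, self.bias)

layer = MaskedLinear(nn.Linear(768, 768))
layer(torch.randn(2, 768)).sum().backward()
print(layer.scores.grad is not None, layer.weight.grad)   # True None
```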
arXiv Detail & Related papers (2020-04-26T15:03:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.