Inconsistencies in Masked Language Models
- URL: http://arxiv.org/abs/2301.00068v3
- Date: Fri, 23 Feb 2024 05:08:58 GMT
- Title: Inconsistencies in Masked Language Models
- Authors: Tom Young, Yunan Chen, Yang You
- Abstract summary: Masked language models (MLMs) can provide distributions of tokens in the masked positions in a sequence.
Distributions corresponding to different masking patterns can demonstrate considerable inconsistencies.
We propose an inference-time strategy for MLMs called Ensemble of Conditionals.
- Score: 20.320583166619528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning to predict masked tokens in a sequence has been shown to be a
helpful pretraining objective for powerful language models such as PaLM2. After
training, such masked language models (MLMs) can provide distributions of
tokens in the masked positions in a sequence. However, this paper shows that
distributions corresponding to different masking patterns can demonstrate
considerable inconsistencies, i.e., they cannot be derived from a coherent
joint distribution when considered together.
This fundamental flaw in MLMs can lead to self-contradictory behaviors during
inference. On various benchmark datasets including MMLU, MLMs can give
different predictions to the same input question. From BERT-base to UL2-20B, we
show that such inconsistencies exist ubiquitously in MLMs of diverse sizes and
configurations. In light of our observations, we further propose an
inference-time strategy for MLMs called Ensemble of Conditionals. It jointly
considers a selected range of inconsistent conditionals directly produced by
the MLM for the final prediction, which often leads to considerable accuracy
improvement.
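As a rough illustration of the inconsistency described above, and of the spirit of Ensemble of Conditionals, the sketch below queries an off-the-shelf MLM under several masking patterns and averages the resulting conditionals at the answer position. The model (bert-base-uncased), the example sentence, the choice of patterns, and plain probability averaging are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative only: probe how the conditional at one position shifts under
# different masking patterns, then pool the conditionals by simple averaging.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "bert-base-uncased"  # assumed model; the paper also studies larger MLMs
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def answer_distribution(tokens, answer_idx, extra_mask_idxs=()):
    """Vocabulary distribution at `answer_idx` with that position (and any
    `extra_mask_idxs`) replaced by [MASK]."""
    ids = tok.convert_tokens_to_ids(tokens)
    for i in (answer_idx, *extra_mask_idxs):
        ids[i] = tok.mask_token_id
    ids = [tok.cls_token_id] + ids + [tok.sep_token_id]
    with torch.no_grad():
        logits = model(torch.tensor([ids])).logits[0, answer_idx + 1]  # +1 for [CLS]
    return torch.softmax(logits, dim=-1)

tokens = tok.tokenize("paris is the capital of france .")
answer_idx = tokens.index("france")

# The same question under different masking patterns yields different conditionals.
patterns = [(), (0,), (3,)]  # additionally mask nothing, "paris", or "capital"
dists = [answer_distribution(tokens, answer_idx, p) for p in patterns]
for p, d in zip(patterns, dists):
    print(p, tok.convert_ids_to_tokens([int(d.argmax())]), float(d.max()))

# A naive "ensemble of conditionals": average the probability vectors and predict.
ensemble = torch.stack(dists).mean(dim=0)
print("ensemble:", tok.convert_ids_to_tokens([int(ensemble.argmax())]))
```

Printing the per-pattern predictions makes the masking-pattern dependence visible; averaging the inconsistent conditionals into a single prediction is one simple way to pool them, in the spirit (though not necessarily the detail) of the proposed strategy.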
Related papers
- Set-Based Prompting: Provably Solving the Language Model Order Dependency Problem [18.020492646988746]
We present Set-Based Prompting, a technique that guarantees the output of an LLM will not have order dependence on a specified set of sub-sequences.
Despite our inputs being out of distribution, the impact on expected accuracy is small, where the expectation is taken over uniformly chosen orderings of the candidate responses.
arXiv Detail & Related papers (2024-06-04T16:09:13Z)
- Towards Probabilistically-Sound Beam Search with Masked Language Models [0.0]
Beam search with masked language models (MLMs) is challenging in part because joint probability distributions over sequences are not readily available.
Estimating such distributions has important domain-specific applications such as ancient text restoration and protein engineering.
Here we present probabilistically-sound methods for beam search with MLMs.
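For context, the sketch below is a naive beam search that fills masked positions left to right using the MLM's own conditionals; each step conditions on a different masking pattern, which is exactly why claiming a coherent joint probability for the result is problematic. The model choice, beam width, and left-to-right fill order are illustrative assumptions, not the methods proposed in the paper.

```python
# Hedged sketch: naive beam search over [MASK] positions with an MLM's
# conditionals. Each step uses a different masking pattern, so the summed
# log-probabilities do not correspond to a well-defined joint distribution.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def naive_beam_fill(text: str, beam_width: int = 3):
    ids = tok(text, return_tensors="pt")["input_ids"][0]
    mask_positions = (ids == tok.mask_token_id).nonzero(as_tuple=True)[0].tolist()
    beams = [(ids.clone(), 0.0)]                      # (token ids, log-prob so far)
    for pos in mask_positions:                        # fill masks left to right
        candidates = []
        for seq, score in beams:
            with torch.no_grad():
                logits = model(seq.unsqueeze(0)).logits[0, pos]
            log_probs = torch.log_softmax(logits, dim=-1)
            top = torch.topk(log_probs, beam_width)
            for lp, tok_id in zip(top.values.tolist(), top.indices.tolist()):
                new_seq = seq.clone()
                new_seq[pos] = tok_id
                candidates.append((new_seq, score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return [(tok.decode(seq[1:-1]), score) for seq, score in beams]

for hyp, score in naive_beam_fill("The [MASK] sat on the [MASK]."):
    print(round(score, 2), hyp)
```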
arXiv Detail & Related papers (2024-02-22T23:36:26Z)
- Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code? [51.29970742152668]
We highlight that relying on accuracy-based measurements may lead to an overestimation of models' capabilities.
To address these issues, we introduce a technique called SyntaxEval for probing syntactic capabilities.
arXiv Detail & Related papers (2024-01-03T02:44:02Z)
- Deriving Language Models from Masked Language Models [12.628196757545979]
Masked language models (MLMs) do not explicitly define a distribution over language.
Recent work has implicitly treated them as such for the purposes of generation and scoring.
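One common scoring heuristic from this line of work is the pseudo-log-likelihood, which masks one position at a time and sums the log-probabilities of the observed tokens. The sketch below uses bert-base-uncased and is meant as background on the kind of implicit distribution being discussed, not as the specific derivation studied in this paper.

```python
# A common heuristic for treating an MLM as a sentence scorer: pseudo-log-likelihood,
# obtained by masking each position in turn and summing the log-probabilities of the
# true tokens. Model choice and examples are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Skip [CLS] (first) and [SEP] (last); mask each remaining position in turn.
    for i in range(1, ids.size(0) - 1):
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[i]].item()
    return total

print(pseudo_log_likelihood("The cat sat on the mat."))
print(pseudo_log_likelihood("The cat sat on the the."))  # typically scores lower
```

Note that the per-position conditionals come from different masking patterns, so the sum is a heuristic score rather than a log-probability under a coherent joint distribution, which is one reason the question of deriving a proper language model from an MLM is non-trivial.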
arXiv Detail & Related papers (2023-05-24T18:42:45Z)
- Representation Deficiency in Masked Language Modeling [107.39136254013042]
We propose MAE-LM, which pretrains the Masked Autoencoder architecture, where [MASK] tokens are excluded from the encoder.
We show that MAE-LM consistently outperforms pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.
arXiv Detail & Related papers (2023-02-04T01:54:17Z)
- Transcormer: Transformer for Sentence Scoring with Sliding Language Modeling [95.9542389945259]
Sentence scoring aims at measuring the likelihood of a sentence and is widely used in many natural language processing scenarios.
We propose Transcormer -- a Transformer model with a novel sliding language modeling (SLM) objective for sentence scoring.
arXiv Detail & Related papers (2022-05-25T18:00:09Z)
- Exposing the Implicit Energy Networks behind Masked Language Models via Metropolis--Hastings [57.133639209759615]
We interpret MLMs as energy-based sequence models and propose two energy parametrizations derivable from trained MLMs.
We develop a tractable sampling scheme based on the Metropolis-Hastings Monte Carlo algorithm.
We validate the effectiveness of the proposed parametrizations by exploring the quality of samples drawn from these energy-based models.
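As a toy illustration of the sampling side, the sketch below runs Metropolis-Hastings with a pseudo-log-likelihood energy and an MLM-conditional proposal. The paper's actual energy parametrizations differ, so the energy definition, model, and chain length here are assumptions for illustration only.

```python
# Hedged sketch: Metropolis-Hastings over token sequences using an MLM.
# Energy = negative pseudo-log-likelihood (an assumed, not the paper's,
# parametrization); proposal = resample one position from the MLM conditional.
import math, random
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def conditional(ids, i):
    """MLM log-probabilities over the vocabulary at position i."""
    masked = ids.clone()
    masked[i] = tok.mask_token_id
    with torch.no_grad():
        logits = model(masked.unsqueeze(0)).logits[0, i]
    return torch.log_softmax(logits, dim=-1)

def neg_energy(ids):
    """Negative energy: pseudo-log-likelihood over non-special positions."""
    return sum(conditional(ids, i)[ids[i]].item() for i in range(1, ids.size(0) - 1))

def mh_step(ids):
    i = random.randrange(1, ids.size(0) - 1)          # skip [CLS]/[SEP]
    log_q = conditional(ids, i)                       # proposal distribution at i
    new_tok = torch.multinomial(log_q.exp(), 1).item()
    proposal = ids.clone()
    proposal[i] = new_tok
    # Acceptance: target ratio times reverse/forward proposal ratio.
    log_alpha = (neg_energy(proposal) - neg_energy(ids)
                 + log_q[ids[i]].item() - log_q[new_tok].item())
    return proposal if math.log(random.random()) < log_alpha else ids

ids = tok("The cat sat on the mat.", return_tensors="pt")["input_ids"][0]
for _ in range(20):                                   # a short toy chain
    ids = mh_step(ids)
print(tok.decode(ids[1:-1]))
```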
arXiv Detail & Related papers (2021-06-04T22:04:30Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- MLMLM: Link Prediction with Mean Likelihood Masked Language Model [14.672283581769774]
Knowledge Bases (KBs) are easy to query, verifiable, and interpretable.
Masked Language Models (MLMs), such as BERT, scale with computing power as well as raw text data.
We introduce the Mean Likelihood Masked Language Model (MLMLM), an approach that compares the mean likelihood of generating different entities to perform link prediction.
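Loosely in the spirit of mean-likelihood scoring, the sketch below ranks candidate tail entities by the mean log-probability an off-the-shelf MLM assigns to the entity's tokens in masked slots. MLMLM's actual training and entity handling are more involved; the query, candidates, and model here are illustrative assumptions.

```python
# Hedged sketch: rank candidate entities for a masked query by the mean
# log-probability of their tokens, using a generic pretrained MLM.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def mean_log_likelihood(query: str, entity: str) -> float:
    """Mean log-prob of `entity`'s tokens filled into `query`'s [MASK] slot."""
    ent_ids = tok(entity, add_special_tokens=False)["input_ids"]
    text = query.replace("[MASK]", " ".join([tok.mask_token] * len(ent_ids)))
    ids = tok(text, return_tensors="pt")["input_ids"][0]
    slots = (ids == tok.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids.unsqueeze(0)).logits[0], dim=-1)
    return sum(log_probs[p, t].item() for p, t in zip(slots, ent_ids)) / len(ent_ids)

query = "The capital of France is [MASK]."
candidates = ["Paris", "Berlin", "Madrid"]
ranked = sorted(candidates, key=lambda e: mean_log_likelihood(query, e), reverse=True)
print(ranked)
```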
arXiv Detail & Related papers (2020-09-15T13:11:13Z)
- Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction [54.569707226277735]
Previous methods have potential drawbacks when applied to an EncDec model.
Our proposed method fine-tunes an MLM on a GEC corpus and then uses the output of the fine-tuned MLM as additional features in the GEC model.
The best-performing model achieves state-of-the-art performance on the BEA-2019 and CoNLL-2014 benchmarks.
arXiv Detail & Related papers (2020-05-03T04:49:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.