The Distributional Hypothesis Does Not Fully Explain the Benefits of
Masked Language Model Pretraining
- URL: http://arxiv.org/abs/2310.16261v1
- Date: Wed, 25 Oct 2023 00:31:29 GMT
- Title: The Distributional Hypothesis Does Not Fully Explain the Benefits of
Masked Language Model Pretraining
- Authors: Ting-Rui Chiang, Dani Yogatama
- Abstract summary: We investigate whether better sample efficiency and the better generalization capability of models pretrained with masked language modeling can be attributed to the semantic similarity encoded in the pretraining data's distributional property.
Our results illustrate our limited understanding of model pretraining and provide future research directions.
- Score: 27.144616560712493
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We analyze the masked language modeling pretraining objective function from
the perspective of the distributional hypothesis. We investigate whether better
sample efficiency and the better generalization capability of models pretrained
with masked language modeling can be attributed to the semantic similarity
encoded in the pretraining data's distributional property. Via a synthetic
dataset, our analysis suggests that distributional property indeed leads to the
better sample efficiency of pretrained masked language models, but does not
fully explain the generalization capability. We also conduct analyses over two
real-world datasets and demonstrate that the distributional property does not
explain the generalization ability of pretrained natural language models
either. Our results illustrate our limited understanding of model pretraining
and provide future research directions.
Related papers
- Slaves to the Law of Large Numbers: An Asymptotic Equipartition Property for Perplexity in Generative Language Models [0.0]
We show that the perplexity of any large text produced by a language model must converge to the average entropy of its token distributions.
This work has possible practical applications for understanding and improving AI detection" tools.
arXiv Detail & Related papers (2024-05-22T16:23:40Z) - LMD3: Language Model Data Density Dependence [78.76731603461832]
We develop a methodology for analyzing language model task performance at the individual example level based on training data density estimation.
Experiments with paraphrasing as a controlled intervention on finetuning data demonstrate that increasing the support in the training distribution for specific test queries results in a measurable increase in density.
We conclude that our framework can provide statistical evidence of the dependence of a target model's predictions on subsets of its training data.
arXiv Detail & Related papers (2024-05-10T09:03:27Z) - On the Generalization of Diffusion Model [42.447639515467934]
We define the generalization of the generative model, which is measured by the mutual information between the generated data and the training set.
We show that for the empirical optimal diffusion model, the data generated by a deterministic sampler are all highly related to the training set, thus poor generalization.
We propose another training objective whose empirical optimal solution has no potential generalization problem.
arXiv Detail & Related papers (2023-05-24T04:27:57Z) - Explaining Language Models' Predictions with High-Impact Concepts [11.47612457613113]
We propose a complete framework for extending concept-based interpretability methods to NLP.
We optimize for features whose existence causes the output predictions to change substantially.
Our method achieves superior results on predictive impact, usability, and faithfulness compared to the baselines.
arXiv Detail & Related papers (2023-05-03T14:48:27Z) - Pathologies of Pre-trained Language Models in Few-shot Fine-tuning [50.3686606679048]
We show that pre-trained language models with few examples show strong prediction bias across labels.
Although few-shot fine-tuning can mitigate the prediction bias, our analysis shows models gain performance improvement by capturing non-task-related features.
These observations alert that pursuing model performance with fewer examples may incur pathological prediction behavior.
arXiv Detail & Related papers (2022-04-17T15:55:18Z) - A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z) - Instance-Based Neural Dependency Parsing [56.63500180843504]
We develop neural models that possess an interpretable inference process for dependency parsing.
Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set.
arXiv Detail & Related papers (2021-09-28T05:30:52Z) - Information-theoretic Evolution of Model Agnostic Global Explanations [10.921146104622972]
We present a novel model-agnostic approach that derives rules to globally explain the behavior of classification models trained on numerical and/or categorical data.
Our approach has been deployed in a leading digital marketing suite of products.
arXiv Detail & Related papers (2021-05-14T16:52:16Z) - Masked Language Modeling and the Distributional Hypothesis: Order Word
Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM)-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trains succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z) - Exploring Lexical Irregularities in Hypothesis-Only Models of Natural
Language Inference [5.283529004179579]
Natural Language Inference (NLI) or Recognizing Textual Entailment (RTE) is the task of predicting the entailment relation between a pair of sentences.
Models that understand entailment should encode both, the premise and the hypothesis.
Experiments by Poliak et al. revealed a strong preference of these models towards patterns observed only in the hypothesis.
arXiv Detail & Related papers (2021-01-19T01:08:06Z) - Video Prediction via Example Guidance [156.08546987158616]
In video prediction tasks, one major challenge is to capture the multi-modal nature of future contents and dynamics.
In this work, we propose a simple yet effective framework that can efficiently predict plausible future states.
arXiv Detail & Related papers (2020-07-03T14:57:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.