On Masked Pre-training and the Marginal Likelihood
- URL: http://arxiv.org/abs/2306.00520v1
- Date: Thu, 1 Jun 2023 10:20:44 GMT
- Title: On Masked Pre-training and the Marginal Likelihood
- Authors: Pablo Moreno-Muñoz, Pol G. Recasens and Søren Hauberg
- Abstract summary: Masked pre-training removes random input dimensions and learns a model that can predict the missing values.
This paper shows that masked pre-training with a suitable cumulative scoring function corresponds to maximizing the model's marginal likelihood.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked pre-training removes random input dimensions and learns a model that
can predict the missing values. Empirical results indicate that this intuitive
form of self-supervised learning yields models that generalize very well to new
domains. A theoretical understanding is, however, lacking. This paper shows
that masked pre-training with a suitable cumulative scoring function
corresponds to maximizing the model's marginal likelihood, which is de facto
the Bayesian model selection measure of generalization. Beyond shedding light
on the success of masked pre-training, this insight also suggests that Bayesian
models can be trained with appropriately designed self-supervision.
Empirically, we confirm the developed theory and explore the main learning
principles of masked pre-training in large language models.
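To make the claimed correspondence concrete, a minimal sketch follows (not the authors' code; it uses a toy multivariate Gaussian in place of a trained network, and all variable names are illustrative). It checks the chain-rule identity the result rests on: revealing input dimensions in a random order and accumulating the model's conditional log-predictive score for each newly unmasked dimension reproduces the joint log-likelihood exactly, and when those conditionals are a Bayesian model's posterior predictives the same cumulative score is the log marginal likelihood.

```python
# Toy check (an assumption-laden sketch, not the paper's code): a cumulative
# masked-prediction score over a random unmasking order equals the joint
# log-likelihood, by the chain rule of probability.
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
D = 5
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)          # a valid covariance matrix
mu = rng.normal(size=D)
x = multivariate_normal(mu, Sigma).rvs(random_state=1)

def conditional_logpdf(i, observed, x, mu, Sigma):
    """log p(x_i | x_observed) under the Gaussian 'model'."""
    if not observed:
        return norm(mu[i], np.sqrt(Sigma[i, i])).logpdf(x[i])
    S = np.array(observed)
    K_oo = Sigma[np.ix_(S, S)]
    K_io = Sigma[i, S]
    cond_mean = mu[i] + K_io @ np.linalg.solve(K_oo, x[S] - mu[S])
    cond_var = Sigma[i, i] - K_io @ np.linalg.solve(K_oo, K_io)
    return norm(cond_mean, np.sqrt(cond_var)).logpdf(x[i])

# Cumulative masked score: unmask dimensions one at a time in a random order,
# scoring each newly revealed dimension given everything revealed so far.
order = rng.permutation(D)
cumulative_score, observed = 0.0, []
for i in order:
    cumulative_score += conditional_logpdf(i, observed, x, mu, Sigma)
    observed.append(i)

joint = multivariate_normal(mu, Sigma).logpdf(x)
print(cumulative_score, joint)           # equal up to numerical precision
```

The equality holds for any unmasking order, so averaging the cumulative score over random masks, as masked pre-training effectively does, still targets the same joint quantity.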
Related papers
- Quantifying the Prediction Uncertainty of Machine Learning Models for Individual Data [2.1248439796866228]
This study investigates the learnability of the predictive normalized maximum likelihood (pNML) approach for linear regression and neural networks.
It demonstrates that pNML can improve the performance and robustness of these models on various tasks.
arXiv Detail & Related papers (2024-12-10T13:58:19Z)
- Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics (a schematic sketch of this estimator appears after the list below).
arXiv Detail & Related papers (2024-06-06T17:59:09Z)
- On the Generalization Ability of Unsupervised Pretraining [53.06175754026037]
Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization.
This paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase.
Our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.
arXiv Detail & Related papers (2024-03-11T16:23:42Z)
- Predictive Churn with the Set of Good Models [64.05949860750235]
We study the effect of conflicting predictions over the set of near-optimal machine learning models.
We present theoretical results on the expected churn between models within the Rashomon set.
We show how our approach can be used to better anticipate, reduce, and avoid churn in consumer-facing applications (a minimal churn computation is sketched after the list below).
arXiv Detail & Related papers (2024-02-12T16:15:25Z)
- The Distributional Hypothesis Does Not Fully Explain the Benefits of Masked Language Model Pretraining [27.144616560712493]
We investigate whether better sample efficiency and the better generalization capability of models pretrained with masked language modeling can be attributed to the semantic similarity encoded in the pretraining data's distributional property.
Our results illustrate our limited understanding of model pretraining and provide future research directions.
arXiv Detail & Related papers (2023-10-25T00:31:29Z)
- A Mathematical Framework for Learning Probability Distributions [0.0]
Generative modeling and density estimation have become immensely popular subjects in recent years.
This paper provides a mathematical framework such that all the well-known models can be derived based on simple principles.
In particular, we prove that these models enjoy implicit regularization during training, so that the generalization error at early stopping avoids the curse of dimensionality.
arXiv Detail & Related papers (2022-12-22T04:41:45Z)
- Exploring Target Representations for Masked Autoencoders [78.57196600585462]
We show that a careful choice of the target representation is unnecessary for learning good representations.
We propose a multi-stage masked distillation pipeline and use a randomly initialized model as the teacher.
The proposed method, masked knowledge distillation with bootstrapped teachers (dBOT), outperforms previous self-supervised methods by nontrivial margins.
arXiv Detail & Related papers (2022-09-08T16:55:19Z)
- Masked prediction tasks: a parameter identifiability view [49.533046139235466]
We focus on the widely used self-supervised learning method of predicting masked tokens.
We show that there is a rich landscape of possibilities, out of which some prediction tasks yield identifiability, while others do not.
arXiv Detail & Related papers (2022-02-18T17:09:32Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Demystifying Code Summarization Models [5.608277537412537]
We evaluate four prominent code summarization models: extreme summarizer, code2vec, code2seq, and sequence GNN.
Results show that all models base their predictions on syntactic and lexical properties with little to no semantic implication.
We present a novel approach to explaining the predictions of code summarization models through the lens of training data.
arXiv Detail & Related papers (2021-02-09T03:17:46Z)
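For the "Causal Estimation of Memorisation Profiles" entry above, the sketch below shows the classical difference-in-differences estimator that summary refers to; the function and array names are illustrative assumptions, not the paper's interface.

```python
# Schematic difference-in-differences (DiD) estimator of memorisation; all
# names are illustrative assumptions, not taken from the paper.
import numpy as np

def did_memorisation(ll_treated_before, ll_treated_after,
                     ll_control_before, ll_control_after):
    """Memorisation estimate: the improvement in mean log-likelihood on
    instances included in training (treated), minus the improvement on
    comparable instances that were not (control)."""
    treated_gain = np.mean(ll_treated_after) - np.mean(ll_treated_before)
    control_gain = np.mean(ll_control_after) - np.mean(ll_control_before)
    return treated_gain - control_gain
```

Subtracting the control group's gain removes the improvement that ordinary training progress would have produced anyway, leaving the effect attributable to having trained on the treated instances.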
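For the "Predictive Churn with the Set of Good Models" entry, here is a minimal sketch of the churn quantity as it is commonly defined, i.e. the fraction of examples on which two near-optimal models disagree; names are illustrative.

```python
# Minimal sketch of predictive churn between two classifiers: the fraction of
# inputs on which their predicted labels disagree. Names are illustrative.
import numpy as np

def churn(preds_a: np.ndarray, preds_b: np.ndarray) -> float:
    """Fraction of inputs where the two models predict different labels."""
    return float(np.mean(preds_a != preds_b))

# Two models that disagree on 2 of 5 inputs give a churn of 0.4.
print(churn(np.array([0, 1, 1, 0, 1]), np.array([0, 1, 0, 0, 0])))
```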
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.