Related papers: Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs

URL: http://arxiv.org/abs/2510.20475v1
Date: Thu, 23 Oct 2025 12:15:24 GMT
Title: Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs
Authors: Lukas Edman, Alexander Fraser
Abstract summary: We describe our strategy for the 2025 edition of the BabyLM Challenge. Our main contribution is that of an improved form of Masked Language Modeling (MLM), which adapts the probabilities of the tokens masked according to the model's ability to predict them.
Score: 54.626578706811436
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We describe our strategy for the 2025 edition of the BabyLM Challenge. Our main contribution is that of an improved form of Masked Language Modeling (MLM), which adapts the probabilities of the tokens masked according to the model's ability to predict them. The results show a substantial increase in performance on (Super)GLUE tasks over the standard MLM. We also incorporate sub-token embeddings, finding that this increases the model's morphological generalization capabilities. Our submission beats the baseline in the strict-small track.
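
The abstract says the masking probabilities adapt to how well the model predicts each token, but it does not spell out the mechanism. The following is only a minimal sketch of one way such a scheme could look, not the authors' implementation: the class `AdaptiveMasker`, its parameters (`mask_rate`, `momentum`, `floor`), and the running-average update rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict


class AdaptiveMasker:
    """Hypothetical sketch: tokens the model predicts poorly get masked more often."""

    def __init__(self, mask_rate=0.15, momentum=0.9, floor=0.05, seed=0):
        self.mask_rate = mask_rate  # fraction of positions masked per sequence
        self.momentum = momentum    # smoothing factor for the running difficulty
        self.floor = floor          # minimum weight so easy tokens are still masked sometimes
        # Running estimate of 1 - p(correct) per token type; unseen tokens start as "hard".
        self.difficulty = defaultdict(lambda: 1.0)
        self.rng = random.Random(seed)

    def choose_positions(self, tokens):
        """Sample positions to mask, weighted by each token's current difficulty."""
        if not tokens:
            return []
        k = min(len(tokens), max(1, round(self.mask_rate * len(tokens))))
        positions = list(range(len(tokens)))
        weights = [max(self.difficulty[t], self.floor) for t in tokens]
        chosen = []
        # Weighted sampling without replacement: draw one index, then remove it.
        for _ in range(k):
            pick = self.rng.choices(range(len(positions)), weights=weights)[0]
            chosen.append(positions.pop(pick))
            weights.pop(pick)
        return sorted(chosen)

    def update(self, token, prob_correct):
        """Fold the model's probability of the correct token into the running difficulty."""
        old = self.difficulty[token]
        self.difficulty[token] = self.momentum * old + (1 - self.momentum) * (1 - prob_correct)


if __name__ == "__main__":
    masker = AdaptiveMasker()
    # Pretend the model already predicts "the" almost perfectly.
    for _ in range(50):
        masker.update("the", prob_correct=0.99)
    sentence = ["the"] * 19 + ["zygote"]
    print(masker.choose_positions(sentence))
```

In a real training loop, `prob_correct` would come from the model's output distribution at each masked position. The `floor` keeps well-learned tokens from dropping out of the masking pool entirely, which is a design choice of this sketch rather than something stated in the abstract.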

Related papers

WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens [69.97021957331326]
We propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization. We also introduce a VAE branch with linear projection to recover fine-grained image details.
arXiv Detail & Related papers (2025-12-02T09:02:20Z)
Boosting Large Language Models with Mask Fine-Tuning [60.56962908455601]
We introduce Mask Fine-Tuning (MFT) to show that properly breaking the integrity of the model can surprisingly lead to improved performance. Experiments show that MFT gains a consistent performance boost across various domains and backbones.
arXiv Detail & Related papers (2025-03-27T20:17:57Z)
AntLM: Bridging Causal and Masked Language Models [17.674125980976665]
Causal Language Modeling (CLM) and Masked Language Modeling (MLM) are two mainstream learning paradigms based on Transformer networks. We propose a novel language modeling paradigm named AntLM, which integrates both CLM and MLM.
arXiv Detail & Related papers (2024-12-04T12:34:15Z)
LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension [45.856469849910496]
Vision Language Models (VLMs) have demonstrated remarkable capabilities in various open-vocabulary tasks, yet their zero-shot performance lags behind task-specific fine-tuned models. We propose LLM-wrapper, a method for 'black-box' adaptation of VLMs for the Referring Expression task using Large Language Models (LLMs). Our approach offers several advantages: it enables the adaptation of closed-source models without needing access to their internal workings.
arXiv Detail & Related papers (2024-09-18T12:32:25Z)
Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code? [51.29970742152668]
We highlight that relying on accuracy-based measurements may lead to an overestimation of models' capabilities. To address these issues, we introduce a technique called SyntaxEval for evaluating syntactic capabilities.
arXiv Detail & Related papers (2024-01-03T02:44:02Z)
Uniform Masking Prevails in Vision-Language Pretraining [26.513450527203453]
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining. This paper shows that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks.
arXiv Detail & Related papers (2022-12-10T04:02:19Z)
Learning Better Masking for Better Language Model Pre-training [80.31112722910787]
Masked Language Modeling has been widely used as a denoising objective in pre-training language models (PrLMs). PrLMs commonly adopt a Random-Token Masking strategy where a fixed masking ratio is applied and different contents are masked with equal probability throughout the entire training. We propose two scheduled masking approaches that adaptively tune the masking ratio and masked content in different training stages.
arXiv Detail & Related papers (2022-08-23T08:27:52Z)
Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
Probabilistically Masked Language Model Capable of Autoregressive Generation in Arbitrary Word Order [32.71489048856101]
Masked language models and autoregressive language models are two types of language models. We propose a probabilistic masking scheme for the masked language model, which we call the probabilistically masked language model (PMLM). We prove that u-PMLM is equivalent to an autoregressive permutated language model.
arXiv Detail & Related papers (2020-04-24T07:38:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.