Representation Deficiency in Masked Language Modeling
- URL: http://arxiv.org/abs/2302.02060v2
- Date: Sat, 16 Mar 2024 04:28:35 GMT
- Title: Representation Deficiency in Masked Language Modeling
- Authors: Yu Meng, Jitin Krishnan, Sinong Wang, Qifan Wang, Yuning Mao, Han Fang, Marjan Ghazvininejad, Jiawei Han, Luke Zettlemoyer
- Abstract summary: We propose MAE-LM, which pretrains the Masked Autoencoder architecture with MLM where $\texttt{[MASK]}$ tokens are excluded from the encoder.
We show that MAE-LM consistently outperforms MLM-pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.
- Score: 107.39136254013042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked Language Modeling (MLM) has been one of the most prominent approaches for pretraining bidirectional text encoders due to its simplicity and effectiveness. One notable concern about MLM is that the special $\texttt{[MASK]}$ symbol causes a discrepancy between pretraining data and downstream data as it is present only in pretraining but not in fine-tuning. In this work, we offer a new perspective on the consequence of such a discrepancy: We demonstrate empirically and theoretically that MLM pretraining allocates some model dimensions exclusively for representing $\texttt{[MASK]}$ tokens, resulting in a representation deficiency for real tokens and limiting the pretrained model's expressiveness when it is adapted to downstream data without $\texttt{[MASK]}$ tokens. Motivated by the identified issue, we propose MAE-LM, which pretrains the Masked Autoencoder architecture with MLM where $\texttt{[MASK]}$ tokens are excluded from the encoder. Empirically, we show that MAE-LM improves the utilization of model dimensions for real token representations, and MAE-LM consistently outperforms MLM-pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.
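To make the architectural idea concrete, below is a minimal PyTorch sketch of an MAE-LM-style model: masked positions are hidden from the encoder (here via attention masking rather than physically packing the kept tokens) and are only re-introduced, as a learned mask embedding plus position, in a shallow decoder that predicts the original tokens. The class name MAELMSketch, all layer sizes, and the masking shortcut are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of an MAE-LM-style model (illustrative, not the paper's code).
import torch
import torch.nn as nn


class MAELMSketch(nn.Module):
    """Encoder sees only real (unmasked) tokens; [MASK] enters only a shallow decoder."""

    def __init__(self, vocab_size=30522, d_model=256, n_enc_layers=4,
                 n_dec_layers=2, n_heads=4, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.mask_emb = nn.Parameter(torch.zeros(d_model))  # learned stand-in for [MASK]
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_enc_layers)
        dec_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, n_dec_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids, mlm_mask):
        # input_ids: (B, L) original token ids; mlm_mask: (B, L) bool, True = masked out.
        B, L = input_ids.shape
        pos = torch.arange(L, device=input_ids.device).unsqueeze(0).expand(B, L)
        keep = (~mlm_mask).unsqueeze(-1).float()

        # Encoder input: real tokens only. Masked slots are zeroed and excluded as
        # attention keys, approximating MAE-style exclusion without re-packing.
        x = self.tok_emb(input_ids) * keep + self.pos_emb(pos)
        enc_out = self.encoder(x, src_key_padding_mask=mlm_mask)

        # Decoder input: re-insert the learned mask embedding (plus position) at masked slots.
        dec_in = torch.where(mlm_mask.unsqueeze(-1),
                             self.mask_emb + self.pos_emb(pos),
                             enc_out)
        dec_out = self.decoder(dec_in)
        return self.lm_head(dec_out)  # MLM loss is taken only at masked positions


# Toy usage: predict the original ids at the masked positions.
model = MAELMSketch()
ids = torch.randint(0, 30522, (2, 16))
mask = torch.rand(2, 16) < 0.15
mask[:, 3] = True  # ensure at least one masked position per sequence
logits = model(ids, mask)
loss = nn.functional.cross_entropy(logits[mask], ids[mask])
```

Since the encoder never processes [MASK] embeddings, none of its dimensions need to be reserved for representing them, which is the representation-deficiency argument made in the abstract.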
Related papers
- ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models [11.997499811414837]
Masked Language Models (MLMs) are trained by randomly masking portions of the input sequences with [MASK] tokens and learning to reconstruct the original content based on the remaining context.
arXiv Detail & Related papers (2025-01-23T05:46:50Z) - Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [69.35226485836641]
Excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation.
We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE).
DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z) - Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code? [51.29970742152668]
We highlight that relying on accuracy-based measurements may lead to an overestimation of models' capabilities.
To address these issues, we introduce SyntaxEval, a technique for evaluating the syntactic capabilities of MLMs.
arXiv Detail & Related papers (2024-01-03T02:44:02Z) - Masked and Permuted Implicit Context Learning for Scene Text Recognition [8.742571493814326]
Scene Text Recognition (STR) is difficult because of variations in text styles, shapes, and backgrounds.
We propose a masked and permuted implicit context learning network for STR that works within a single decoder.
arXiv Detail & Related papers (2023-05-25T15:31:02Z) - Inconsistencies in Masked Language Models [20.320583166619528]
Masked language models (MLMs) can provide distributions of tokens in the masked positions in a sequence.
Distributions corresponding to different masking patterns can demonstrate considerable inconsistencies.
We propose an inference-time strategy for MLMs called Ensemble of Conditionals (a rough sketch of the idea appears after this list).
arXiv Detail & Related papers (2022-12-30T22:53:25Z) - On a Benefit of Mask Language Modeling: Robustness to Simplicity Bias [4.7210697296108926]
Despite the success of masked language model (MLM) pretraining, why pretraining is useful is still not fully answered.
We theoretically and empirically show that MLM pretraining makes models robust to spurious features, partly answering the question.
We close the gap between our theory and real-world practice by conducting experiments on hate speech detection and named entity recognition tasks.
arXiv Detail & Related papers (2021-10-11T14:18:29Z) - Frustratingly Simple Pretraining Alternatives to Masked Language Modeling [10.732163031244651]
Masked language modeling (MLM) is widely used in natural language processing for learning text representations.
In this paper, we explore five simple pretraining objectives based on token-level classification tasks as replacements for MLM.
arXiv Detail & Related papers (2021-09-04T08:52:37Z) - Fast, Effective and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders [66.76141128555099]
We show that it is possible to turn MLMs into universal lexical and sentence encoders even without any additional data and without supervision.
We propose an extremely simple, fast and effective contrastive learning technique, termed Mirror-BERT.
Mirror-BERT relies on fully identical or slightly modified string pairs as positive (i.e., synonymous) fine-tuning examples.
We report huge gains over off-the-shelf MLMs with Mirror-BERT in both lexical-level and sentence-level tasks, across different domains and different languages.
arXiv Detail & Related papers (2021-04-16T10:49:56Z) - Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z) - Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction [54.569707226277735]
Previous methods for incorporating a pre-trained MLM into an encoder-decoder (EncDec) model have potential drawbacks.
Our proposed method fine-tunes a pre-trained MLM on a GEC corpus and then uses the output of the fine-tuned MLM as additional features in the GEC model.
The best-performing model achieves state-of-the-art performance on the BEA-2019 and CoNLL-2014 benchmarks.
arXiv Detail & Related papers (2020-05-03T04:49:31Z)
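As referenced in the "Inconsistencies in Masked Language Models" entry above, the snippet below sketches the general idea of ensembling the conditionals an MLM produces under different masking patterns. The model name, example sentence, target index, and masking patterns are hypothetical choices for illustration; this is not the authors' exact Ensemble of Conditionals procedure.

```python
# Hypothetical sketch: average an MLM's predicted distributions for one position
# under several masking patterns (the general idea behind ensembling conditionals).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

ids = tok("The capital of France is Paris.", return_tensors="pt")["input_ids"][0]
target = 6  # position of "paris" in the tokenized sequence (assumed index)

# Masking patterns that all hide the target position, plus different extra tokens.
patterns = [[target], [target, 2], [target, 4]]

probs = []
with torch.no_grad():
    for pattern in patterns:
        masked = ids.clone()
        masked[pattern] = tok.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, target]
        probs.append(logits.softmax(-1))  # one conditional per masking pattern

# Individual conditionals can disagree; averaging them smooths the inconsistency.
ensemble = torch.stack(probs).mean(0)
print(tok.decode([ensemble.argmax().item()]))
```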