ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models
- URL: http://arxiv.org/abs/2501.13397v4
- Date: Wed, 05 Feb 2025 08:17:30 GMT
- Title: ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models
- Authors: Kangjie Zheng, Junwei Yang, Siyue Liang, Bin Feng, Zequn Liu, Wei Ju, Zhiping Xiao, Ming Zhang
- Abstract summary: Masked Language Models (MLMs) are trained by randomly masking portions of the input sequences with [MASK] tokens and learning to reconstruct the original content based on the remaining context.
- Score: 11.997499811414837
- Abstract: Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly masking portions of the input sequences with [MASK] tokens and learning to reconstruct the original content based on the remaining context. This paper explores the impact of [MASK] tokens on MLMs. Analytical studies show that masking tokens can introduce the corrupted semantics problem, wherein the corrupted context may convey multiple, ambiguous meanings. This problem is also a key factor affecting the performance of MLMs on downstream tasks. Based on these findings, we propose a novel enhanced-context MLM, ExLM. Our approach expands [MASK] tokens in the input context and models the dependencies between these expanded states. This enhancement increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enriches semantic representations through context enhancement, and effectively reduces the semantic multimodality commonly observed in MLMs.
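The abstract describes two mechanisms: standard [MASK] corruption and ExLM's expansion of each [MASK] into multiple states. The following minimal Python sketch illustrates both under stated assumptions: the masking rate, the expansion factor k=3, and all function names are hypothetical, and the paper's actual model learns dependencies between the expanded states, which plain token duplication cannot capture.

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, mask_prob=0.15, seed=0):
    """Standard MLM corruption: each token is independently replaced by
    [MASK] with probability `mask_prob`; the model must reconstruct it."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)   # unmasked positions are not scored
    return corrupted, targets

def expand_masks(corrupted, k=3):
    """ExLM-style context enhancement (sketch): replace each [MASK] with k
    expanded mask states, giving the model extra slots in which to represent
    the several plausible readings of the corrupted context. ExLM also models
    dependencies between these states, which this sketch omits."""
    out = []
    for tok in corrupted:
        out.extend([MASK] * k if tok == MASK else [tok])
    return out

# Once its neighbours are masked, "bank" could be financial or riverside --
# the corrupted semantics problem the abstract describes.
sentence = "the bank raised its interest rates sharply".split()
corrupted, _ = mlm_corrupt(sentence, mask_prob=0.4)
print(corrupted)
print(expand_masks(corrupted))
```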
Related papers
- Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [69.35226485836641]
Excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation.
We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE).
DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer, as sketched below.
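A minimal Python sketch of that token-exit idea, assuming a sigmoid gate over mean-pooled text states; the hidden size, gate architecture, and 0.5 threshold are illustrative guesses, not the paper's actual design:

```python
import torch
import torch.nn as nn

class DyVTESketch(nn.Module):
    """Toy gate in the spirit of DyVTE (all sizes are assumptions): a small
    hyper-network reads the text-token states at a given layer and decides
    whether all visual tokens can exit (be removed) from later layers."""

    def __init__(self, d=768):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(d, d // 4), nn.ReLU(), nn.Linear(d // 4, 1))

    def forward(self, text_states, visual_tokens):
        # text_states: (B, T, d); visual_tokens: (B, V, d)
        exit_score = torch.sigmoid(self.gate(text_states.mean(dim=1)))  # (B, 1)
        if bool((exit_score > 0.5).all()):  # text has absorbed the visual info
            return text_states              # later layers run on text only
        return torch.cat([visual_tokens, text_states], dim=1)

gate = DyVTESketch()
out = gate(torch.randn(2, 16, 768), torch.randn(2, 196, 768))
print(out.shape)  # (2, 16, 768) if tokens exit, else (2, 212, 768)
```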
arXiv Detail & Related papers (2024-11-29T11:24:23Z)
- Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy [37.471419716572086]
There is a significant gap in instruction-following capabilities between Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).
We propose Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI) strategies to alleviate this gap.
arXiv Detail & Related papers (2024-11-23T05:03:32Z)
- Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok).
SeTok groups visual features into semantic units via a dynamic clustering algorithm, as in the sketch below.
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
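A toy sketch of clustering-style vision tokenization in the spirit of SeTok; the greedy cosine-similarity rule and the threshold are assumptions, not the paper's actual dynamic clustering algorithm:

```python
import torch

def cluster_tokens(features, threshold=0.6):
    """Greedily merge patch features into a cluster when their cosine
    similarity to its centroid exceeds `threshold`, so the number of vision
    tokens adapts to image content instead of being fixed."""
    centroids, counts = [], []
    for f in features:                       # features: (N, D) patch embeddings
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = torch.nn.functional.cosine_similarity(f, c, dim=0).item()
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None:                     # start a new semantic unit
            centroids.append(f.clone())
            counts.append(1)
        else:                                # running-mean centroid update
            counts[best] += 1
            centroids[best] += (f - centroids[best]) / counts[best]
    return torch.stack(centroids)            # one vision token per unit

tokens = cluster_tokens(torch.randn(196, 64))
print(tokens.shape)                           # (num_clusters, 64)
```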
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
- The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context Learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with Large Language Models (LLMs).
We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions.
Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z)
- Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code? [51.29970742152668]
We highlight that relying on accuracy-based measurements may lead to an overestimation of models' capabilities.
To address these issues, we introduce a technique called SyntaxEval for evaluating the syntactic capabilities of these models.
arXiv Detail & Related papers (2024-01-03T02:44:02Z)
- Masked and Permuted Implicit Context Learning for Scene Text Recognition [8.742571493814326]
Scene Text Recognition (STR) is difficult because of variations in text styles, shapes, and backgrounds.
We propose a masked and permuted implicit context learning network for STR that operates within a single decoder.
arXiv Detail & Related papers (2023-05-25T15:31:02Z)
- Representation Deficiency in Masked Language Modeling [107.39136254013042]
We propose MAE-LM, which pretrains the Masked Autoencoder architecture with [MASK] tokens excluded from the encoder; a rough sketch follows this entry.
We show that MAE-LM consistently outperforms pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.
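A rough Python sketch of that encoder-side exclusion, assuming a learned mask embedding re-inserted before a shallow decoder; sizes, layer counts, and names are made up, and positional encodings are omitted for brevity:

```python
import torch
import torch.nn as nn

class MAELMSketch(nn.Module):
    """The encoder never sees [MASK] tokens; a learned mask embedding is
    re-inserted at masked positions only before a shallow decoder that
    predicts the original tokens."""

    def __init__(self, vocab=30522, d=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        layer = lambda: nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), num_layers=4)
        self.decoder = nn.TransformerEncoder(layer(), num_layers=1)
        self.mask_emb = nn.Parameter(torch.zeros(d))
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, ids, is_masked):
        # ids: (L,) token ids; is_masked: (L,) bool, True at corrupted positions
        visible = self.embed(ids[~is_masked]).unsqueeze(0)
        enc = self.encoder(visible).squeeze(0)        # encoder skips [MASK]
        full = torch.empty(ids.size(0), enc.size(-1))
        full[~is_masked] = enc
        full[is_masked] = self.mask_emb               # re-insert mask states
        dec = self.decoder(full.unsqueeze(0)).squeeze(0)
        return self.lm_head(dec[is_masked])           # logits for masked slots

model = MAELMSketch()
ids = torch.randint(0, 30522, (12,))
mask = torch.rand(12) < 0.3
print(model(ids, mask).shape)                          # (num_masked, 30522)
```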
arXiv Detail & Related papers (2023-02-04T01:54:17Z)
- Contextual Representation Learning beyond Masked Language Modeling [45.46220173487394]
We analyze how masked language models (MLMs) such as BERT learn contextual representations.
To address these issues, we propose TACO, a representation learning approach that directly models global semantics.
TACO extracts contextual semantics hidden in contextualized representations to encourage models to attend to global semantics.
arXiv Detail & Related papers (2022-04-08T16:18:06Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.