On a Benefit of Mask Language Modeling: Robustness to Simplicity Bias
- URL: http://arxiv.org/abs/2110.05301v1
- Date: Mon, 11 Oct 2021 14:18:29 GMT
- Title: On a Benefit of Mask Language Modeling: Robustness to Simplicity Bias
- Authors: Ting-Rui Chiang
- Abstract summary: Despite the success of pretrained masked language models (MLM), why MLM pretraining is useful is still not fully answered.
We theoretically and empirically show that MLM pretraining makes models robust to lexicon-level spurious features, partly answering the question.
We close the gap between our theories and real-world practice by conducting experiments on hate speech detection and named entity recognition tasks.
- Score: 4.7210697296108926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the success of pretrained masked language models (MLM), why MLM
pretraining is useful is still a question not fully answered. In this work we
theoretically and empirically show that MLM pretraining makes models robust to
lexicon-level spurious features, partly answering the question. We theoretically
show that, when we can model the distribution $\Pi$ of a spurious feature
conditioned on the context, then (1) $\Pi$ is at least as informative as the
spurious feature, and (2) learning from $\Pi$ is at least as simple as learning
from the spurious feature. Therefore, MLM pretraining rescues the model from
the simplicity bias caused by the spurious feature. We also explore the
efficacy of MLM pretraining in causal settings. Finally, we close the gap between
our theories and real-world practice by conducting experiments on hate
speech detection and named entity recognition tasks.
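For concreteness, claims (1) and (2) can be written out as follows; this is an illustrative formalization, not necessarily the exact statement proved in the paper, and the symbols $s$, $c$, and the mutual-information reading of "informative" are editorial. Let $s(x)$ denote the lexicon-level spurious feature of an input $x$ with label $Y$, let $c(x)$ denote its context, and let $\Pi(x) = p(s \mid c(x))$ be the context-conditioned distribution that MLM pretraining learns to model. Claim (1) then reads
$$ I\bigl(Y;\, \Pi(X)\bigr) \;\ge\; I\bigl(Y;\, s(X)\bigr), $$
and claim (2) reads: any downstream predictor of the form $g(s(X))$ can be matched by a predictor of comparable complexity operating on $\Pi(X)$, so a model with access to $\Pi$ is not steered by simplicity bias toward the raw spurious token.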
Related papers
- ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models [11.997499811414837]
Masked Language Models (MLMs) are trained by randomly masking portions of the input sequences with [MASK] tokens and learning to reconstruct the original content based on the remaining context.
arXiv Detail & Related papers (2025-01-23T05:46:50Z)
- Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models [54.21695754082441]
We propose a framework to teach Large Language Models (LLMs) to generate explainable stock predictions.
A reflective agent learns how to explain past stock movements through self-reasoning, while the PPO trainer trains the model to generate the most likely explanations.
Our framework can outperform both traditional deep-learning and LLM methods in prediction accuracy and Matthews correlation coefficient.
arXiv Detail & Related papers (2024-02-06T03:18:58Z)
- Are Large Language Models Temporally Grounded? [38.481606493496514]
We provide large language models (LLMs) with textual narratives.
We probe them with respect to their common-sense knowledge of the structure and duration of events.
We evaluate state-of-the-art LLMs on three tasks reflecting these abilities.
arXiv Detail & Related papers (2023-11-14T18:57:15Z)
- Democratizing Reasoning Ability: Tailored Learning from Large Language Model [97.4921006089966]
We propose a tailored learning approach to distill such reasoning ability to smaller LMs.
We exploit the potential of LLMs as reasoning teachers by building an interactive multi-round learning paradigm.
To exploit the reasoning potential of the smaller LM, we propose self-reflection learning to motivate the student to learn from self-made mistakes.
arXiv Detail & Related papers (2023-10-20T07:50:10Z)
- Representation Deficiency in Masked Language Modeling [107.39136254013042]
We propose MAE-LM, which pretrains the Masked Autoencoder architecture, where [MASK] tokens are excluded from the encoder.
We show that MAE-LM consistently outperforms MLM-pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.
arXiv Detail & Related papers (2023-02-04T01:54:17Z)
- Fast, Effective and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders [66.76141128555099]
We show that it is possible to turn MLMs into universal lexical and sentence encoders even without any additional data and without supervision.
We propose an extremely simple, fast and effective contrastive learning technique, termed Mirror-BERT.
Mirror-BERT relies on fully identical or slightly modified string pairs as positive (i.e., synonymous) fine-tuning examples.
We report huge gains over off-the-shelf MLMs with Mirror-BERT in both lexical-level and sentence-level tasks, across different domains and different languages.
arXiv Detail & Related papers (2021-04-16T10:49:56Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- Warped Language Models for Noise Robust Language Understanding [11.017026606760728]
Masked Language Models (MLMs) are self-supervised neural networks trained to fill in the blanks left by masked tokens in a given sentence.
We show that natural language understanding systems built on top of warped language models (WLMs) perform better compared to those built on MLMs.
arXiv Detail & Related papers (2020-11-03T18:26:28Z)
- Language Model Prior for Low-Resource Neural Machine Translation [85.55729693003829]
We propose a novel approach to incorporate an LM as a prior in a neural translation model (TM).
We add a regularization term, which pushes the output distributions of the TM to be probable under the LM prior (see the sketch below).
Results on two low-resource machine translation datasets show clear improvements even with limited monolingual data.
arXiv Detail & Related papers (2020-04-30T16:29:56Z)
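For illustration, a minimal sketch of this kind of LM-prior regularization is given below. This is not the paper's released code: it assumes the translation model (TM) and the language model (LM) emit per-token logits over a shared target vocabulary, and the weight `lam` and temperature `tau` are placeholder hyperparameters.

```python
import torch
import torch.nn.functional as F

def lm_prior_loss(tm_logits, lm_logits, targets, pad_id, lam=0.5, tau=2.0):
    """Token-level cross-entropy on the references plus a KL term that keeps
    the TM's output distribution probable under the LM prior (sketch only;
    `lam` and `tau` are illustrative, not values from the paper)."""
    # tm_logits, lm_logits: (batch, time, vocab); targets: (batch, time)
    ce = F.cross_entropy(tm_logits.transpose(1, 2), targets, ignore_index=pad_id)
    # Per-token KL(TM || LM) over temperature-smoothed distributions.
    tm_logp = F.log_softmax(tm_logits / tau, dim=-1)
    lm_logp = F.log_softmax(lm_logits / tau, dim=-1)
    kl = F.kl_div(lm_logp, tm_logp, log_target=True, reduction="none").sum(-1)
    mask = (targets != pad_id).float()  # ignore padding positions
    return ce + lam * (kl * mask).sum() / mask.sum()
```

In such a setup the LM would typically be pretrained on target-side monolingual data and kept frozen, so the extra training cost is one LM forward pass per batch.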
This list is automatically generated from the titles and abstracts of the papers in this site.