Masked Language Modeling and the Distributional Hypothesis: Order Word
Matters Pre-training for Little
- URL: http://arxiv.org/abs/2104.06644v1
- Date: Wed, 14 Apr 2021 06:30:36 GMT
- Title: Masked Language Modeling and the Distributional Hypothesis: Order Word
Matters Pre-training for Little
- Authors: Koustuv Sinha, Robin Jia, Dieuwke Hupkes, Joelle Pineau, Adina
Williams, Douwe Kiela
- Abstract summary: A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
- Score: 74.49773960145681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A possible explanation for the impressive performance of masked language
model (MLM) pre-training is that such models have learned to represent the
syntactic structures prevalent in classical NLP pipelines. In this paper, we
propose a different explanation: MLMs succeed on downstream tasks almost
entirely due to their ability to model higher-order word co-occurrence
statistics. To demonstrate this, we pre-train MLMs on sentences with randomly
shuffled word order, and show that these models still achieve high accuracy
after fine-tuning on many downstream tasks -- including on tasks specifically
designed to be challenging for models that ignore word order. Our models
perform surprisingly well according to some parametric syntactic probes,
indicating possible deficiencies in how we test representations for syntactic
information. Overall, our results show that purely distributional information
largely explains the success of pre-training, and underscore the importance of
curating challenging evaluation datasets that require deeper linguistic
knowledge.
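
The manipulation described in the abstract is easy to picture in code: each pre-training sentence keeps its bag of words but loses its word order, so sentence-level co-occurrence statistics survive while syntax does not. The snippet below is a minimal sketch under that reading; the whitespace tokenization and function names are illustrative assumptions, not necessarily the authors' actual preprocessing pipeline.

```python
import random


def shuffle_sentence(sentence: str, rng: random.Random) -> str:
    """Shuffle the whitespace tokens of a single sentence.

    Word order is destroyed within the sentence, but the bag of words --
    and hence within-sentence co-occurrence statistics -- is preserved.
    """
    tokens = sentence.split()
    rng.shuffle(tokens)
    return " ".join(tokens)


def build_shuffled_corpus(sentences, seed=0):
    """Produce a word-order-ablated copy of a pre-training corpus."""
    rng = random.Random(seed)
    return [shuffle_sentence(s, rng) for s in sentences]


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "masked language models predict held-out tokens",
    ]
    for original, shuffled in zip(corpus, build_shuffled_corpus(corpus)):
        print(f"{original!r} -> {shuffled!r}")
```

A corpus built this way can then be fed to the same MLM pre-training recipe as the natural-order corpus, which is the comparison the abstract describes.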
Related papers
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
Large language models (LLMs) have sparked debate over whether they genuinely generalize to unseen tasks or rely on memorizing vast amounts of pretraining data.
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the LLM output probabilities and how frequently those outputs appear in the pretraining data; a minimal sketch of this measure follows this entry.
This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
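
The distributional memorization measure summarized above correlates a model's output probabilities with how often the corresponding text occurs in the pretraining data. Below is a minimal sketch of one way to compute such a correlation; the n-gram counting, the Spearman statistic, and all names here are illustrative assumptions rather than the paper's exact definition.

```python
from collections import Counter

from scipy.stats import spearmanr


def ngram_counts(corpus_tokens, n=3):
    """Count n-grams over a tokenized pretraining corpus."""
    return Counter(
        tuple(corpus_tokens[i:i + n]) for i in range(len(corpus_tokens) - n + 1)
    )


def distributional_memorization(output_logprobs, output_ngrams, counts):
    """Rank-correlate model output log-probabilities with pretraining frequency.

    output_logprobs: one log-probability per generated span
    output_ngrams:   the n-gram looked up for each span
    counts:          Counter built from the pretraining corpus
    """
    freqs = [counts.get(ngram, 0) for ngram in output_ngrams]
    rho, p_value = spearmanr(output_logprobs, freqs)
    return rho, p_value


# Toy usage: correlate three output spans with their trigram frequency.
corpus = "the cat sat on the mat because the cat sat".split()
counts = ngram_counts(corpus, n=3)
spans = [("the", "cat", "sat"), ("on", "the", "mat"), ("a", "new", "idea")]
logprobs = [-1.2, -2.5, -7.0]
print(distributional_memorization(logprobs, spans, counts))
```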
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which task-specific input-output pairs are synthesized from the student LLM itself and then used to finetune that same student model.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Learning to Reduce: Optimal Representations of Structured Data in Prompting Large Language Models [42.16047343029512]
Large Language Models (LLMs) have been widely used as general-purpose AI agents.
We propose a framework, Learning to Reduce, that fine-tunes a language model to generate a reduced version of an input context.
We show that our model achieves comparable accuracies in selecting the relevant evidence from an input context.
arXiv Detail & Related papers (2024-02-22T00:41:23Z)
- Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models [54.21695754082441]
We propose a framework to teach Large Language Models (LLMs) to generate explainable stock predictions.
A reflective agent learns how to explain past stock movements through self-reasoning, while a Proximal Policy Optimization (PPO) trainer trains the model to generate the most likely explanations.
Our framework can outperform both traditional deep-learning and LLM methods in prediction accuracy and Matthews correlation coefficient.
arXiv Detail & Related papers (2024-02-06T03:18:58Z)
- Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by placing task instructions after the input sentences rather than before them; a prompt-layout sketch follows this entry.
arXiv Detail & Related papers (2023-08-23T12:36:57Z)
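
The entry above argues that where the instruction sits in the prompt matters for conditional generation. The snippet below sketches the two layouts being compared; the template strings and function name are illustrative, not the paper's exact prompt format.

```python
def build_prompt(instruction: str, source: str, instruction_last: bool = True) -> str:
    """Assemble a prompt with the task instruction before or after the input text."""
    if instruction_last:
        # Post-instruction layout: the model reads the input first,
        # then the instruction immediately before it starts generating.
        return f"{source}\n{instruction}\n"
    # Conventional layout: instruction first, then the input.
    return f"{instruction}\n{source}\n"


example = build_prompt(
    instruction="Translate the sentence above into German.",
    source="The weather is lovely today.",
)
print(example)
```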
- On Conditional and Compositional Language Model Differentiable Prompting [75.76546041094436]
Prompts have been shown to be an effective method to adapt a frozen Pretrained Language Model (PLM) to perform well on downstream tasks.
We propose a new model, Prompt Production System (PRopS), which learns to transform task instructions or input metadata into continuous prompts; a minimal sketch of the continuous-prompt idea follows this entry.
arXiv Detail & Related papers (2023-07-04T02:47:42Z)
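
The PRopS entry above hinges on turning a task instruction into continuous prompt vectors that condition a frozen PLM. The module below is a minimal sketch of that idea using a single learned projection; PRopS itself uses a more structured transformation, so treat every name and shape here as an illustrative assumption.

```python
import torch
import torch.nn as nn


class InstructionToPrompt(nn.Module):
    """Map a pooled instruction encoding to a block of continuous prompt vectors."""

    def __init__(self, instr_dim: int, prompt_len: int, model_dim: int):
        super().__init__()
        self.prompt_len = prompt_len
        self.model_dim = model_dim
        self.proj = nn.Linear(instr_dim, prompt_len * model_dim)

    def forward(self, instr_repr: torch.Tensor) -> torch.Tensor:
        # instr_repr: (batch, instr_dim) encoding of the instruction or metadata.
        prompts = self.proj(instr_repr)
        # (batch, prompt_len, model_dim): prepend these to the frozen PLM's
        # input embeddings instead of tuning the PLM's own weights.
        return prompts.view(-1, self.prompt_len, self.model_dim)


# Usage sketch: 16 prompt vectors for a 768-dimensional PLM.
generator = InstructionToPrompt(instr_dim=256, prompt_len=16, model_dim=768)
soft_prompts = generator(torch.randn(4, 256))
print(soft_prompts.shape)  # torch.Size([4, 16, 768])
```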
- AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models [7.386862225828819]
This work presents a novel dataset of naturally occurring sentences containing multiword expressions (MWEs), manually classified into a fine-grained set of meanings.
We use this dataset in two tasks designed to test i) a language model's ability to detect idiom usage, and ii) the effectiveness of a language model in generating representations of sentences containing idioms.
arXiv Detail & Related papers (2021-09-09T16:53:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.