Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and
Simplicity Bias in MLMs
- URL: http://arxiv.org/abs/2309.07311v5
- Date: Wed, 7 Feb 2024 21:40:55 GMT
- Title: Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and
Simplicity Bias in MLMs
- Authors: Angelica Chen, Ravid Shwartz-Ziv, Kyunghyun Cho, Matthew L. Leavitt,
Naomi Saphra
- Abstract summary: We present a case study of syntax acquisition in masked language models (MLMs).
We study Syntactic Attention Structure (SAS), a naturally emerging property of MLMs wherein specific Transformer heads tend to focus on specific syntactic relations.
We examine the causal role of SAS by manipulating SAS during training, and demonstrate that SAS is necessary for the development of grammatical capabilities.
- Score: 50.5783641817253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most interpretability research in NLP focuses on understanding the behavior
and features of a fully trained model. However, certain insights into model
behavior may only be accessible by observing the trajectory of the training
process. We present a case study of syntax acquisition in masked language
models (MLMs) that demonstrates how analyzing the evolution of interpretable
artifacts throughout training deepens our understanding of emergent behavior.
In particular, we study Syntactic Attention Structure (SAS), a naturally
emerging property of MLMs wherein specific Transformer heads tend to focus on
specific syntactic relations. We identify a brief window in pretraining when
models abruptly acquire SAS, concurrent with a steep drop in loss. This
breakthrough precipitates the subsequent acquisition of linguistic
capabilities. We then examine the causal role of SAS by manipulating SAS during
training, and demonstrate that SAS is necessary for the development of
grammatical capabilities. We further find that SAS competes with other
beneficial traits during training, and that briefly suppressing SAS improves
model quality. These findings offer an interpretation of a real-world example
of both simplicity bias and breakthrough training dynamics.
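The SAS property above can be probed with a simple attention-based metric: for a given head, count how often a token's most-attended position is its syntactic parent. The following is a minimal sketch of such a probe (not the authors' code; the function name, the toy attention map, and the gold-parent array are all illustrative assumptions):

```python
import numpy as np

def head_sas_score(attn: np.ndarray, parents: np.ndarray) -> float:
    """Fraction of tokens whose most-attended position is their
    dependency parent.  `attn` is a (seq_len, seq_len) attention map
    for a single head; `parents[i]` is the parent index of token i
    (-1 marks the root, which is excluded from the score)."""
    preds = attn.argmax(axis=-1)   # most-attended position per token
    mask = parents >= 0            # ignore the root token
    return float((preds[mask] == parents[mask]).mean())

# Toy example: 4 tokens with gold parents [1, -1, 1, 2].
attn = np.array([
    [0.1, 0.7, 0.1,  0.1],   # token 0 attends mostly to its parent (1)
    [0.6, 0.2, 0.1,  0.1],   # token 1 is the root: excluded
    [0.1, 0.8, 0.05, 0.05],  # token 2 attends mostly to its parent (1)
    [0.1, 0.1, 0.2,  0.6],   # token 3 attends to itself, not parent 2
])
parents = np.array([1, -1, 1, 2])
print(head_sas_score(attn, parents))  # 2 of 3 non-root tokens -> 0.666...
```

A head whose score is high across sentences exhibits SAS for that relation; tracking this score over pretraining checkpoints is what reveals the abrupt acquisition window described in the abstract.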
Related papers
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We investigate the interplay between generalization and memorization in large language models at scale.
With various sizes of open-source LLMs and their pretraining corpora, we observe that as the model size increases, the task-relevant $n$-gram pair data becomes increasingly important.
Our results support the hypothesis that LLMs' capabilities emerge from a delicate balance of memorization and generalization with sufficient task-related pretraining data.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- Latent Causal Probing: A Formal Perspective on Probing with Causal Models of Data [3.988614978933934]
We develop a formal perspective on probing using structural causal models (SCMs).
We extend a recent study of LMs in the context of a synthetic grid-world navigation task.
Our techniques provide robust empirical evidence for the ability of LMs to learn the latent causal concepts underlying text.
arXiv Detail & Related papers (2024-07-18T17:59:27Z)
- Advances in Self-Supervised Learning for Synthetic Aperture Sonar Data Processing, Classification, and Pattern Recognition [0.36700088931938835]
This paper proposes MoCo-SAS, which leverages self-supervised learning (SSL) for synthetic aperture sonar (SAS) data processing, classification, and pattern recognition.
The experimental results demonstrate that MoCo-SAS significantly outperforms traditional supervised learning methods.
These findings highlight the potential of SSL in advancing the state-of-the-art in SAS data processing, offering promising avenues for enhanced underwater object detection and classification.
arXiv Detail & Related papers (2023-08-12T20:59:39Z)
- Concept-aware Training Improves In-context Learning Ability of Language Models [0.0]
Many recent language models (LMs) of the Transformer family exhibit a so-called in-context learning (ICL) ability.
We propose a method to create LMs able to better utilize the in-context information.
We show that the data sampling of Concept-aware Training consistently improves models' reasoning ability.
arXiv Detail & Related papers (2023-05-23T07:44:52Z)
- An Explanation of In-context Learning as Implicit Bayesian Inference [117.19809377740188]
We study the role of the pretraining distribution on the emergence of in-context learning.
We prove that in-context learning occurs implicitly via Bayesian inference of the latent concept.
We empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
arXiv Detail & Related papers (2021-11-03T09:12:33Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data enables the models to achieve exceptional downstream performance.
We study which specific traits of the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- Did the Cat Drink the Coffee? Challenging Transformers with Generalized Event Knowledge [59.22170796793179]
Transformer Language Models (TLMs) were tested on a benchmark for the dynamic estimation of thematic fit.
Our results show that TLMs can reach performance comparable to that achieved by SDM.
However, additional analysis consistently suggests that TLMs do not capture important aspects of event knowledge.
arXiv Detail & Related papers (2021-07-22T20:52:26Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trained models succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.