Related papers: Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases

Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases

URL: http://arxiv.org/abs/2502.19249v2
Date: Tue, 27 May 2025 04:23:34 GMT
Title: Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Authors: Michael Y. Hu, Jackson Petty, Chuan Shi, William Merrill, Tal Linzen,
Abstract summary: Pretraining language models on formal language can improve their acquisition of natural language.<n>We show that languages capturing hierarchical dependencies indeed enable language models to achieve lower loss on natural language.<n>We also give mechanistic evidence of transfer from formal to natural language.
Score: 47.920937001420505
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when two conditions are met: the formal language should capture the dependency structures present in natural language, and it should remain within the computational limitations of the model architecture. We experiment with pre-pretraining (training on formal language before natural languages) on transformers and find that formal languages capturing hierarchical dependencies indeed enable language models to achieve lower loss on natural language and better linguistic generalization compared to other formal languages. We also find modest support for the hypothesis that the formal language should fall within the computational limitations of the architecture. Strikingly, pre-pretraining reduces loss more efficiently than training on a matched amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. Finally, we also give mechanistic evidence of transfer from formal to natural language: attention heads acquired during pre-pretraining remain crucial for the model's performance on syntactic evaluations.

Related papers

On the Acquisition of Shared Grammatical Representations in Bilingual Language Models [6.266732217239363]
We ask what happens to a monolingual language model when it begins to be trained on a second language. To find evidence of shared multilingual representations, we turn to structural priming, a method used to study grammatical representations in humans. We argue that this asymmetry may shape hypotheses about human structural priming effects.
arXiv Detail & Related papers (2025-03-05T23:27:58Z)
The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments [57.273662221547056]
In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance. We observe that the existence of a predominant language during training boosts the performance of less frequent languages. As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet whether language imbalance causes cross-lingual generalisation there is not conclusive.
arXiv Detail & Related papers (2024-04-11T17:58:05Z)
Machine Translation to Control Formality Features in the Target Language [0.9208007322096532]
This research explores how machine learning methods are used to translate from English to languages with formality. It was done by training a bilingual model in a formality-controlled setting and comparing its performance with a pre-trained multilingual model. We evaluate the official formality accuracy(ACC) by comparing the predicted masked tokens with the ground truth.
arXiv Detail & Related papers (2023-11-22T15:42:51Z)
Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally. Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
Language Models as Inductive Reasoners [125.99461874008703]
We propose a new paradigm (task) for inductive reasoning, which is to induce natural language rules from natural language facts. We create a dataset termed DEER containing 1.2k rule-fact pairs for the task, where rules and facts are written in natural language. We provide the first and comprehensive analysis of how well pretrained language models can induce natural language rules from natural language facts.
arXiv Detail & Related papers (2022-12-21T11:12:14Z)
Overcoming Barriers to Skill Injection in Language Modeling: Case Study in Arithmetic [14.618731441943847]
We develop a novel framework that enables language models to be mathematically proficient while retaining their linguistic prowess. Specifically, we offer information-theoretic interventions to overcome the catastrophic forgetting of linguistic skills that occurs while injecting non-linguistic skills into language models.
arXiv Detail & Related papers (2022-11-03T18:53:30Z)
Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions. Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not well-represent natural language semantics.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
Linking Emergent and Natural Languages via Corpus Transfer [98.98724497178247]
We propose a novel way to establish a link by corpus transfer between emergent languages and natural languages. Our approach showcases non-trivial transfer benefits for two different tasks -- language modeling and image captioning. We also introduce a novel metric to predict the transferability of an emergent language by translating emergent messages to natural language captions grounded on the same images.
arXiv Detail & Related papers (2022-03-24T21:24:54Z)
Language Models are not Models of Language [0.0]
Transfer learning has enabled large deep learning neural networks trained on the language modeling task to vastly improve performance. We argue that the term language model is misleading because deep learning models are not theoretical models of language.
arXiv Detail & Related papers (2021-12-13T22:39:46Z)
Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages. We infer this distribution from a sample of typologically diverse training languages. We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages. Previous research has shown that this is because the representations are not sufficiently aligned. In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models [27.91397366776451]
Training LSTMs on latent structure (MIDI music or Java code) improves test performance on natural language. Experiments on transfer between natural languages controlling for vocabulary overlap show that zero-shot performance on a test language is highly correlated with typological similarity to the training language.
arXiv Detail & Related papers (2020-04-30T06:24:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.