Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting
- URL: http://arxiv.org/abs/2406.00053v3
- Date: Sat, 01 Mar 2025 18:42:10 GMT
- Title: Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting
- Authors: Suraj Anand, Michael A. Lepori, Jack Merullo, Ellie Pavlick
- Abstract summary: Language models have the ability to perform in-context learning (ICL). Despite their apparent ability to learn in-context, language models are known to struggle when faced with unseen or rarely seen tokens. We study structural in-context algorithms on both synthetic and naturalistic tasks using toy models, masked language models, and autoregressive language models.
- Score: 15.69952375347308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models have the ability to perform in-context learning (ICL), allowing them to flexibly adapt their behavior based on context. This contrasts with in-weights learning (IWL), where memorized information is encoded in model parameters after iterated observations of data. An ideal model should be able to flexibly deploy both of these abilities. Despite their apparent ability to learn in-context, language models are known to struggle when faced with unseen or rarely seen tokens (Land & Bartolo, 2024). Hence, we study $\textbf{structural in-context learning}$, which we define as the ability of a model to execute in-context learning on arbitrary novel tokens -- so called because the model must generalize on the basis of e.g., sentence structure or task structure, rather than content encoded in token embeddings. We study structural in-context algorithms on both synthetic and naturalistic tasks using toy models, masked language models, and autoregressive language models. We find that structural ICL appears early in LM pretraining but quickly disappears. While it has been shown that ICL can diminish during training (Singh et al., 2023), we find that prior work does not account for structural ICL. Building on the active forgetting method of Chen et al. (2024), we introduce pretraining and finetuning methods that can modulate the preference for structural ICL and IWL. Importantly, this allows us to induce a $\textit{dual process strategy}$ where in-context and in-weights solutions coexist within a single model.
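The active forgetting mechanism that the abstract builds on can be made concrete with a short sketch: during pretraining, the token-embedding matrix is periodically re-initialized while all other parameters keep training, which discourages solutions that rely on memorized embedding content and leaves room for structural, in-context solutions. The snippet below is a minimal, hypothetical sketch of such a loop in PyTorch; the model interface (a HuggingFace-style get_input_embeddings() accessor and a forward pass that returns logits), the reset interval k, and the optimizer settings are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def reset_embeddings(model: nn.Module) -> None:
    """Re-initialize the input embedding table in place (the 'forgetting' step)."""
    emb = model.get_input_embeddings()  # assumed HuggingFace-style accessor -> nn.Embedding
    nn.init.normal_(emb.weight, mean=0.0, std=0.02)

def train_with_active_forgetting(model, data_loader, num_steps, k=1000, lr=3e-4):
    """Ordinary LM training loop, except the embeddings are reset every k steps."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    step = 0
    for input_ids, labels in data_loader:
        logits = model(input_ids)                       # assumed shape: (batch, seq, vocab)
        loss = loss_fn(logits.flatten(0, 1), labels.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()
        step += 1
        if step % k == 0:
            reset_embeddings(model)                     # the periodic "forgetting" step
        if step >= num_steps:
            break
    return model
```

A fuller implementation might also want to reset the optimizer state tied to the embedding parameters; the core idea, though, is just the periodic re-initialization shown above.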
Related papers
- Controllable Context Sensitivity and the Knob Behind It [53.70327066130381]
When making predictions, a language model must trade off how much it relies on its context vs. its prior knowledge.
We search for a knob which controls this sensitivity, determining whether language models answer from the context or their prior knowledge.
arXiv Detail & Related papers (2024-11-11T22:22:21Z)
- Exploring Efficient Foundational Multi-modal Models for Video Summarization [15.418001616659808]
Such video foundation models perform pre-training by aligning outputs from each modality-specific model into the same embedding space.
We propose a plug-and-play video language model that feeds texts generated from each input modality into the language model.
We compare the performance versus the computational costs for our plug-and-play style method and baseline tuning methods.
arXiv Detail & Related papers (2024-10-09T20:07:06Z)
- From Unstructured Data to In-Context Learning: Exploring What Tasks Can Be Learned and When [19.841163050181194]
Large language models (LLMs) like transformers demonstrate impressive in-context learning (ICL) capabilities.
We examine what enables ICL in models trained on unstructured data, focusing on critical sequence model requirements and training data structure.
We find that many ICL capabilities can emerge simply from co-occurrence of semantically related word pairs in unstructured data.
We identify two cases where ICL fails: one in logic reasoning tasks that require generalizing to new, unseen patterns, and another in analogy completion where relevant word pairs appear only in fixed training positions.
arXiv Detail & Related papers (2024-05-31T18:46:06Z)
- In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study ICL through the lens of a new family of model problems that we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z)
- Auto-ICL: In-Context Learning without Human Supervision [93.05202223767463]
We propose an Automatic In-Context Learning framework that enables the model to autonomously generate examples and instructions for problem-solving.
With experiments across various models and datasets, results show that model-generated contexts outperform human-annotated contexts.
arXiv Detail & Related papers (2023-11-15T07:37:28Z)
- EXnet: Efficient In-context Learning for Data-less Text classification [0.0]
We present EXnet, a model specifically designed to perform in-context learning without limitations on the number of examples.
We argue that in-context learning is an effective method to increase task accuracy, and providing examples facilitates cross-task generalization.
With extensive experiments, we show that even our smallest model (15M parameters) generalizes to several unseen classification tasks and domains.
arXiv Detail & Related papers (2023-05-24T01:40:57Z)
- Concept-aware Training Improves In-context Learning Ability of Language Models [0.0]
Many recent language models (LMs) of the Transformer family exhibit the so-called in-context learning (ICL) ability.
We propose a method to create LMs able to better utilize the in-context information.
We find that the data sampling of Concept-aware Training consistently improves models' reasoning ability.
arXiv Detail & Related papers (2023-05-23T07:44:52Z)
- Pre-Training to Learn in Context [138.0745138788142]
The ability to learn in context is not fully exploited because language models are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x as many parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models [58.42146641102329]
We develop a novel semi-parametric language model architecture, Knowledge-in-Context (KiC).
KiC empowers a parametric text-to-text language model with a knowledge-rich external memory.
As a knowledge-rich semi-parametric language model, KiC needs only a much smaller parametric component to achieve superior zero-shot performance on unseen tasks.
arXiv Detail & Related papers (2022-10-28T23:18:43Z)
- What Can Transformers Learn In-Context? A Case Study of Simple Function Classes [67.06980111346245]
In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples.
We show that standard Transformers can be trained from scratch to perform in-context learning of linear functions; a minimal sketch of this prompt setup appears after this list.
We also show that we can train Transformers to in-context learn more complex function classes with performance that matches or exceeds task-specific learning algorithms.
arXiv Detail & Related papers (2022-08-01T18:01:40Z)
- DeepStruct: Pretraining of Language Models for Structure Prediction [64.84144849119554]
We pretrain language models on a collection of task-agnostic corpora to generate structures from text.
Our structure pretraining enables zero-shot transfer of the knowledge that models learn about structure tasks.
We show that a 10B parameter language model transfers non-trivially to most tasks and obtains state-of-the-art performance on 21 of 28 datasets.
arXiv Detail & Related papers (2022-05-21T00:58:22Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- How Far are We from Effective Context Modeling? An Exploratory Study on Semantic Parsing in Context [59.13515950353125]
We present a grammar-based decoding semantic parser and adapt typical context modeling methods on top of it.
We evaluate 13 context modeling methods on two large cross-domain datasets, and our best model achieves state-of-the-art performances.
arXiv Detail & Related papers (2020-02-03T11:28:10Z)
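For the "What Can Transformers Learn In-Context?" entry above, the training setup can be illustrated with a small data-generation sketch: each prompt interleaves inputs x_i with labels w·x_i from a freshly sampled linear function, and the model is trained to predict the label of a final query input from context alone. The code below is a sketch under those assumptions; the dimensions, distributions, and names are placeholders rather than that paper's exact protocol.

```python
import torch

def sample_linear_regression_prompt(n_examples=16, dim=8, noise_std=0.0):
    """Build one in-context linear-regression prompt.

    A fresh weight vector w is drawn per prompt; the sequence interleaves
    inputs x_i and labels y_i = w @ x_i and ends with a query input whose
    label the model must infer from the in-context examples.
    """
    w = torch.randn(dim)                       # task-specific linear function
    xs = torch.randn(n_examples + 1, dim)      # n in-context examples + 1 query
    ys = xs @ w + noise_std * torch.randn(n_examples + 1)

    # Embed labels as vectors with the value in the first coordinate, then
    # interleave (x_1, y_1, ..., x_n, y_n, x_query) and drop the query's label.
    y_tokens = torch.zeros(n_examples + 1, dim)
    y_tokens[:, 0] = ys
    seq = torch.stack([xs, y_tokens], dim=1).reshape(-1, dim)[:-1]
    target = ys[-1]                            # the value the model should predict
    return seq, target

# Example: a small batch of prompts for training a from-scratch Transformer regressor.
batch = [sample_linear_regression_prompt() for _ in range(4)]
```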