Out-of-distribution generalization via composition: a lens through induction heads in Transformers
- URL: http://arxiv.org/abs/2408.09503v1
- Date: Sun, 18 Aug 2024 14:52:25 GMT
- Title: Out-of-distribution generalization via composition: a lens through induction heads in Transformers
- Authors: Jiajun Song, Zhuoyan Xu, Yiqiao Zhong
- Abstract summary: Large language models (LLMs) such as GPT-4 sometimes appear to be creative, solving novel tasks often with a few demonstrations in the prompt.
These tasks require the models to generalize to distributions that differ from the training data -- a problem known as out-of-distribution (OOD) generalization.
We examine OOD generalization in settings where instances are generated according to hidden rules, including in-context learning with symbolic reasoning.
- Score: 0.46085106405479537
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) such as GPT-4 sometimes appear to be creative, solving novel tasks often with a few demonstrations in the prompt. These tasks require the models to generalize to distributions that differ from the training data -- a problem known as out-of-distribution (OOD) generalization. Despite the tremendous success of LLMs, how they approach OOD generalization remains an open and underexplored question. We examine OOD generalization in settings where instances are generated according to hidden rules, including in-context learning with symbolic reasoning. Models are required to infer the hidden rules behind input prompts without any fine-tuning. We empirically examined the training dynamics of Transformers on a synthetic example and conducted extensive experiments on a variety of pretrained LLMs, focusing on a type of component known as induction heads. We found that OOD generalization and composition are tied together -- models can learn rules by composing two self-attention layers, thereby achieving OOD generalization. Furthermore, a shared latent subspace in the embedding (or feature) space acts as a bridge for composition by aligning early layers and later layers, which we refer to as the common bridge representation hypothesis.
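The two-layer composition the abstract describes can be made concrete with a minimal sketch. Below, the induction-head behavior is hard-coded in plain Python rather than learned as attention weights: a first "layer" records each position's predecessor, and a second "layer" attends back to positions whose predecessor matches the current token and copies their successor. The token values and the repeated-pattern prompt are illustrative assumptions, not the paper's data.

```python
def induction_head_predict(tokens):
    """Pattern completion in the style of an induction head: find the most
    recent earlier occurrence of the current token and copy its successor."""
    # "Layer 1": pair each position with the token that precedes it.
    prev_token = {i: tokens[i - 1] for i in range(1, len(tokens))}

    current = tokens[-1]
    # "Layer 2": scan for the latest earlier position whose *predecessor*
    # equals the current token, i.e. a position that completed this pattern.
    for i in range(len(tokens) - 2, 0, -1):
        if prev_token[i] == current:
            return tokens[i]  # copy the token that followed last time
    return None               # pattern never seen in the prompt

# In-context rule [A, B, C, A, B, C, A, ...]: the head predicts "B" even if
# this literal sequence never appeared in training.
print(induction_head_predict(["A", "B", "C", "A", "B", "C", "A"]))  # B
```

In the paper's terms, OOD generalization comes from this composition: neither step encodes a specific rule, but chained together they implement copy-by-pattern for arbitrary tokens.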
Related papers
- Rule Extrapolation in Language Models: A Study of Compositional Generalization on OOD Prompts [14.76420070558434]
Rule extrapolation describes OOD scenarios, where the prompt violates at least one rule.
We focus on formal languages, which are defined by the intersection of rules.
We lay the first stones of a normative theory of rule extrapolation, inspired by the Solomonoff prior in algorithmic information theory.
arXiv Detail & Related papers (2024-09-09T22:36:35Z)
- Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically [74.96551626420188]
Transformers trained on natural language data have been shown to learn the hierarchical structure of language and to generalize to sentences with unseen syntactic structures.
We investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge.
arXiv Detail & Related papers (2024-04-25T07:10:29Z)
- Towards out-of-distribution generalization in large-scale astronomical surveys: robust networks learn similar representations [3.653721769378018]
We use Centered Kernel Alignment (CKA), a similarity metric for neural network representations, to examine the relationship between representation similarity and performance.
We find that when models are robust to a distribution shift, they produce substantially similar representations across their layers on OOD data.
We discuss the potential of representation similarity for guiding model design and training strategy, and for mitigating the OOD problem by incorporating CKA as an inductive bias during training (a minimal CKA sketch follows this entry).
arXiv Detail & Related papers (2023-11-29T19:00:05Z)
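Since the entry above leans on CKA, here is a minimal sketch of linear CKA, computed from two activation matrices with one row per example. The random matrices are placeholder assumptions standing in for layer activations.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (n_examples, n_features) activation matrices."""
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    # HSIC-based form: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))     # e.g. activations from one layer
Y = X @ rng.normal(size=(64, 32))  # a linear transform of the same features
print(linear_cka(X, X))            # ~1.0: identical representations
print(linear_cka(X, Y))            # fairly high: Y is a linear function of X
```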
- How to Handle Different Types of Out-of-Distribution Scenarios in Computational Argumentation? A Comprehensive and Fine-Grained Field Study [59.13867562744973]
This work systematically assesses LMs' capabilities for out-of-distribution (OOD) scenarios.
We find that the efficacy of such learning paradigms varies with the type of OOD.
Specifically, while ICL excels for domain shifts, prompt-based fine-tuning surpasses it for topic shifts.
arXiv Detail & Related papers (2023-09-15T11:15:47Z)
- It Ain't That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models [6.065846799248359]
Large language models (LLMs) have achieved remarkable proficiency in solving diverse problems.
However, their generalization ability is not always satisfactory, and the problem is common to generative Transformer models in general.
We show that when models are trained on n-digit operations, they generalize successfully to unseen n-digit inputs but fail miserably on longer, unseen cases (a minimal length-split sketch follows this entry).
arXiv Detail & Related papers (2023-08-16T10:09:42Z)
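A quick way to see the failure mode described in the entry above is to split arithmetic data by operand length: an in-distribution test set keeps the training digit count with unseen pairs, while the OOD set uses longer operands. The prompt format below is an assumption, not the paper's exact protocol.

```python
import random

def addition_example(n_digits, rng):
    """One addition problem whose operands have exactly n_digits digits."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"{a}+{b}=", str(a + b)

rng = random.Random(0)
train    = [addition_example(3, rng) for _ in range(10_000)]  # 3-digit training
iid_test = [addition_example(3, rng) for _ in range(1_000)]   # unseen 3-digit pairs
ood_test = [addition_example(5, rng) for _ in range(1_000)]   # longer: where models fail

print(train[0])  # a ('abc+def=', 'sum') pair
```

Reporting accuracy separately on the two test sets makes the in-distribution/OOD gap directly measurable.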
- DIVERSIFY: A General Framework for Time Series Out-of-distribution Detection and Generalization [58.704753031608625]
Time series is one of the most challenging modalities in machine learning research.
OOD detection and generalization on time series tend to suffer due to their non-stationary nature.
We propose DIVERSIFY, a framework for OOD detection and generalization on dynamic distributions of time series.
arXiv Detail & Related papers (2023-08-04T12:27:11Z)
- On a Built-in Conflict between Deep Learning and Systematic Generalization [2.588973722689844]
Internal function sharing is one factor that weakens out-of-distribution (o.o.d.) or systematic generalization in deep learning.
We show such phenomena in standard deep learning models, such as fully connected, convolutional, residual networks, LSTMs, and (Vision) Transformers.
arXiv Detail & Related papers (2022-08-24T16:06:36Z)
- Generalization in Multimodal Language Learning from Simulation [20.751952728808153]
We investigate the influence of the underlying training data distribution on generalization in a minimal LSTM-based network trained in a supervised, time-continuous setting.
We find that compositional generalization fails in simple setups but improves with the number of objects and actions, and particularly with greater color overlap between objects.
arXiv Detail & Related papers (2021-08-03T12:55:18Z)
- Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization [93.8373619657239]
Neural networks trained with SGD were recently shown to rely preferentially on linearly-predictive features.
This simplicity bias can explain their lack of robustness out of distribution (OOD).
We demonstrate that the simplicity bias can be mitigated and OOD generalization improved (a toy diversity-penalty sketch follows this entry).
arXiv Detail & Related papers (2021-05-12T12:12:24Z)
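The entry above argues that escaping the simplicity bias requires discovering diverse solutions. Below is a minimal sketch of one generic way to do that: train two models jointly and penalize correlated predictions so they cannot both latch onto the same (simplest) feature. This is an illustrative diversity term, not necessarily the paper's exact regularizer, and the data is toy.

```python
import torch

torch.manual_seed(0)
x = torch.randn(256, 10)   # toy inputs with 10 candidate features
y = (x[:, 0] > 0).float()  # labels carried by feature 0

model_a = torch.nn.Linear(10, 1)
model_b = torch.nn.Linear(10, 1)
opt = torch.optim.Adam([*model_a.parameters(), *model_b.parameters()], lr=1e-2)
bce = torch.nn.BCEWithLogitsLoss()

for _ in range(200):
    pa = model_a(x).squeeze(1)
    pb = model_b(x).squeeze(1)
    task_loss = bce(pa, y) + bce(pb, y)
    # Diversity term: penalize squared correlation between the two models'
    # logits, pushing them toward different predictive features.
    agreement = torch.corrcoef(torch.stack([pa, pb]))[0, 1] ** 2
    loss = task_loss + 0.5 * agreement
    opt.zero_grad()
    loss.backward()
    opt.step()
```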
- Improving Compositional Generalization in Semantic Parsing [54.4720965813889]
Generalization of models to out-of-distribution (OOD) data has captured tremendous attention recently.
We investigate this in semantic parsing, a natural test-bed for compositional generalization.
arXiv Detail & Related papers (2020-10-12T12:34:58Z)
- MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z)