It Ain't That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models
- URL: http://arxiv.org/abs/2308.08268v2
- Date: Thu, 4 Jul 2024 06:32:57 GMT
- Title: It Ain't That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models
- Authors: Xingcheng Xu, Zihao Pan, Haipeng Zhang, Yanqing Yang
- Abstract summary: Large language models (LLMs) have achieved remarkable proficiency in solving diverse problems.
However, their generalization ability is not always satisfactory, and the generalization problem is common to generative transformer models in general.
We show that when models are trained on n-digit operations, they generalize successfully to unseen n-digit inputs but fail miserably on longer, unseen cases.
- Score: 6.065846799248359
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have achieved remarkable proficiency in solving diverse problems. However, their generalization ability is not always satisfactory, and the generalization problem is common to generative transformer models in general. Researchers take basic mathematical tasks such as n-digit addition or multiplication as important test beds for investigating generalization behavior. It is observed that when models are trained on n-digit operations (e.g., additions) in which both input operands are n digits in length, they generalize successfully on unseen n-digit inputs (in-distribution (ID) generalization), but fail miserably on longer, unseen cases (out-of-distribution (OOD) generalization). We bring this unexplained performance drop to attention and ask whether there is any systematic OOD generalization. Towards understanding LLMs, we train various smaller language models that may share the same underlying mechanism. We discover that the strong ID generalization stems from structured representations, while behind the unsatisfying OOD performance, the models still exhibit clear learned algebraic structures. Specifically, these models map unseen OOD inputs to outputs with equivalence relations learned in the ID domain, which we call equivalence generalization. These findings deepen our knowledge of the generalizability of generative models, including LLMs, and provide insights into potential avenues for improvement.
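To make the experimental setup concrete, the following is a minimal sketch (not the authors' released code) of how ID and OOD splits for the n-digit addition task might be generated; the digit counts, prompt format, and function names are illustrative assumptions.

```python
import random

def make_addition_examples(num_digits: int, count: int, seed: int = 0):
    """Generate `count` addition problems where both operands have exactly
    `num_digits` digits, as (prompt, answer) string pairs."""
    rng = random.Random(seed)
    lo, hi = 10 ** (num_digits - 1), 10 ** num_digits - 1
    examples = []
    for _ in range(count):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        examples.append((f"{a}+{b}=", str(a + b)))
    return examples

# ID setting: train and test on 3-digit operands (same length, pairs drawn
# independently; deduplication against the training set is omitted for brevity).
train_id = make_addition_examples(num_digits=3, count=10_000, seed=0)
test_id  = make_addition_examples(num_digits=3, count=1_000,  seed=1)

# OOD setting: evaluate on longer, 4-digit operands never seen in training.
test_ood = make_addition_examples(num_digits=4, count=1_000, seed=2)

print(train_id[0])   # e.g. ('352+614=', '966')
print(test_ood[0])   # e.g. ('4821+9305=', '14126')
```

In the paper's terms, a model fit on data like `train_id` typically scores near-perfectly on `test_id` but collapses on `test_ood`; the equivalence-generalization finding is that those OOD errors are nonetheless structured by relations learned in the ID domain, not random.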
Related papers
- Out-of-distribution generalization via composition: a lens through induction heads in Transformers [0.46085106405479537]
Large language models (LLMs) such as GPT-4 sometimes appear to be creative, often solving novel tasks with only a few demonstrations in the prompt.
These tasks require the models to generalize to distributions that differ from the training data -- which is known as out-of-distribution (OOD) generalization.
We examine OOD generalization in settings where instances are generated according to hidden rules, including in-context learning with symbolic reasoning.
arXiv Detail & Related papers (2024-08-18T14:52:25Z)
- Learning Divergence Fields for Shift-Robust Graph Representations [73.11818515795761]
In this work, we propose a geometric diffusion model with learnable divergence fields for the challenging problem of generalization over interdependent data.
We derive a new learning objective through causal inference, which can guide the model to learn generalizable patterns of interdependence that are insensitive to domain shifts.
arXiv Detail & Related papers (2024-06-07T14:29:21Z)
- Unveiling the Generalization Power of Fine-Tuned Large Language Models [81.70754292058258]
We investigate whether fine-tuning affects the generalization ability intrinsic to large language models (LLMs).
Our main findings reveal that models fine-tuned on generation and classification tasks exhibit dissimilar behaviors in generalizing to different domains and tasks.
We observe that integrating an in-context learning strategy during fine-tuning on generation tasks can enhance the model's generalization ability (see the sketch after this entry).
arXiv Detail & Related papers (2024-03-14T08:18:59Z)
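As a rough illustration of that strategy (a sketch under stated assumptions, not the paper's pipeline), one can prepend a few solved demonstrations to every fine-tuning instance so the model is tuned in the same few-shot format it sees at inference; the field names and prompt layout below are assumptions.

```python
def build_icl_finetuning_example(demos, query, answer, sep="\n\n"):
    """Format one fine-tuning instance so that k in-context
    demonstrations precede the actual (query, answer) pair."""
    demo_text = sep.join(f"Input: {q}\nOutput: {a}" for q, a in demos)
    prompt = f"{demo_text}{sep}Input: {query}\nOutput:"
    return {"prompt": prompt, "completion": f" {answer}"}

# Two illustrative demonstrations plus the target pair.
demos = [("12+34=", "46"), ("205+790=", "995")]
example = build_icl_finetuning_example(demos, "617+248=", "865")
print(example["prompt"])
```

Training on examples formatted this way keeps the fine-tuning distribution aligned with the few-shot prompts used at test time, which is one plausible reading of why the strategy helps generalization.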
- Generalization Through the Lens of Learning Dynamics [11.009483845261958]
A machine learning (ML) system must learn to generalize to novel situations in order to yield accurate predictions at deployment.
The impressive generalization performance of deep neural networks has stymied theoreticians.
This thesis will study the learning dynamics of deep neural networks in both supervised and reinforcement learning tasks.
arXiv Detail & Related papers (2022-12-11T00:07:24Z)
- On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models on semantic parsing tasks with in-context learning.
We evaluate four model families (OPT, BLOOM, CodeGen, and Codex) on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z)
- Exploring Length Generalization in Large Language Models [46.417433724786854]
The ability to extrapolate from short problem instances to longer ones is an important form of out-of-distribution generalization in reasoning tasks.
We show that naively fine-tuning transformers on length-generalization tasks exhibits significant generalization deficiencies independent of model scale.
We then show that combining pretrained large language models' in-context learning abilities with scratchpad prompting results in a dramatic improvement in length generalization (see the sketch after this entry).
arXiv Detail & Related papers (2022-07-11T14:24:38Z)
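To illustrate what scratchpad prompting looks like for addition, here is a hedged sketch that renders the digit-by-digit intermediate trace a scratchpad prompt would elicit before the final answer; the exact trace format used in the paper may differ.

```python
def addition_scratchpad(a: int, b: int) -> str:
    """Render a + b as explicit digit-by-digit steps with carries,
    the kind of intermediate trace a scratchpad prompt elicits."""
    steps, carry = [], 0
    xs, ys = str(a)[::-1], str(b)[::-1]  # process least-significant digit first
    for i in range(max(len(xs), len(ys))):
        da = int(xs[i]) if i < len(xs) else 0
        db = int(ys[i]) if i < len(ys) else 0
        total = da + db + carry
        steps.append(f"digit {i}: {da}+{db}+carry {carry} = {total} -> write {total % 10}")
        carry = total // 10
    if carry:
        steps.append(f"final carry -> write {carry}")
    return "\n".join(steps) + f"\nanswer: {a + b}"

print(addition_scratchpad(4821, 9305))
```

Because each step depends only on two digits and a carry, the trace has the same local structure at any length, which is one intuition for why scratchpads aid length generalization.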
- Towards a Theoretical Framework of Out-of-Distribution Generalization [28.490842160921805]
Generalization to out-of-distribution (OOD) data, or domain generalization, is one of the central problems in modern machine learning.
In this work, we take the first step towards rigorous and quantitative definitions of what OOD is, and of what it means to say that an OOD problem is learnable.
arXiv Detail & Related papers (2021-06-08T16:32:23Z)
- Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization [93.8373619657239]
Neural networks trained with SGD were recently shown to rely preferentially on linearly predictive features.
This simplicity bias can explain their lack of robustness out of distribution (OOD).
We demonstrate that the simplicity bias can be mitigated and OOD generalization improved (see the sketch after this entry).
arXiv Detail & Related papers (2021-05-12T12:12:24Z)
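One concrete way to realize that idea is to train an ensemble whose members are penalized for agreeing on out-of-distribution inputs. The sketch below is a minimal rendering of such a diversity objective, not necessarily the loss used in the paper; all names and the choice of agreement measure are assumptions.

```python
import torch
import torch.nn.functional as F

def diverse_ensemble_loss(train_logits, labels, ood_logits, alpha=0.1):
    """Sum of each member's classification loss, plus a penalty on
    pairwise agreement of their predictions on unlabeled OOD inputs."""
    fit = sum(F.cross_entropy(lg, labels) for lg in train_logits)
    probs = [F.softmax(lg, dim=-1) for lg in ood_logits]
    agree = torch.zeros(())
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            # Large inner product => members i and j predict alike on OOD data.
            agree = agree + (probs[i] * probs[j]).sum(dim=-1).mean()
    return fit + alpha * agree

# Toy check: 3 ensemble members, batch of 4, binary classification.
labels = torch.tensor([0, 1, 0, 1])
train_logits = [torch.randn(4, 2, requires_grad=True) for _ in range(3)]
ood_logits = [torch.randn(4, 2, requires_grad=True) for _ in range(3)]
loss = diverse_ensemble_loss(train_logits, labels, ood_logits)
loss.backward()
```

Minimizing the agreement term nudges members to rely on different predictive features, so at least one member may avoid the spurious, linearly predictive shortcut.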
- Improving Compositional Generalization in Semantic Parsing [54.4720965813889]
Generalization of models to out-of-distribution (OOD) data has attracted tremendous attention recently.
We investigate compositional generalization in semantic parsing, a natural test bed for this kind of generalization.
arXiv Detail & Related papers (2020-10-12T12:34:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.