Exploring Length Generalization in Large Language Models
- URL: http://arxiv.org/abs/2207.04901v1
- Date: Mon, 11 Jul 2022 14:24:38 GMT
- Title: Exploring Length Generalization in Large Language Models
- Authors: Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra,
Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur
- Abstract summary: The ability to extrapolate from short problem instances to longer ones is an important form of out-of-distribution generalization in reasoning tasks.
We show that naively finetuning transformers on length generalization tasks shows significant generalization deficiencies independent of model scale.
We then show that combining pretrained large language models' in-context learning abilities with scratchpad prompting results in a dramatic improvement in length generalization.
- Score: 46.417433724786854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to extrapolate from short problem instances to longer ones is an
important form of out-of-distribution generalization in reasoning tasks, and is
crucial when learning from datasets where longer problem instances are rare.
These include theorem proving, solving quantitative mathematics problems, and
reading/summarizing novels. In this paper, we run careful empirical studies
exploring the length generalization capabilities of transformer-based language
models. We first establish that naively finetuning transformers on length
generalization tasks shows significant generalization deficiencies independent
of model scale. We then show that combining pretrained large language models'
in-context learning abilities with scratchpad prompting (asking the model to
output solution steps before producing an answer) results in a dramatic
improvement in length generalization. We run careful failure analyses on each
of the learning modalities and identify common sources of mistakes that
highlight opportunities in equipping language models with the ability to
generalize to longer problems.
Related papers
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
Large language models (LLMs) have sparked debate over whether they genuinely generalize to unseen tasks or rely on memorizing vast amounts of pretraining data.
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the LLM output probabilities and the pretraining data frequency.
This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z) - Towards Understanding the Relationship between In-context Learning and Compositional Generalization [7.843029855730508]
We train a causal Transformer in a setting that renders ordinary learning very difficult.
The model can solve the task, however, by utilizing earlier examples to generalize to later ones.
In evaluations on the datasets, SCAN, COGS, and GeoQuery, models trained in this manner indeed show improved compositional generalization.
arXiv Detail & Related papers (2024-03-18T14:45:52Z) - Unveiling the Generalization Power of Fine-Tuned Large Language Models [81.70754292058258]
We investigate whether fine-tuning affects the intrinsic generalization ability intrinsic to Large Language Models (LLMs)
Our main findings reveal that models fine-tuned on generation and classification tasks exhibit dissimilar behaviors in generalizing to different domains and tasks.
We observe that integrating the in-context learning strategy during fine-tuning on generation tasks can enhance the model's generalization ability.
arXiv Detail & Related papers (2024-03-14T08:18:59Z) - It Ain't That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models [6.065846799248359]
Large language models (LLMs) have achieved remarkable proficiency on solving diverse problems.
However, their generalization ability is not always satisfying and the generalization problem is common for generative transformer models in general.
We show that when training models on n-digit operations, models generalize successfully on unseen n-digit inputs, but fail miserably on longer, unseen cases.
arXiv Detail & Related papers (2023-08-16T10:09:42Z) - Language Model Behavior: A Comprehensive Survey [5.663056267168211]
We discuss over 250 recent studies of English language model behavior before task-specific fine-tuning.
Despite dramatic increases in generated text quality as models scale to hundreds of billions of parameters, the models are still prone to unfactual responses, commonsense errors, memorized text, and social biases.
arXiv Detail & Related papers (2023-03-20T23:54:26Z) - Lexical Generalization Improves with Larger Models and Longer Training [42.024050065980845]
We analyze the use of lexical overlaps in natural language inference, paraphrase detection, and reading comprehension.
We find that larger models are much less susceptible to adopting lexical overlaps.
arXiv Detail & Related papers (2022-10-23T09:20:11Z) - Chain of Thought Prompting Elicits Reasoning in Large Language Models [56.811278668446825]
This paper explores the ability of language models to generate a coherent chain of thought.
Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks.
arXiv Detail & Related papers (2022-01-28T02:33:07Z) - Exploring Strategies for Generalizable Commonsense Reasoning with
Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z) - Learning to Generalize for Sequential Decision Making [19.075378799280728]
We introduce a teacher-student imitation learning methodology and a means of converting a reinforcement learning model into a natural language understanding model.
We show that models can learn faster and generalize more, leveraging both the imitation learning and the reformulation.
arXiv Detail & Related papers (2020-10-05T18:00:03Z) - Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.