Related papers: Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity

URL: http://arxiv.org/abs/2406.11721v1
Date: Mon, 17 Jun 2024 16:40:21 GMT
Title: Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity
Authors: Bingxiang He, Ning Ding, Cheng Qian, Jia Deng, Ganqu Cui, Lifan Yuan, Huan-ang Gao, Huimin Chen, Zhiyuan Liu, Maosong Sun
Abstract summary: We show that zero-shot generalization during instruction tuning happens very early. We also show that encountering highly similar and fine-grained training data earlier during instruction tuning, without the constraints of defined "tasks", enables better generalization. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level.
Score: 84.12126298229866
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding alignment techniques begins with comprehending zero-shot generalization brought by instruction tuning, but little of the mechanism has been understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. This line of research has been limited to examining transfer between tasks from a task-pair perspective, with few studies focusing on understanding zero-shot generalization from the perspective of the data itself. To bridge this gap, we first demonstrate through multiple metrics that zero-shot generalization during instruction tuning happens very early. Next, we investigate the facilitation of zero-shot generalization from both data similarity and granularity perspectives, confirming that encountering highly similar and fine-grained training data earlier during instruction tuning, without the constraints of defined "tasks", enables better generalization. Finally, we propose a more grounded training data arrangement method, Test-centric Multi-turn Arrangement, and show its effectiveness in promoting continual learning and further loss reduction. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level. We hope our analysis will advance the understanding of zero-shot generalization during instruction tuning and contribute to the development of more aligned LLMs. Our code is released at https://github.com/HBX-hbx/dynamics_of_zero-shot_generalization.

Related papers

Generalist++: A Meta-learning Framework for Mitigating Trade-off in Adversarial Training [105.74524789405514]
Adversarial training (AT) is currently the most effective defense for neural networks against adversarial examples. We propose to partition the overall generalization goal into multiple sub-tasks, each assigned to a dedicated base learner. In the later stages of training, we interpolate their parameters to form a knowledgeable global learner. We term this framework Generalist and introduce three variants tailored to different application scenarios.
arXiv Detail & Related papers (2025-10-15T09:47:54Z)
Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test [19.213961869113188]
We conduct the first study of grokking on checkpoints during one-pass pretraining of a 7B large language model (LLM), i.e., OLMoE. Our study, for the first time, verifies that grokking still happens in the pretraining of large-scale foundation models. We develop two novel metrics to quantify pathway distance and the complexity of a single pathway.
arXiv Detail & Related papers (2025-06-26T17:59:58Z)
Generalization Capability for Imitation Learning [1.30536490219656]
Imitation learning holds the promise of equipping robots with versatile skills by learning from expert demonstrations. However, policies trained on finite datasets often struggle to generalize beyond the training distribution. We present a unified perspective on the generalization capability of imitation learning, grounded in both information theory and data distribution properties.
arXiv Detail & Related papers (2025-04-25T17:59:59Z)
Towards Robust Few-shot Class Incremental Learning in Audio Classification using Contrastive Representation [1.3586572110652484]
Few-shot class-incremental learning addresses challenges arising from limited incoming data. We propose supervised contrastive learning to refine the representation space, enhancing discriminative power and leading to better generalization.
arXiv Detail & Related papers (2024-07-27T14:16:25Z)
Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models [79.28821338925947]
Domain-Class Incremental Learning is a realistic but challenging continual learning scenario. To handle these diverse tasks, pre-trained Vision-Language Models (VLMs) are introduced for their strong generalizability. This incurs a new problem: the knowledge encoded in the pre-trained VLMs may be disturbed when adapting to new tasks, compromising their inherent zero-shot ability. Existing methods tackle it by tuning VLMs with knowledge distillation on extra datasets, which demands heavy overhead. We propose the Distribution-aware Interference-free Knowledge Integration (DIKI) framework, retaining pre-trained knowledge of
arXiv Detail & Related papers (2024-07-07T12:19:37Z)
Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation [14.225723195634941]
We propose a novel approach to prompt learning based on unsupervised knowledge distillation from more powerful models. Our approach, which we call Knowledge Distillation Prompt Learning (KDPL), can be integrated into existing prompt learning techniques.
arXiv Detail & Related papers (2024-07-03T12:24:40Z)
On the Generalization Ability of Unsupervised Pretraining [53.06175754026037]
Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization. This paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase. Our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.
arXiv Detail & Related papers (2024-03-11T16:23:42Z)
Closing the Gap between TD Learning and Supervised Learning -- A Generalisation Point of View [51.30152184507165]
Some reinforcement learning (RL) algorithms can stitch pieces of experience to solve a task never seen before during training. This oft-sought property is one of the few ways in which RL methods based on dynamic programming differ from RL methods based on supervised learning (SL). It remains unclear whether those methods forgo this important stitching property.
arXiv Detail & Related papers (2024-01-20T14:23:25Z)
Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality [55.88910947643436]
Self-supervised pre-training is essential for handling vast quantities of unlabeled data in practice. HiDe-Prompt is an innovative approach that explicitly optimizes the hierarchical components with an ensemble of task-specific prompts and statistics. Our experiments demonstrate the superior performance of HiDe-Prompt and its robustness to pre-training paradigms in continual learning.
arXiv Detail & Related papers (2023-10-11T06:51:46Z)
Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization. We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences.
arXiv Detail & Related papers (2023-08-23T12:36:57Z)
Self-regulating Prompts: Foundational Model Adaptation without Forgetting [112.66832145320434]
We introduce a self-regularization framework for prompting called PromptSRC. PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations.
arXiv Detail & Related papers (2023-07-13T17:59:35Z)
Leveraging Time Irreversibility with Order-Contrastive Pre-training [3.1848820580333737]
We explore an "order-contrastive" method for self-supervised pre-training on longitudinal data. We prove a finite-sample guarantee for the downstream error of a representation learned with order-contrastive pre-training. Our results indicate that pre-training methods designed for particular classes of distributions and downstream tasks can improve the performance of self-supervised learning.
arXiv Detail & Related papers (2021-11-04T02:56:52Z)
Towards the Generalization of Contrastive Self-Supervised Learning [11.889992921445849]
We present a theoretical explanation of how contrastive self-supervised pre-trained models generalize to downstream tasks. We further explore SimCLR and Barlow Twins, which are two canonical contrastive self-supervised methods.
arXiv Detail & Related papers (2021-11-01T07:39:38Z)
Explaining generalization in deep learning: progress and fundamental limits [8.299945169799795]
In the first part of the thesis, we will empirically study how training deep networks via gradient descent implicitly controls the networks' capacity. We will then derive data-dependent uniform-convergence-based generalization bounds with improved dependencies on the parameter count. In the last part of the thesis, we will introduce an empirical technique to estimate generalization using unlabeled data.
arXiv Detail & Related papers (2021-10-17T21:17:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.