The Right Time Matters: Data Arrangement Affects Zero-Shot Generalization in Instruction Tuning
- URL: http://arxiv.org/abs/2406.11721v2
- Date: Mon, 07 Apr 2025 14:21:36 GMT
- Title: The Right Time Matters: Data Arrangement Affects Zero-Shot Generalization in Instruction Tuning
- Authors: Bingxiang He, Ning Ding, Cheng Qian, Jia Deng, Ganqu Cui, Lifan Yuan, Haiwen Hong, Huan-ang Gao, Longtao Huang, Hui Xue, Huimin Chen, Zhiyuan Liu, Maosong Sun
- Abstract summary: We show that zero-shot generalization happens very early during instruction tuning. We propose a more grounded training data arrangement framework, Test-centric Multi-turn Arrangement.
- Score: 86.19804569376333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding alignment techniques begins with comprehending the zero-shot generalization brought by instruction tuning, yet little of the underlying mechanism is understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, consist merely of tokens and representations. To bridge this gap, we investigate zero-shot generalization from the perspective of the data itself. We first demonstrate that zero-shot generalization happens very early during instruction tuning, with loss serving as a stable indicator. Next, we investigate training data arrangement from the perspectives of similarity and granularity, confirming that the timing of exposure to certain training examples can greatly facilitate generalization on unseen tasks. Finally, we propose a more grounded training data arrangement framework, Test-centric Multi-turn Arrangement, and show its effectiveness in promoting continual learning and further loss reduction. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level. Our code is released at https://github.com/thunlp/Dynamics-of-Zero-Shot-Generalization.
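The abstract's instance-level, similarity-based view suggests a simple way to picture a test-centric data arrangement: score each training example by its similarity to the test data and schedule the most test-similar examples into earlier turns. The sketch below is only an illustration of that idea under assumed choices (a sentence-transformers encoder, max-cosine scoring, equal-size turns); it is not the authors' implementation, which is available in the released repository.

```python
# Illustrative sketch (not the authors' implementation) of a test-centric,
# similarity-based arrangement of instruction-tuning data: training examples
# most similar to the test set are scheduled into earlier "turns".
# The embedding model, round size, and all names are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

def arrange_test_centric(train_texts, test_texts, turns=4,
                         model_name="all-MiniLM-L6-v2"):
    """Return training indices grouped into `turns` rounds, ordered so that
    examples closest to the test data (by cosine similarity) come first."""
    encoder = SentenceTransformer(model_name)
    train_emb = encoder.encode(train_texts, normalize_embeddings=True)
    test_emb = encoder.encode(test_texts, normalize_embeddings=True)

    # Instance-level similarity: for each training example, take its maximum
    # cosine similarity to any test instance.
    sims = (train_emb @ test_emb.T).max(axis=1)

    # Sort training examples from most to least test-similar, then split the
    # ordering into consecutive turns (earlier turns = more test-similar data).
    order = np.argsort(-sims)
    return np.array_split(order, turns)

# Example usage:
# rounds = arrange_test_centric(train_prompts, test_prompts, turns=4)
# for r, idx in enumerate(rounds):
#     print(f"turn {r}: {len(idx)} examples")
```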
Related papers
- Generalization Capability for Imitation Learning [1.30536490219656]
Imitation learning holds the promise of equipping robots with versatile skills by learning from expert demonstrations.
However, policies trained on finite datasets often struggle to generalize beyond the training distribution.
We present a unified perspective on the generalization capability of imitation learning, grounded in both information theory and data distribution properties.
arXiv Detail & Related papers (2025-04-25T17:59:59Z) - Towards Robust Few-shot Class Incremental Learning in Audio Classification using Contrastive Representation [1.3586572110652484]
Few-shot class-incremental learning addresses challenges arising from limited incoming data.
We propose supervised contrastive learning to refine the representation space, enhancing discriminative power and leading to better generalization.
arXiv Detail & Related papers (2024-07-27T14:16:25Z) - Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models [79.28821338925947]
Domain-Class Incremental Learning is a realistic but challenging continual learning scenario.
To handle these diverse tasks, pre-trained Vision-Language Models (VLMs) are introduced for their strong generalizability.
This incurs a new problem: the knowledge encoded in the pre-trained VLMs may be disturbed when adapting to new tasks, compromising their inherent zero-shot ability.
Existing methods tackle it by tuning VLMs with knowledge distillation on extra datasets, which demands heavy overhead.
We propose the Distribution-aware Interference-free Knowledge Integration (DIKI) framework, retaining the pre-trained knowledge of VLMs.
arXiv Detail & Related papers (2024-07-07T12:19:37Z) - Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation [14.225723195634941]
We propose a novel approach to prompt learning based on unsupervised knowledge distillation from more powerful models.
Our approach, which we call Knowledge Distillation Prompt Learning (KDPL), can be integrated into existing prompt learning techniques.
arXiv Detail & Related papers (2024-07-03T12:24:40Z) - On the Generalization Ability of Unsupervised Pretraining [53.06175754026037]
Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization.
This paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase.
Our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.
arXiv Detail & Related papers (2024-03-11T16:23:42Z) - Closing the Gap between TD Learning and Supervised Learning -- A Generalisation Point of View [51.30152184507165]
Some reinforcement learning (RL) algorithms can stitch pieces of experience to solve a task never seen before during training.
This oft-sought property is one of the few ways in which RL methods based on dynamic programming differ from RL methods based on supervised learning (SL).
It remains unclear whether the SL-based methods forgo this important stitching property.
arXiv Detail & Related papers (2024-01-20T14:23:25Z) - Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality [55.88910947643436]
Self-supervised pre-training is essential for handling vast quantities of unlabeled data in practice.
HiDe-Prompt is an innovative approach that explicitly optimizes the hierarchical components with an ensemble of task-specific prompts and statistics.
Our experiments demonstrate the superior performance of HiDe-Prompt and its robustness to pre-training paradigms in continual learning.
arXiv Detail & Related papers (2023-10-11T06:51:46Z) - Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions to after the input sentences (see the prompt-construction sketch after this list).
arXiv Detail & Related papers (2023-08-23T12:36:57Z) - Self-regulating Prompts: Foundational Model Adaptation without Forgetting [112.66832145320434]
We introduce a self-regularization framework for prompting called PromptSRC.
PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations.
arXiv Detail & Related papers (2023-07-13T17:59:35Z) - Leveraging Time Irreversibility with Order-Contrastive Pre-training [3.1848820580333737]
We explore an "order-contrastive" method for self-supervised pre-training on longitudinal data.
We prove a finite-sample guarantee for the downstream error of a representation learned with order-contrastive pre-training.
Our results indicate that pre-training methods designed for particular classes of distributions and downstream tasks can improve the performance of self-supervised learning.
arXiv Detail & Related papers (2021-11-04T02:56:52Z) - Towards the Generalization of Contrastive Self-Supervised Learning [11.889992921445849]
We present a theoretical explanation of how contrastive self-supervised pre-trained models generalize to downstream tasks.
We further explore SimCLR and Barlow Twins, which are two canonical contrastive self-supervised methods.
arXiv Detail & Related papers (2021-11-01T07:39:38Z) - Explaining generalization in deep learning: progress and fundamental limits [8.299945169799795]
In the first part of the thesis, we will empirically study how training deep networks via gradient descent implicitly controls the networks' capacity.
We will then derive data-dependent, uniform-convergence-based generalization bounds with improved dependencies on the parameter count.
In the last part of the thesis, we will introduce an empirical technique to estimate generalization using unlabeled data.
arXiv Detail & Related papers (2021-10-17T21:17:30Z)
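As a concrete illustration of the instruction-position idea summarized in the "Instruction Position Matters in Sequence Generation with Large Language Models" entry above, the following minimal sketch builds the same prompt with the task instruction placed either before or after the input sentence. The template wording and function name are assumptions for illustration, not the paper's exact prompt format.

```python
# Minimal sketch of the instruction-position idea: place the task instruction
# after the input rather than before it. The template wording below is an
# illustrative assumption, not the cited paper's exact prompt format.

def build_prompt(instruction: str, source: str, post_instruction: bool = True) -> str:
    """Compose a conditional-generation prompt with the instruction either
    before (conventional) or after (post-instruction) the input text."""
    if post_instruction:
        return f"{source}\n\n{instruction}\n"
    return f"{instruction}\n\n{source}\n"

# Example: the same translation request in both arrangements.
pre = build_prompt("Translate the source sentence into German.",
                   "The weather is nice today.", post_instruction=False)
post = build_prompt("Translate the source sentence into German.",
                    "The weather is nice today.", post_instruction=True)
```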