LoBaSS: Gauging Learnability in Supervised Fine-tuning Data
- URL: http://arxiv.org/abs/2310.13008v1
- Date: Mon, 16 Oct 2023 07:26:24 GMT
- Title: LoBaSS: Gauging Learnability in Supervised Fine-tuning Data
- Authors: Haotian Zhou, Tingkai Liu, Qianli Ma, Jianbo Yuan, Pengfei Liu, Yang
You and Hongxia Yang
- Abstract summary: Supervised Fine-Tuning (SFT) serves as a crucial phase in aligning Large Language Models (LLMs) to specific task prerequisites.
We introduce a new dimension in SFT data selection: learnability.
We present the Loss Based SFT Data Selection (LoBaSS) method, utilizing data learnability as the principal criterion for the selection SFT data.
- Score: 64.27898739929734
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Supervised Fine-Tuning (SFT) serves as a crucial phase in aligning Large
Language Models (LLMs) to specific task prerequisites. The selection of
fine-tuning data profoundly influences the model's performance, whose principle
is traditionally grounded in data quality and distribution. In this paper, we
introduce a new dimension in SFT data selection: learnability. This new
dimension is motivated by the intuition that SFT unlocks capabilities acquired
by a LLM during the pretraining phase. Given that different pretrained models
have disparate capabilities, the SFT data appropriate for one may not suit
another. Thus, we introduce the term learnability to define the suitability of
data for effective learning by the model. We present the Loss Based SFT Data
Selection (LoBaSS) method, utilizing data learnability as the principal
criterion for the selection SFT data. This method provides a nuanced approach,
allowing the alignment of data selection with inherent model capabilities,
ensuring optimal compatibility and learning efficiency. In experimental
comparisons involving 7B and 13B models, our LoBaSS method is able to surpass
full-data fine-tuning at merely 6% of the total training data. When employing
16.7% of the data, LoBaSS harmonizes the model's capabilities across
conversational and mathematical domains, proving its efficacy and adaptability.
Related papers
- Multimodal Preference Data Synthetic Alignment with Reward Model [23.978820500281213]
We propose a new framework in generating synthetic data using a reward model as a proxy of human preference for effective multimodal alignment with DPO training.
Experiment results indicate that integrating selected synthetic data, such as from generative and rewards models can effectively reduce reliance on human-annotated data.
arXiv Detail & Related papers (2024-12-23T09:29:40Z) - Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z) - Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
We propose a simple, effective, and data-efficient method called Step-DPO.
Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically.
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
arXiv Detail & Related papers (2024-06-26T17:43:06Z) - Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is leveraging the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive [15.066029556877721]
We show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples.
We design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode.
Surprisingly, we find that DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks.
arXiv Detail & Related papers (2024-02-20T18:42:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.