Labeling supervised fine-tuning data with the scaling law
- URL: http://arxiv.org/abs/2405.02817v2
- Date: Fri, 16 Aug 2024 05:52:17 GMT
- Title: Labeling supervised fine-tuning data with the scaling law
- Authors: Huanjun Kong
- Abstract summary: This paper introduces a multi-stage manual annotation process calibrated by the scaling law, offering a high-quality Supervised Fine-Tuning data acquisition method.
We preprocessed 58k authentic chat records and manually annotated 2.3k questions.
We fine-tuned Qwen models ranging from 0.5B to 32B parameters; the optimal version improved the F1 score by 29.07 points.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces a multi-stage manual annotation process calibrated by the scaling law, offering a high-quality Supervised Fine-Tuning data acquisition method for resource-constrained environments: scarce GPUs, limited GPT access, and tight funding. We preprocessed 58k authentic chat records and manually annotated 2.3k questions. We then fine-tuned Qwen models ranging from 0.5B to 32B parameters; the optimal version improved the F1 score by 29.07 points. This confirms the viability of fine-tuning Large Language Models (LLMs) for downstream Natural Language Processing (NLP) tasks. Our contributions are: 1) Supervised Fine-Tuning (SFT) training data in alpaca format, along with a set of Low-Rank Adaptation (LoRA) weights, and 2) a method for acquiring high-quality data that leverages the scaling law principle. The script, raw data in alpaca format, and experiment tracking are open-sourced on GitHub (https://github.com/InternLM/HuixiangDou/tree/main/web/tools), HuggingFace (https://huggingface.co/tpoisonooo) and WandB (https://wandb.ai/tpoisonooo/huixiangdou-cr/table?nw=nwusertpoisonooo). The privacy of the data involved has been authorized by users. The SFT data and its license come from the ncnn contributors group.
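The scaling-law calibration lends itself to a short curve-fitting illustration. The sketch below fits a saturating power law to hypothetical pilot fine-tuning results and extrapolates the payoff of annotating more questions; the data points, functional form, and names are assumptions for illustration, not figures from the paper.

```python
# A minimal sketch of scaling-law-calibrated annotation budgeting.
# All numbers and the functional form are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def f1_scaling(n, a, b, c):
    """Saturating power law: F1 approaches a as annotation count n grows."""
    return a - b * np.power(n, -c)

# Hypothetical pilot runs: (manually annotated questions, resulting F1).
n_annotated = np.array([300.0, 600.0, 1200.0, 2300.0])
f1 = np.array([0.41, 0.52, 0.60, 0.66])

(a, b, c), _ = curve_fit(f1_scaling, n_annotated, f1, p0=(0.8, 5.0, 0.5), maxfev=10_000)

# Extrapolate before spending more annotation budget.
for n in (2300, 4600, 9200):
    print(f"n={n:5d}  predicted F1={f1_scaling(n, a, b, c):.3f}")
```

If the fitted curve has already flattened near the current budget, further annotation buys little F1; otherwise the extrapolation argues for more labeling.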
Related papers
- Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets [5.8465717270452195]
We show how scaling law derivation can be used for model and dataset comparison.
For the first time, full scaling laws are derived for two important language-vision learning procedures, CLIP and MaMMUT.
We show that comparison can also be performed when deriving scaling laws with a constant learning rate schedule.
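As a toy illustration of this kind of comparison (synthetic losses and compute values, not the paper's measurements), one can fit a log-log line per training procedure and compare the extrapolated losses:

```python
# Illustrative scaling-law comparison of two training procedures.
# All losses and compute values below are synthetic placeholders.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])    # training FLOPs
loss_a = np.array([3.10, 2.65, 2.30, 2.02])     # procedure A (CLIP-style)
loss_b = np.array([3.20, 2.60, 2.18, 1.86])     # procedure B (MaMMUT-style)

# Fit log(loss) = alpha * log(compute) + log(a): a line in log-log space.
for name, losses in [("A", loss_a), ("B", loss_b)]:
    alpha, log_a = np.polyfit(np.log(compute), np.log(losses), 1)
    pred = np.exp(log_a) * 1e23 ** alpha
    print(f"procedure {name}: exponent {alpha:.3f}, predicted loss at 1e23 FLOPs = {pred:.3f}")
```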
arXiv Detail & Related papers (2025-06-05T03:35:59Z)
- DataDecide: How to Predict Best Pretraining Data with Small Experiments [67.95896457895404]
We release models, data, and evaluations in DataDecide -- the most extensive open suite of models over differences in data and scale.
We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds.
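The decision rule at the heart of this setup can be mimicked in a few lines: rank corpora by small-scale proxy results and check how well that ranking predicts large-scale outcomes. All names and scores below are synthetic placeholders, not DataDecide's measurements:

```python
# Toy version of picking pretraining data from small experiments.
import numpy as np
from scipy.stats import spearmanr

corpora = ["corpus-a", "corpus-b", "corpus-c", "corpus-d", "corpus-e"]
small_scale_acc = np.array([0.42, 0.47, 0.45, 0.44, 0.41])  # small proxy models
large_scale_acc = np.array([0.58, 0.64, 0.63, 0.60, 0.57])  # 1B-scale targets

rho, _ = spearmanr(small_scale_acc, large_scale_acc)
pick = corpora[int(np.argmax(small_scale_acc))]
print(f"small-to-large rank correlation: {rho:.2f}; small-scale pick: {pick}")
```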
arXiv Detail & Related papers (2025-04-15T17:02:15Z)
- Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting [15.251425165987987]
Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities.
We propose a sample weighting scheme for the fine-tuning data based on the pre-trained model's losses.
We empirically demonstrate the efficacy of our method on both language and vision tasks.
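A hedged sketch of one plausible instantiation of such a scheme (not necessarily the paper's exact rule): score each fine-tuning example by the pre-trained model's loss and upweight the low-loss, "easy" ones:

```python
# Toy sample weighting: upweight examples the pre-trained model already
# handles well. Temperature and weighting rule are illustrative assumptions.
import torch

def easy_sample_weights(pretrained_losses: torch.Tensor, temperature: float = 1.0):
    """Lower pre-trained loss => higher weight; weights sum to the batch size."""
    w = torch.softmax(-pretrained_losses / temperature, dim=0)
    return w * len(pretrained_losses)

# Hypothetical per-sample losses of the frozen pre-trained model on a batch.
pretrained = torch.tensor([0.3, 0.5, 2.4, 4.1])
weights = easy_sample_weights(pretrained)

# Apply the weights to the current model's per-sample fine-tuning losses.
current = torch.tensor([0.9, 1.1, 2.0, 3.5])
print(weights, (weights * current).mean())
```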
arXiv Detail & Related papers (2025-02-05T00:49:59Z)
- Leveraging Model Guidance to Extract Training Data from Personalized Diffusion Models [27.90276817753197]
Diffusion Models (DMs) have become powerful image generation tools.
Many people upload fine-tuned checkpoints online, fostering communities such as Civitai and HuggingFace.
We ask: "Can training data be extracted from these fine-tuned DMs shared online?"
We propose FineXtract, a framework for extracting fine-tuning data.
arXiv Detail & Related papers (2024-10-03T23:06:11Z)
- Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
We present a framework that selects high-quality pretraining data without any LLM training of our own.
We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations.
Our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM.
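A toy version of the underlying statistic: across several public models, correlate each candidate domain's log-perplexity with benchmark accuracy and keep the domains where lower perplexity tracks higher accuracy (all numbers synthetic):

```python
# Toy perplexity-correlation data selection. Models, domains, and numbers
# are synthetic placeholders for the paper's estimator.
import numpy as np

# Rows: 5 public models; columns: log-perplexity on 3 candidate domains.
log_ppl = np.array([
    [2.9, 3.4, 3.1],
    [2.7, 3.5, 3.0],
    [2.5, 3.2, 3.2],
    [2.3, 3.3, 2.8],
    [2.1, 3.1, 2.9],
])
bench_acc = np.array([0.48, 0.52, 0.58, 0.63, 0.70])  # same models' benchmark scores

for d in range(log_ppl.shape[1]):
    r = np.corrcoef(log_ppl[:, d], bench_acc)[0, 1]
    keep = "keep" if r < -0.5 else "drop"
    print(f"domain {d}: corr(log-ppl, accuracy) = {r:+.2f} -> {keep}")
```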
arXiv Detail & Related papers (2024-09-09T17:23:29Z)
- Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which require training a separate denoising model, our method offers significantly better efficiency and flexibility.
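A minimal sketch of the denoise-then-predict loop; llm_denoise and llm_classify are hypothetical stand-ins for real model calls, and the masking rate is an arbitrary choice:

```python
# Toy self-denoised smoothing: randomly mask words, have the model
# reconstruct the input, then majority-vote over the resulting predictions.
import random
from collections import Counter

def llm_denoise(noisy: str) -> str:
    # Placeholder: a real system would prompt the LLM to reconstruct the text.
    return noisy.replace("[MASK]", "...")

def llm_classify(text: str) -> str:
    # Placeholder: a real system would prompt the LLM for a label.
    return "positive" if "good" in text else "negative"

def mask_words(text: str, rate: float = 0.3) -> str:
    return " ".join("[MASK]" if random.random() < rate else w for w in text.split())

def smoothed_predict(text: str, n_copies: int = 5) -> str:
    votes = [llm_classify(llm_denoise(mask_words(text))) for _ in range(n_copies)]
    return Counter(votes).most_common(1)[0][0]

print(smoothed_predict("the movie was surprisingly good and heartfelt"))
```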
arXiv Detail & Related papers (2024-04-18T15:47:00Z)
- OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data [0.0]
In this work, using OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the OpenBezoar family of models.
We first generate synthetic instruction fine-tuning data using an open and commercially non-restrictive instruction fine-tuned variant of the Falcon-40B model.
We then perform cost-effective QLoRA-based supervised fine-tuning sequentially with each scheme.
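A hedged configuration sketch of QLoRA-style SFT with the transformers, peft, and bitsandbytes stacks; the base-model id, rank, and target modules are illustrative choices, not the OpenBezoar recipe:

```python
# Illustrative QLoRA setup: 4-bit base weights plus trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b_v2",  # example base; adjust as needed
    quantization_config=bnb_config,
    device_map="auto",
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```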
arXiv Detail & Related papers (2024-04-18T13:57:18Z)
- Adversarial Robustness Limits via Scaling-Law and Human-Alignment Studies [19.100334346597982]
We analyze how model size, dataset size, and synthetic data quality affect robustness by developing the first scaling laws for adversarial training.
Our scaling laws reveal inefficiencies in prior art and provide actionable feedback to advance the field.
arXiv Detail & Related papers (2024-04-14T20:14:38Z)
- Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
Moreover, scaling laws mostly predict loss, while models are ultimately compared on downstream task performance.
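One simple way to bridge that gap, sketched below on synthetic values, is a two-step fit: a scaling law predicts loss, and a second fitted curve maps loss to downstream error. The exponential link used here is an illustrative choice, not the paper's exact functional form:

```python
# Two-step prediction sketch: compute -> loss via a scaling law, then
# loss -> downstream top-1 error via a fitted link. All numbers synthetic.
import numpy as np
from scipy.optimize import curve_fit

losses = np.array([3.2, 2.9, 2.6, 2.4, 2.2])         # validation losses
top1_err = np.array([0.72, 0.66, 0.58, 0.53, 0.48])  # matching downstream errors

def err_from_loss(loss, e_floor, k, m):
    # Error decays toward e_floor as loss improves; 2.0 is a reference loss.
    return e_floor + k * (1 - np.exp(-m * (loss - 2.0)))

(e_floor, k, m), _ = curve_fit(err_from_loss, losses, top1_err, p0=(0.4, 0.5, 1.0), maxfev=20_000)
print(f"predicted downstream error at loss 2.0: {err_from_loss(2.0, e_floor, k, m):.3f}")
```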
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
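A minimal sketch in the spirit of gradient-based sparse updating; the top-k criterion and 1% density here are an illustrative simplification, not SIFT's exact selection rule:

```python
# Sparse fine-tuning sketch: update only the parameters with the largest
# gradient magnitudes, leaving the rest of the model untouched.
import torch

def sparse_step(param: torch.Tensor, lr: float = 1e-4, density: float = 0.01):
    grad = param.grad
    k = max(1, int(density * grad.numel()))
    _, idx = torch.topk(grad.abs().flatten(), k)  # k largest-magnitude entries
    mask = torch.zeros_like(grad).flatten()
    mask[idx] = 1.0
    param.data -= lr * grad * mask.view_as(grad)

weight = torch.randn(256, 256, requires_grad=True)
loss = (weight ** 2).sum()
loss.backward()
sparse_step(weight)
print(f"touched {int(0.01 * weight.numel())} of {weight.numel()} parameters")
```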
arXiv Detail & Related papers (2023-12-19T06:06:30Z)
- Zephyr: Direct Distillation of LM Alignment [59.03530095974505]
We aim to produce a smaller language model that is aligned to user intent.
Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy.
We apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment.
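At its core, dDPO applies the standard DPO objective to distilled (AI-feedback) preference pairs. A self-contained sketch of that loss, with synthetic sequence log-probabilities:

```python
# Standard DPO loss over (chosen, rejected) pairs; dDPO applies it to
# distilled AI-feedback preferences. Log-probabilities below are synthetic.
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))), batch-averaged."""
    logits = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * logits).mean()

# Synthetic per-sequence log-probs for two preference pairs.
pi_c = torch.tensor([-12.1, -15.3])
pi_r = torch.tensor([-14.0, -15.1])
ref_c = torch.tensor([-13.0, -15.0])
ref_r = torch.tensor([-13.5, -14.8])
print(dpo_loss(pi_c, pi_r, ref_c, ref_r))
```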
arXiv Detail & Related papers (2023-10-25T19:25:16Z)
- OpenChat: Advancing Open-source Language Models with Mixed-Quality Data [29.938434364765534]
We present a novel framework, named OpenChat, to advance open-source language models with mixed-quality data.
We propose the C(onditioned)-RLFT, which regards different data sources as coarse-grained reward labels and learns a class-conditioned policy.
Our openchat-13b fine-tuned with C-RLFT achieves the highest average performance among all 13b open-source language models.
arXiv Detail & Related papers (2023-09-20T11:54:40Z)
- Boosting Commit Classification with Contrastive Learning [0.8655526882770742]
Commit Classification (CC) is an important task in software maintenance.
We propose a contrastive learning-based commit classification framework.
Our framework can solve the CC problem simply but effectively in few-shot scenarios.
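A hedged sketch of the contrastive ingredient: pull embeddings of same-label commits together and push different labels apart with a supervised InfoNCE-style loss. The encoder is omitted; embeddings and labels are random placeholders:

```python
# Supervised contrastive loss over commit embeddings: commits sharing a
# label are positives. Inputs here are random placeholders.
import torch
import torch.nn.functional as F

def sup_con_loss(embeddings: torch.Tensor, labels: torch.Tensor, tau: float = 0.1):
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / tau
    self_mask = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))  # drop self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(1).clamp(min=1)
    # Negative mean log-probability of same-label pairs per anchor.
    return -(log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_counts).mean()

emb = torch.randn(8, 64)                         # stand-in encoder outputs
labels = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])  # commit categories
print(sup_con_loss(emb, labels))
```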
arXiv Detail & Related papers (2023-08-16T10:02:36Z)
- UnrealPerson: An Adaptive Pipeline towards Costless Person Re-identification [102.58619642363959]
This paper presents UnrealPerson, a novel pipeline that makes full use of unreal image data to decrease the costs in both the training and deployment stages.
With 3,000 IDs and 120,000 instances, our method achieves a 38.5% rank-1 accuracy when directly transferred to MSMT17.
arXiv Detail & Related papers (2020-12-08T08:15:30Z)
- Improving Semi-supervised Federated Learning by Reducing the Gradient Diversity of Models [67.66144604972052]
Federated learning (FL) is a promising way to use the computing power of mobile devices while maintaining the privacy of users.
We show that a critical issue that affects the test accuracy is the large gradient diversity of the models from different users.
We propose a novel grouping-based model averaging method to replace the FedAvg averaging method.
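A toy sketch of grouping-based averaging; the greedy cosine-similarity grouping below is a simple stand-in for the paper's method, not its actual algorithm:

```python
# Toy grouping-based model averaging: cluster client updates by cosine
# similarity, average within each group, then average the group means.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def grouped_average(updates: list[np.ndarray], threshold: float = 0.5) -> np.ndarray:
    groups: list[list[np.ndarray]] = []
    for u in updates:
        for g in groups:
            if cosine(u, g[0]) > threshold:  # join the first similar group
                g.append(u)
                break
        else:
            groups.append([u])
    group_means = [np.mean(g, axis=0) for g in groups]
    return np.mean(group_means, axis=0)      # equal weight per group, not per client

rng = np.random.default_rng(0)
client_updates = [rng.normal(size=128) for _ in range(10)]
print(grouped_average(client_updates).shape)
```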
arXiv Detail & Related papers (2020-08-26T03:36:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.