Related papers: SeDi-Instruct: Enhancing Alignment of Language Models through Self-Directed Instruction Generation

SeDi-Instruct: Enhancing Alignment of Language Models through Self-Directed Instruction Generation

URL: http://arxiv.org/abs/2502.04774v1
Date: Fri, 07 Feb 2025 09:20:11 GMT
Title: SeDi-Instruct: Enhancing Alignment of Language Models through Self-Directed Instruction Generation
Authors: Jungwoo Kim, Minsang Kim, Sungjin Lee,
Abstract summary: We propose a novel data generation framework, Self-Direct Instruction generation (SeDi-Instruct), which employs diversity-based filtering and iterative feedback task generation.<n>SeDi-Instruct enhances the accuracy of AI models by 5.2%, compared with traditional methods, while reducing data generation costs by 36%.
Score: 7.066883955432192
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid evolution of Large Language Models (LLMs) has enabled the industry to develop various AI-based services. Instruction tuning is considered essential in adapting foundation models for target domains to provide high-quality services to customers. A key challenge in instruction tuning is obtaining high-quality instruction data. Self-Instruct, which automatically generates instruction data using ChatGPT APIs, alleviates the data scarcity problem. To improve the quality of instruction data, Self-Instruct discards many of the instructions generated from ChatGPT, even though it is inefficient in terms of cost owing to many useless API calls. To generate high-quality instruction data at a low cost, we propose a novel data generation framework, Self-Direct Instruction generation (SeDi-Instruct), which employs diversity-based filtering and iterative feedback task generation. Diversity-based filtering maintains model accuracy without excessively discarding low-quality generated instructions by enhancing the diversity of instructions in a batch. This reduces the cost of synthesizing instruction data. The iterative feedback task generation integrates instruction generation and training tasks and utilizes information obtained during the training to create high-quality instruction sets. Our results show that SeDi-Instruct enhances the accuracy of AI models by 5.2%, compared with traditional methods, while reducing data generation costs by 36%.

Related papers

Evaluating Language Models as Synthetic Data Generators [74.80905172696366]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z)
REInstruct: Building Instruction Data from Unlabeled Corpus [49.82314244648043]
We propose REInstruct, a method to automatically build instruction data from an unlabeled corpus. By training Llama-7b on a combination of 3k seed data and 32k synthetic data from REInstruct, fine-tuned model achieves a 65.41% win rate on AlpacaEval leaderboard against text-davinci-003.
arXiv Detail & Related papers (2024-08-20T09:05:03Z)
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models [54.14602121129874]
We introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification.
arXiv Detail & Related papers (2024-06-19T13:29:53Z)
Mosaic-IT: Free Compositional Data Augmentation Improves Instruction Tuning [30.82220015525281]
Mosaic Instruction Tuning (Mosaic-IT) is a human/model-free compositional data augmentation method. Mosaic-IT randomly creates rich and diverse augmentations from existing instruction tuning data. Our evaluations demonstrate a superior performance and training efficiency of Mosaic-IT.
arXiv Detail & Related papers (2024-05-22T04:08:20Z)
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator [63.762209407570715]
Genixer is a comprehensive data generation pipeline consisting of four key steps. A synthetic VQA-like dataset trained with LLaVA1.5 enhances performance on 10 out of 12 multimodal benchmarks. MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data.
arXiv Detail & Related papers (2023-12-11T09:44:41Z)
Safer-Instruct: Aligning Language Models with Automated Preference Data [20.177660013450176]
Reinforcement learning from human feedback is a vital strategy for enhancing model capability in language models. We present Safer-Instruct, a novel pipeline for automatically constructing large-scale preference data. Our approach leverages reversed instruction tuning, instruction induction, and expert model evaluation to efficiently generate high-quality preference data.
arXiv Detail & Related papers (2023-11-15T04:22:22Z)
DoG-Instruct: Towards Premium Instruction-Tuning Data via Text-Grounded Instruction Wrapping [41.89443082174044]
This paper proposes a scalable solution to the problem of finding high-quality instruction-response pairs. It involves training LLMs to generate pairs based on human-written documents, rather than relying solely on self-generation without context. Our proposed method not only exploits the advantages of human-written documents in reducing hallucinations but also utilizes an LLM to wrap the expression of documents.
arXiv Detail & Related papers (2023-09-11T13:41:18Z)
Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models [32.41573520305861]
We explore alternative approaches to generate high-quality instruction data that do not rely on closed-source models. Evaluation results from two benchmarks and the GPT-4 model demonstrate the effectiveness of our generated instruction data.
arXiv Detail & Related papers (2023-08-24T11:07:47Z)
STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances. We design fine-grained step-by-step instructions to obtain the initial data instances. Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation [92.2167864437497]
We propose Dynosaur, a dynamic growth paradigm for the automatic curation of instruction-tuning data. Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions. By leveraging the existing annotated datasets, Dynosaur offers several advantages: 1) it reduces the API cost for generating instructions; 2) it provides high-quality data for instruction tuning; and 3) it supports the continuous improvement of models by generating instruction-tuning data when a new annotated dataset becomes available.
arXiv Detail & Related papers (2023-05-23T17:56:26Z)
Self-Instruct: Aligning Language Models with Self-Generated Instructions [76.42871502364697]
Self-Instruct is a framework for improving the instruction-following capabilities of pretrained language models. Our pipeline generates instructions, input, and output samples from a language model, then filters invalid or similar ones before using them to finetune the original model. For further evaluation, we curate a set of expert-written instructions for novel tasks, and show through human evaluation that tuning GPT3 with Self-Instruct outperforms using existing public instruction datasets by a large margin.
arXiv Detail & Related papers (2022-12-20T18:59:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.