AIDE: Task-Specific Fine Tuning with Attribute Guided Multi-Hop Data Expansion
- URL: http://arxiv.org/abs/2412.06136v1
- Date: Mon, 09 Dec 2024 01:39:16 GMT
- Title: AIDE: Task-Specific Fine Tuning with Attribute Guided Multi-Hop Data Expansion
- Authors: Jiayu Li, Xuan Zhu, Fang Liu, Yanjun Qi,
- Abstract summary: Fine-tuning large language models (LLMs) for specific tasks requires high-quality, diverse training data relevant to the task.<n>Recent research has leveraged LLMs to synthesize training data, but existing approaches either depend on large seed datasets or struggle to ensure both task relevance and data diversity in the generated outputs.<n>We propose AIDE, a novel data synthesis framework that uses a multi-hop process to expand 10 seed data points while ensuring diversity and task relevance.
- Score: 15.916595953695603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning large language models (LLMs) for specific tasks requires high-quality, diverse training data relevant to the task. Recent research has leveraged LLMs to synthesize training data, but existing approaches either depend on large seed datasets or struggle to ensure both task relevance and data diversity in the generated outputs. To address these challenges, we propose AIDE, a novel data synthesis framework that uses a multi-hop process to expand 10 seed data points while ensuring diversity and task relevance. AIDE extracts the main topic and key knowledge attributes from the seed data to guide the synthesis process. In each subsequent hop, it extracts the topic and attributes from the newly generated data and continues guided synthesis. This process repeats for a total of K hops. To prevent irrelevant data generation as the hop depth increases, AIDE incorporates a residual connection mechanism and uses self-reflection to improve data quality. Our empirical results demonstrate that fine-tuning Mistral-7B, Llama-3.1-8B and Llama-3.2-3B with AIDE achieves more than 10% accuracy improvements over the base models across 13 tasks from 5 different benchmarks, while outperforming the models fine-tuned with state-of-the-art data synthesis methods like Evol-Instruct, DataTune and Prompt2Model.
Related papers
- AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs [24.511434195711267]
AQuilt is a framework for constructing instruction-tuning data for any specialized domains from corresponding unlabeled data.<n>We construct a dataset of 703k examples to train a powerful data synthesis model.
arXiv Detail & Related papers (2025-07-24T17:03:27Z) - UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.<n>We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process.<n>Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z) - BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation [71.46236155101032]
We propose Base-Refine, a synthetic data generation method that combines the diversity of base models with the quality of instruct-tuned models.
We show that fine-tuning with BARE-generated data achieves a 101% improvement over instruct-only data on GSM8K and a 18.4% improvement over SOTA methods on RAFT.
arXiv Detail & Related papers (2025-02-03T00:12:40Z) - Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai [5.670682861458055]
We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai.
We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context.
Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation.
arXiv Detail & Related papers (2024-11-23T07:50:59Z) - Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - Diversity-Driven Synthesis: Enhancing Dataset Distillation through Directed Weight Adjustment [39.137060714048175]
We argue that enhancing diversity can improve the parallelizable yet isolated approach to synthesizing datasets.
We introduce a novel method that employs dynamic and directed weight adjustment techniques to modulate the synthesis process.
Our method ensures that each batch of synthetic data mirrors the characteristics of a large, varying subset of the original dataset.
arXiv Detail & Related papers (2024-09-26T08:03:19Z) - What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Long language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios.
Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement.
We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent.
Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z) - MDM: Advancing Multi-Domain Distribution Matching for Automatic Modulation Recognition Dataset Synthesis [35.07663680944459]
Deep learning technology has been successfully introduced into Automatic Modulation Recognition (AMR) tasks.
The success of deep learning is all attributed to the training on large-scale datasets.
In order to solve the problem of large amount of data, some researchers put forward the method of data distillation.
arXiv Detail & Related papers (2024-08-05T14:16:54Z) - Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z) - Diffusion Model is an Effective Planner and Data Synthesizer for
Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (textscMTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find textscMTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual
Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person
Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.