AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs
- URL: http://arxiv.org/abs/2507.18584v1
- Date: Thu, 24 Jul 2025 17:03:27 GMT
- Title: AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs
- Authors: Xiaopeng Ke, Hexuan Deng, Xuebo Liu, Jun Rao, Zhenxi Song, Jun Yu, Min Zhang
- Abstract summary: AQuilt is a framework for constructing instruction-tuning data for any specialized domain from corresponding unlabeled data. We construct a dataset of 703k examples to train a powerful data synthesis model.
- Score: 24.511434195711267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the impressive performance of large language models (LLMs) in general domains, they often underperform in specialized domains. Existing approaches typically rely on data synthesis methods and yield promising results by using unlabeled data to capture domain-specific features. However, these methods either incur high computational costs or suffer from performance limitations, and they generalize insufficiently across different tasks. To address these challenges, we propose AQuilt, a framework for constructing instruction-tuning data for any specialized domain from corresponding unlabeled data, built around Answer, Question, Unlabeled data, Inspection, Logic, and Task type (the components that give AQuilt its name). By incorporating logic and inspection, we encourage reasoning processes and self-inspection to enhance model performance. Moreover, customizable task instructions enable high-quality data generation for any task. As a result, we construct a dataset of 703k examples to train a powerful data synthesis model. Experiments show that AQuilt is comparable to DeepSeek-V3 while incurring just 17% of the production cost. Further analysis demonstrates that our generated data exhibits higher relevance to downstream tasks. Source code, models, and scripts are available at https://github.com/Krueske/AQuilt.
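The abstract spells out AQuilt's name as the components of each synthesized example. As a concrete illustration, here is a minimal sketch of what one such record might look like; the field names and example values are assumptions inferred from the component list, not the paper's actual schema.

```python
# Hypothetical AQuilt-style synthesized example, based only on the component
# names in the abstract (Answer, Question, Unlabeled data, Inspection, Logic,
# Task type). The field names and values are illustrative assumptions.
from dataclasses import dataclass, asdict
import json

@dataclass
class AquiltExample:
    task_type: str        # customizable task instruction, e.g. "extractive QA"
    unlabeled_data: str   # raw domain text the example is grounded in
    question: str         # question/instruction synthesized from that text
    logic: str            # explicit reasoning process encouraged by the framework
    answer: str           # final answer conditioned on the logic
    inspection: str       # self-inspection verdict used to filter weak samples

example = AquiltExample(
    task_type="extractive QA",
    unlabeled_data="Aspirin irreversibly inhibits the COX-1 enzyme...",
    question="Which enzyme does aspirin irreversibly inhibit?",
    logic="The passage states that aspirin irreversibly inhibits COX-1.",
    answer="COX-1",
    inspection="PASS: the answer is supported by the source text",
)
print(json.dumps(asdict(example), indent=2))
```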
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples that expose the weaknesses of target models.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
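The "failure-inducing exploration" in ReverseGen's title suggests a propose-then-filter loop: generate candidate queries, keep the ones the target model answers incorrectly, and use those failures as training data. A toy sketch under that assumption, with stand-in stubs for the proposer, target model, and checker:

```python
# Toy failure-inducing exploration in the spirit of ReverseGen. The proposer,
# target model, and checker below are stand-in stubs, not the paper's components.
import random

def propose(n):                       # stand-in for a learned proposer LLM
    return [f"Compute {a} * {b}" for a, b in
            ((random.randint(2, 99), random.randint(2, 99)) for _ in range(n))]

def target_model(query):              # stand-in for the model being probed
    a, b = map(int, query.removeprefix("Compute ").split(" * "))
    return a * b if max(a, b) < 50 else a + b   # deliberately fails on large operands

def is_correct(query, answer):        # task-specific correctness checker
    a, b = map(int, query.removeprefix("Compute ").split(" * "))
    return answer == a * b

# keep only the queries the target model gets wrong
failures = [(q, target_model(q)) for q in propose(200)
            if not is_correct(q, target_model(q))]
print(f"collected {len(failures)} failure-inducing examples for fine-tuning")
```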
- Efficacy of Synthetic Data as a Benchmark [3.2968976262860408]
We investigate the effectiveness of generating synthetic benchmark data through large language models (LLMs).
Our experiments show that while synthetic data can effectively capture the performance of various methods on simpler tasks, it falls short for more complex tasks like named entity recognition.
We propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used to both generate benchmarking data and to perform the tasks.
arXiv Detail & Related papers (2024-09-18T13:20:23Z)
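The abstract does not give the bias factor's formula; one plausible reading is a ratio of a model's score on a benchmark it generated itself to its score on a benchmark generated by a different LLM. A hedged sketch of that reading, with placeholder numbers:

```python
# One plausible reading of the "bias factor" described above; the paper's
# actual definition may differ. The scores are placeholders showing only
# the arithmetic, not real results.
def bias_factor(score_on_own_data: float, score_on_other_data: float) -> float:
    """Values above 1.0 suggest the model is flattered by its own synthetic benchmark."""
    return score_on_own_data / score_on_other_data

# placeholder accuracies for a model evaluated on two synthetic benchmarks
print(bias_factor(score_on_own_data=0.86, score_on_other_data=0.74))  # ~1.16
```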
- CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology, medicine, and commonsense question-answering (QA), as well as summarization.
Our experiments show that CRAFT-based models outperform or match general LLMs on QA tasks, while exceeding models trained on human-curated summarization data by 46 preference points.
arXiv Detail & Related papers (2024-09-03T17:54:40Z)
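CRAFT's pipeline, as summarized above, retrieves corpus documents similar to a handful of user-written demonstrations and then augments them into task examples. A simplified sketch of the retrieval step, using bag-of-words cosine similarity in place of whatever retriever the paper actually uses:

```python
# Simplified CRAFT-style corpus retrieval: rank documents by similarity to a
# few user-written demonstrations. Bag-of-words cosine stands in for the
# paper's actual retriever; the LLM augmentation step is only noted in a comment.
from collections import Counter
from math import sqrt

def bow(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / (norm or 1.0)

few_shots = ["What enzyme does aspirin inhibit? COX-1"]
corpus = ["Aspirin inhibits the enzyme COX-1 irreversibly.",
          "The stock market closed higher on Tuesday."]

query = bow(" ".join(few_shots))
ranked = sorted(corpus, key=lambda d: cosine(query, bow(d)), reverse=True)
for doc in ranked[:1]:
    # in CRAFT, each retrieved document would be rewritten into a task example by an LLM
    print("retrieved ->", doc)
```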
- API-guided Dataset Synthesis to Finetune Large Code Models [13.835835256858653]
Large code models (LCMs) are pre-trained on vast code corpora.
Supervised fine-tuning (SFT) plays a vital role in aligning these models with specific requirements.
Constructing high-quality SFT datasets poses a significant challenge due to uneven data quality and the scarcity of domain-specific data.
We propose DataScope, an API-guided dataset synthesis framework.
arXiv Detail & Related papers (2024-08-15T14:48:42Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM and then fine-tune the student LLM on them.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- TAIA: Large Language Models are Out-of-Distribution Data Learners [30.57872423927015]
We propose an effective inference-time intervention method: Training All parameters but Inferring with only Attention (TAIA).
TAIA achieves superior improvements compared to both the fully fine-tuned model and the base model in most scenarios.
TAIA's high tolerance to data mismatch makes it resistant to jailbreak tuning and lets it enhance specialized tasks using general data.
arXiv Detail & Related papers (2024-05-30T15:57:19Z)
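The expansion "Training All parameters but Inferring with only Attention" suggests fine-tuning everything but, at inference, keeping only the attention-module updates and reverting the rest to base weights. A minimal sketch of that merge step; the parameter-name filter is a guess at common transformer naming, not the paper's exact rule:

```python
# Minimal sketch of the TAIA idea: fine-tune ALL parameters, but at inference
# keep only the attention-module updates and revert everything else to the
# base weights. The substring filter is an assumed naming convention.
def taia_merge(base_state: dict, tuned_state: dict) -> dict:
    attn_markers = ("attn", "attention")          # assumed parameter naming
    return {
        name: tuned_state[name] if any(m in name for m in attn_markers)
        else base_state[name]
        for name in base_state
    }

base  = {"layers.0.attn.q_proj": "W_base", "layers.0.mlp.fc1": "W_base"}
tuned = {"layers.0.attn.q_proj": "W_ft",   "layers.0.mlp.fc1": "W_ft"}
print(taia_merge(base, tuned))
# {'layers.0.attn.q_proj': 'W_ft', 'layers.0.mlp.fc1': 'W_base'}
```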
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the reasoning skills needed for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
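LESS ranks candidate training examples by how well their gradient features align with those of a small target set, then keeps a small top fraction (the 5% mentioned above). A toy sketch with stand-in gradient vectors; the real method computes low-rank projected LoRA gradients, which this omits:

```python
# Toy LESS-style data selection: rank candidates by cosine similarity between
# their gradient features and the target set's mean gradient, then keep the
# top fraction. Gradient vectors here are stand-ins for illustration only.
from math import sqrt

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / (norm or 1.0)

candidate_grads = {"ex1": [0.9, 0.1], "ex2": [-0.2, 0.8], "ex3": [0.7, 0.3]}
target_grad = [1.0, 0.2]              # mean gradient over the validation set

ranked = sorted(candidate_grads,
                key=lambda k: cos(candidate_grads[k], target_grad),
                reverse=True)
top_5_percent = ranked[:max(1, len(ranked) // 20)]
print("selected:", top_5_percent)     # -> ['ex1']
```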
- Making Large Language Models Better Data Creators [22.0882632635255]
Large language models (LLMs) have advanced the state-of-the-art in NLP significantly.
However, deploying them for downstream applications remains challenging due to cost, responsiveness, control, and concerns around privacy and security.
We propose a unified data creation pipeline that requires only a single format example.
arXiv Detail & Related papers (2023-10-31T01:08:34Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- FairGen: Fair Synthetic Data Generation [0.3149883354098941]
We propose a pipeline to generate fairer synthetic data independent of the GAN architecture.
We claim that while generating synthetic data, most GANs amplify bias present in the training data, but that removing these bias-inducing samples makes GANs focus more on real, informative samples.
arXiv Detail & Related papers (2022-10-24T08:13:47Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)