Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction
- URL: http://arxiv.org/abs/2504.15573v1
- Date: Tue, 22 Apr 2025 04:07:13 GMT
- Title: Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction
- Authors: Yuxin Jiang, Yufei Wang, Chuhan Wu, Xinyi Dai, Yan Xu, Weinan Gan, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Wei Wang
- Abstract summary: Web Reconstruction (WebR) is a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents. Datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks.
- Score: 83.0216122783429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The improvement of LLMs' instruction-following capabilities depends critically on the availability of high-quality instruction-response pairs. While existing automatic data synthesis methods alleviate the burden of manual curation, they often rely heavily on either the quality of seed data or strong assumptions about the structure and content of web documents. To tackle these challenges, we propose Web Reconstruction (WebR), a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents with minimal assumptions. Leveraging the inherent diversity of raw web content, we conceptualize web reconstruction as an instruction-tuning data synthesis task via a novel dual-perspective paradigm, Web as Instruction and Web as Response, where each web document is designated as either an instruction or a response to trigger the reconstruction process. Comprehensive experiments show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks. Notably, WebR demonstrates superior compatibility, data efficiency, and scalability, enabling enhanced domain adaptation with minimal effort. The data and code are publicly available at https://github.com/YJiangcm/WebR.
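The dual-perspective paradigm described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `llm` stub, prompt wordings, and the random perspective assignment are all assumptions made for readability; a real pipeline would call an actual model API and likely use more elaborate prompting and filtering.

```python
import random


def llm(prompt: str) -> str:
    """Placeholder for an LLM call; a real pipeline would query a model API."""
    return f"<generated from: {prompt[:40]}...>"


def web_as_instruction(doc: str) -> dict:
    # Perspective 1: treat the raw document as a (noisy) latent instruction.
    # Rewrite it into a clean instruction, then answer that instruction.
    instruction = llm(f"Rewrite this web text as a user instruction:\n{doc}")
    response = llm(f"Answer the following instruction:\n{instruction}")
    return {"instruction": instruction, "response": response}


def web_as_response(doc: str) -> dict:
    # Perspective 2: treat the document as a latent response.
    # Infer an instruction it could answer, then refine the document
    # into a well-formed response to that instruction.
    instruction = llm(f"What instruction could this text answer?\n{doc}")
    response = llm(f"Refine this text into a response to '{instruction}':\n{doc}")
    return {"instruction": instruction, "response": response}


def webr_synthesize(docs: list[str]) -> list[dict]:
    # Each web document triggers reconstruction under one of the two
    # perspectives, yielding an instruction-response pair.
    pairs = []
    for doc in docs:
        build = random.choice([web_as_instruction, web_as_response])
        pairs.append(build(doc))
    return pairs
```

The key point the sketch captures is that no seed instructions are required: every training pair is bootstrapped from a raw web document alone.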
Related papers
- Synthetic Data Generation Using Large Language Models: Advances in Text and Code [0.0]
Large language models (LLMs) have unlocked new possibilities for generating synthetic training data in both natural language and code. We show how these methods enrich low-resource tasks such as classification and question answering. We address challenges like factual inaccuracies in generated text, lack of stylistic realism, and the risk of bias amplification.
arXiv Detail & Related papers (2025-03-18T08:34:03Z) - AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials [53.376263056033046]
Existing approaches rely on expensive human annotation, making them unsustainable at scale. We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials. Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z) - Enhancing Document AI Data Generation Through Graph-Based Synthetic Layouts [0.8245350546263803]
We propose a novel approach to synthetic document layout generation using Graph Neural Networks (GNNs). By representing document elements as nodes in a graph, GNNs are trained to generate realistic and diverse document layouts. Our experimental results show that graph-augmented document layouts outperform existing augmentation techniques.
arXiv Detail & Related papers (2024-11-27T21:15:02Z) - Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z) - Towards Realistic Synthetic User-Generated Content: A Scaffolding Approach to Generating Online Discussions [17.96479268328824]
We investigate the feasibility of creating realistic, large-scale synthetic datasets of user-generated content.
We propose a multi-step generation process, predicated on the idea of creating compact representations of discussion threads.
arXiv Detail & Related papers (2024-08-15T18:43:50Z) - Better Alignment with Instruction Back-and-Forth Translation [120.19298407990267]
We propose a new method, instruction back-and-forth translation, to construct high-quality synthetic data grounded in world knowledge.
Given documents from a web corpus, we generate and curate synthetic instructions using the backtranslation approach proposed by Li et al.
We find that instruction back-and-forth translation combines the best of both worlds: it makes use of the information diversity and quantity found on the web, while ensuring the response quality necessary for effective alignment.
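The two steps of back-and-forth translation can be sketched as below. This is an illustrative reading of the summary, not the paper's code: the `llm` parameter and prompt texts are assumptions; the real method additionally curates and filters the generated instructions.

```python
def back_and_forth(doc: str, llm) -> tuple[str, str]:
    """Sketch of instruction back-and-forth translation for one web document.

    llm: a callable mapping a prompt string to a generated string.
    """
    # "Back" step (backtranslation): predict an instruction for which
    # the raw web document could serve as the answer.
    instruction = llm(
        f"Generate an instruction for which this text is the answer:\n{doc}"
    )
    # "Forth" step: rewrite the raw document into a clean, direct
    # response to that instruction, improving its quality as training data.
    response = llm(
        f"Rewrite this text as a direct answer to '{instruction}':\n{doc}"
    )
    return instruction, response
```

Grounding the response in a real web document preserves the web's factual diversity, while the rewriting pass supplies the response quality that raw pages lack.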
arXiv Detail & Related papers (2024-08-08T17:42:32Z) - AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z) - Multi-modal Learning for WebAssembly Reverse Engineering [7.18491643197374]
We present WasmRev, the first multi-modal pre-trained language model for WebAssembly reverse engineering.
WasmRev is pre-trained using self-supervised learning on a large-scale multi-modal corpus.
We fine-tune WasmRev onto three important reverse engineering tasks: type recovery, function purpose identification and WebAssembly summarization.
arXiv Detail & Related papers (2024-04-04T03:03:38Z) - The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models [51.39011092347136]
We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety.
First, we empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task.
Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page.
Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
arXiv Detail & Related papers (2021-11-03T12:13:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.