GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
- URL: http://arxiv.org/abs/2505.20416v1
- Date: Mon, 26 May 2025 18:06:50 GMT
- Title: GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
- Authors: Zihong Chen, Wanli Jiang, Jinzhe Li, Zhonghang Yuan, Huanjun Kong, Wanli Ouyang, Nanqing Dong,
- Abstract summary: Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data.<n>Existing approaches suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs.<n>We introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios.
- Score: 41.31575016578663
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at https://github.com/open-sciencelab/GraphGen.
Related papers
- Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering [75.12322966980003]
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains.<n>Most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning.<n>Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering.<n>We propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA.
arXiv Detail & Related papers (2025-06-11T12:03:52Z) - GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models [75.25348392263676]
Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP)<n>We propose a novel framework named GenKI, which aims to improve the OpenQA performance by exploring Knowledge Integration and controllable Generation.
arXiv Detail & Related papers (2025-05-26T08:18:33Z) - Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation [75.9865035064794]
Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information.<n>Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system.<n>We propose Align-GRAG, a novel reasoning-guided dual alignment framework in post-retrieval phrase.
arXiv Detail & Related papers (2025-05-22T05:15:27Z) - Automatic Dataset Generation for Knowledge Intensive Question Answering Tasks [10.562940259841623]
This paper presents a novel approach for enhancing Large Language Models (LLMs) in knowledge-intensive QA tasks.<n>The proposed system includes an automated QA generator and a model fine-tuner, evaluated using perplexity, ROUGE, BLEU, and BERTScore.<n>Experiments demonstrate improvements in logical coherence and factual accuracy, with implications for developing adaptable Artificial Intelligence (AI) systems.
arXiv Detail & Related papers (2025-05-20T11:16:29Z) - OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.390060529534644]
We propose OS-Genesis, a novel data synthesis pipeline for Graphical User Interface (GUI) agents.<n>Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions.<n>We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks.
arXiv Detail & Related papers (2024-12-27T16:21:58Z) - Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch [54.12139707822201]
We propose ScaleQuest, a novel, scalable, and cost-effective data synthesis method.<n>By generating diverse questions from scratch, we produce a dataset of 1 million problem-solution pairs.<n>Our experiments demonstrate that models trained on our data outperform existing open-source datasets.
arXiv Detail & Related papers (2024-10-24T12:42:04Z) - Harnessing the Power of Large Language Model for Uncertainty Aware Graph Processing [24.685942503019948]
We introduce a novel approach that harnesses the power of a large language model (LLM) to provide a confidence score on the generated answer.
We experiment with our approach on two graph processing tasks: few-shot knowledge graph completion and graph classification.
Our confidence measure achieves an AUC of 0.8 or higher on seven out of the ten datasets in predicting the correctness of the answer generated by LLM.
arXiv Detail & Related papers (2024-03-31T07:38:39Z) - Automatic Question-Answer Generation for Long-Tail Knowledge [65.11554185687258]
We propose an automatic approach to generate specialized QA datasets for tail entities.
We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets.
arXiv Detail & Related papers (2024-03-03T03:06:31Z) - GenQ: Quantization in Low Data Regimes with Generative Synthetic Data [28.773641633757283]
We introduce GenQ, a novel approach employing an advanced Generative AI model to generate high-resolution synthetic data.
In case of limited data availability, the actual data is used to guide the synthetic data generation process.
Through rigorous experimentation, GenQ establishes new benchmarks in data-free and data-scarce quantization.
arXiv Detail & Related papers (2023-12-07T23:31:42Z) - Exploring the Viability of Synthetic Query Generation for Relevance
Prediction [18.77909480819682]
We conduct a study into how QGen approaches can be leveraged for nuanced relevance prediction.
We identify new shortcomings of existing QGen approaches -- including their inability to distinguish between different grades of relevance.
We introduce label-grained QGen models which incorporates knowledge about the different relevance.
arXiv Detail & Related papers (2023-05-19T18:03:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.