GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
- URL: http://arxiv.org/abs/2505.20416v1
- Date: Mon, 26 May 2025 18:06:50 GMT
- Title: GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
- Authors: Zihong Chen, Wanli Jiang, Jinzhe Li, Zhonghang Yuan, Huanjun Kong, Wanli Ouyang, Nanqing Dong
- Abstract summary: Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data. Existing approaches suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. We introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios.
- Score: 41.31575016578663
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at https://github.com/open-sciencelab/GraphGen.
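The abstract names expected calibration error (ECE) as the metric used to locate knowledge gaps. As a rough illustration only, the standard binned form of ECE can be sketched as below. The function name, the equal-width binning, and the inputs (one confidence and one correctness flag per statement) are assumptions made for this example; the abstract does not say how GraphGen computes or applies the metric.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Standard binned ECE: sum over bins of (bin size / n) * |accuracy - mean confidence|.

    Hypothetical helper for illustration; not taken from the GraphGen code base.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # summed confidence per bin
    hits = [0] * n_bins        # number of correct predictions per bin
    counts = [0] * n_bins      # number of predictions per bin

    for c, ok in zip(confidences, correct):
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hits[idx] += int(ok)
        counts[idx] += 1

    ece = 0.0
    for s, h, k in zip(conf_sum, hits, counts):
        if k == 0:
            continue  # empty bins contribute nothing
        ece += (k / n) * abs(h / k - s / k)
    return ece
```

A large gap between a model's confidence and its actual accuracy indicates poorly calibrated knowledge, which is one plausible reading of the "knowledge gaps" the abstract says are prioritized.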
Related papers
- Unlocking Advanced Graph Machine Learning Insights through Knowledge Completion on Neo4j Graph Database [1.1059590443280725]
This paper proposes an innovative architecture that integrates a Knowledge Completion phase into GDB-GML applications. We show how revealing hidden knowledge can heavily impact datasets' behavior and metrics. Experimental results demonstrate that our intuition radically reshapes both topology and overall dataset dynamics.
arXiv Detail & Related papers (2025-11-14T15:27:31Z)
- Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs [3.222543736797976]
SynthKGQA is a framework for generating high-quality synthetic Knowledge Graph Question Answering datasets from any Knowledge Graph. We show how, in addition to enabling more informative benchmarking of KG retrievers, the data produced with SynthKGQA also allows us to train better models.
arXiv Detail & Related papers (2025-11-06T15:45:18Z)
- G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge [88.82814893945077]
Large language models (LLMs) excel at complex reasoning but remain limited by static and incomplete parametric knowledge. Recent graph-enhanced RAG (GraphRAG) attempts to bridge this gap by constructing tailored graphs and enabling LLMs to reason on them. G-reasoner is a unified framework that integrates graph and language foundation models for reasoning over diverse graph-structured knowledge.
arXiv Detail & Related papers (2025-09-29T04:38:12Z)
- Enrich-on-Graph: Query-Graph Alignment for Complex Reasoning with LLM Enriching [61.824094419641575]
Large Language Models (LLMs) struggle with hallucinations and factual errors in knowledge-intensive scenarios like knowledge graph question answering (KGQA). We attribute this to the semantic gap between structured knowledge graphs (KGs) and unstructured queries, caused by inherent differences in their focuses and structures. Existing methods usually employ resource-intensive, non-scalable reasoning on vanilla KGs, but overlook this gap. We propose a flexible framework, Enrich-on-Graph (EoG), which leverages LLMs' prior knowledge to enrich KGs and bridge the semantic gap between graphs and queries.
arXiv Detail & Related papers (2025-09-25T06:48:52Z)
- Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering [75.12322966980003]
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains. Most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering. We propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA.
arXiv Detail & Related papers (2025-06-11T12:03:52Z)
- GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models [75.25348392263676]
Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP). We propose a novel framework named GenKI, which aims to improve OpenQA performance by exploring Knowledge Integration and controllable Generation.
arXiv Detail & Related papers (2025-05-26T08:18:33Z)
- Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation [75.9865035064794]
Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. We propose Align-GRAG, a novel reasoning-guided dual alignment framework in the post-retrieval phase.
arXiv Detail & Related papers (2025-05-22T05:15:27Z)
- Automatic Dataset Generation for Knowledge Intensive Question Answering Tasks [10.562940259841623]
This paper presents a novel approach for enhancing Large Language Models (LLMs) in knowledge-intensive QA tasks. The proposed system includes an automated QA generator and a model fine-tuner, evaluated using perplexity, ROUGE, BLEU, and BERTScore. Experiments demonstrate improvements in logical coherence and factual accuracy, with implications for developing adaptable Artificial Intelligence (AI) systems.
arXiv Detail & Related papers (2025-05-20T11:16:29Z)
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.390060529534644]
We propose OS-Genesis, a novel data synthesis pipeline for Graphical User Interface (GUI) agents. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks.
arXiv Detail & Related papers (2024-12-27T16:21:58Z)
- Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch [54.12139707822201]
We propose ScaleQuest, a novel, scalable, and cost-effective data synthesis method. By generating diverse questions from scratch, we produce a dataset of 1 million problem-solution pairs. Our experiments demonstrate that models trained on our data outperform existing open-source datasets.
arXiv Detail & Related papers (2024-10-24T12:42:04Z)
- Harnessing the Power of Large Language Model for Uncertainty Aware Graph Processing [24.685942503019948]
We introduce a novel approach that harnesses the power of a large language model (LLM) to provide a confidence score on the generated answer.
We experiment with our approach on two graph processing tasks: few-shot knowledge graph completion and graph classification.
Our confidence measure achieves an AUC of 0.8 or higher on seven out of the ten datasets in predicting the correctness of the answer generated by LLM.
arXiv Detail & Related papers (2024-03-31T07:38:39Z)
- Automatic Question-Answer Generation for Long-Tail Knowledge [65.11554185687258]
We propose an automatic approach to generate specialized QA datasets for tail entities.
We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets.
arXiv Detail & Related papers (2024-03-03T03:06:31Z)
- GenQ: Quantization in Low Data Regimes with Generative Synthetic Data [28.773641633757283]
We introduce GenQ, a novel approach employing an advanced Generative AI model to generate high-resolution synthetic data.
In case of limited data availability, the actual data is used to guide the synthetic data generation process.
Through rigorous experimentation, GenQ establishes new benchmarks in data-free and data-scarce quantization.
arXiv Detail & Related papers (2023-12-07T23:31:42Z)
- Exploring the Viability of Synthetic Query Generation for Relevance Prediction [18.77909480819682]
We conduct a study into how QGen approaches can be leveraged for nuanced relevance prediction.
We identify new shortcomings of existing QGen approaches -- including their inability to distinguish between different grades of relevance.
We introduce label-grained QGen models, which incorporate knowledge about the different grades of relevance.
arXiv Detail & Related papers (2023-05-19T18:03:36Z)