InvertiTune: High-Quality Data Synthesis for Cost-Effective Single-Shot Text-to-Knowledge Graph Generation
- URL: http://arxiv.org/abs/2512.03197v1
- Date: Tue, 02 Dec 2025 19:51:28 GMT
- Title: InvertiTune: High-Quality Data Synthesis for Cost-Effective Single-Shot Text-to-Knowledge Graph Generation
- Authors: Faezeh Faez, Marzieh S. Tahaei, Yaochen Hu, Ali Pourranjbar, Mahdi Biparva, Mark Coates, Yingxue Zhang
- Abstract summary: InvertiTune is a framework that combines a controlled data generation pipeline with supervised fine-tuning. We show that InvertiTune outperforms larger non-fine-tuned LLMs as well as state-of-the-art Text2KG approaches.
- Score: 22.34499569359914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have revolutionized the ability to understand and generate text, enabling significant progress in automatic knowledge graph construction from text (Text2KG). Many Text2KG methods, however, rely on iterative LLM prompting, making them computationally expensive and prone to overlooking complex relations distributed throughout the text. To address these limitations, we propose InvertiTune, a framework that combines a controlled data generation pipeline with supervised fine-tuning (SFT). Within this framework, the data-generation pipeline systematically extracts subgraphs from large knowledge bases, applies noise filtering, and leverages LLMs to generate corresponding natural text descriptions, a task more aligned with LLM capabilities than direct KG generation from text. This pipeline enables generating datasets composed of longer texts paired with larger KGs that better reflect real-world scenarios compared to existing benchmarks, thus supporting effective SFT of lightweight models for single-shot KG construction. Experimental results on CE12k, a dataset generated using the introduced pipeline, show that InvertiTune outperforms larger non-fine-tuned LLMs as well as state-of-the-art Text2KG approaches, while also demonstrating stronger cross-dataset generalization on CrossEval-1200, a test set created from three established benchmark datasets and CE12k. These findings highlight the importance of realistic, high-quality training data for advancing efficient and high-performing Text2KG systems.
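The abstract outlines a three-step pipeline: subgraph extraction, noise filtering, and LLM verbalization, with the resulting (text, KG) pairs used for SFT. The sketch below illustrates that inversion; the expansion-based sampler, the filtering heuristic, and the `verbalize` callback are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable

Triple = tuple[str, str, str]  # (head, relation, tail)

def sample_subgraph(kb: list[Triple], seed: str, max_triples: int = 25) -> list[Triple]:
    """Expand outward from a seed entity to collect a connected subgraph."""
    frontier, chosen, seen = {seed}, [], set()
    while frontier and len(chosen) < max_triples:
        entity = frontier.pop()
        for h, r, t in kb:
            if h == entity and (h, r, t) not in seen:
                seen.add((h, r, t))
                chosen.append((h, r, t))
                frontier.add(t)
    return chosen[:max_triples]

def filter_noise(triples: list[Triple]) -> list[Triple]:
    """Illustrative filter: drop triples with empty or purely numeric labels."""
    return [tr for tr in triples
            if all(x.strip() and not x.strip().isdigit() for x in tr)]

def build_sft_pair(triples: list[Triple],
                   verbalize: Callable[[list[Triple]], str]) -> dict:
    """KG -> text via an LLM callback, then invert the pair for Text2KG SFT."""
    text = verbalize(triples)  # e.g. "Write a paragraph describing these facts: ..."
    gold_kg = "\n".join(f"({h}; {r}; {t})" for h, r, t in triples)
    return {"input": text, "output": gold_kg}
```

The appeal of the inversion is that the gold triples are fixed before the text is written, so each single-shot SFT target is exact by construction.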
Related papers
- LightKGG: Simple and Efficient Knowledge Graph Generation from Textual Data [0.0]
LightKGG is a novel framework that enables efficient KG extraction from textual data using small-scale language models. Context-integrated graph extraction integrates contextual information with nodes and edges into a unified graph structure. Topology-enhanced relationship inference leverages the inherent topology of the extracted graph to efficiently infer relationships.
arXiv Detail & Related papers (2025-10-27T13:55:13Z)
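A minimal sketch of one way to read "context-integrated graph extraction": nodes and co-occurrence edges carry the sentences they came from, so a later topology-aware step can infer relation labels. The data structures and the co-occurrence heuristic are assumptions for illustration, not LightKGG's code.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    contexts: list[str] = field(default_factory=list)  # sentences mentioning the entity

@dataclass
class Edge:
    head: str
    tail: str
    context: str        # sentence in which the pair co-occurs
    relation: str = ""  # to be filled by a topology-aware inference step

def extract_graph(sentences: list[str], entities: set[str]):
    """Attach each sentence to the entities and entity pairs it mentions."""
    nodes = {e: Node(e) for e in entities}
    edges = []
    for sent in sentences:
        present = [e for e in entities if e in sent]
        for e in present:
            nodes[e].contexts.append(sent)
        for i, h in enumerate(present):
            for t in present[i + 1:]:
                edges.append(Edge(h, t, sent))
    return nodes, edges
```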
- SyGra: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data [0.0]
We present a comprehensive synthetic data generation framework for large language models (LLMs). Our approach employs a modular and configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training workflows.
arXiv Detail & Related papers (2025-08-21T10:35:41Z)
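A toy sketch of a configuration-driven record schema that serves both SFT and DPO, as the entry describes; the config keys, field names, and quality-tag rule are invented for illustration.

```python
config = {
    "task": "qa_dialogue",
    "num_samples": 2,
    "quality_threshold": 0.7,
}

def make_record(prompt: str, chosen: str, rejected: str, quality: float) -> dict:
    return {
        "messages": [{"role": "user", "content": prompt},
                     {"role": "assistant", "content": chosen}],  # SFT view
        "chosen": chosen,        # DPO preferred response
        "rejected": rejected,    # DPO dispreferred response
        "quality_tag": "high" if quality >= config["quality_threshold"] else "low",
    }

records = [
    make_record("What is a knowledge graph?",
                "A graph of entities connected by typed relations.",
                "A kind of database.",
                quality=0.9)
    for _ in range(config["num_samples"])
]
```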
- ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning [57.767536707234036]
We propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-former is used to align the vision tokens to the pre-trained large language model Vicuna-7B, which outputs both the answer and the chain-of-thought (CoT) reasoning process simultaneously.
arXiv Detail & Related papers (2025-07-02T23:41:31Z)
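The entry names a concrete stack (EVA-CLIP features, a Q-former bridge, Vicuna-7B). The sketch below shows only the alignment step as a generic PyTorch module: learned query tokens cross-attend to frozen vision features and are projected into the LM embedding space. Dimensions and module choices are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Learned query tokens cross-attend to vision features and are
    projected into the language model's embedding space."""
    def __init__(self, vis_dim=768, lm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, vis_dim))
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, lm_dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, seq, vis_dim), e.g. EVA-CLIP features of event frames
        q = self.queries.expand(vision_tokens.size(0), -1, -1)
        aligned, _ = self.cross_attn(q, vision_tokens, vision_tokens)
        return self.proj(aligned)  # soft prompts to prepend to the LLM input

bridge = QFormerBridge()
soft_prompts = bridge(torch.randn(2, 196, 768))  # -> shape (2, 32, 4096)
```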
- BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modeling [51.830134409330704]
Time-series Generation (TSG) is a prominent research area with broad applications in simulations, data augmentation, and counterfactual analysis. We argue that text can provide semantic insights, domain information, and instance-specific temporal patterns to guide and improve TSG. We introduce BRIDGE, a hybrid text-controlled TSG framework that integrates semantic prototypes with text descriptions to support domain-level guidance.
arXiv Detail & Related papers (2025-03-04T09:40:00Z)
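One plausible reading of "semantic prototypes" guiding generation is retrieval by text similarity; the sketch below is purely structural, since the placeholder encoder returns random vectors and nothing here is the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(16, 64))  # bank of domain-level prototype vectors
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

def embed_text(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder; a real system would use a sentence-embedding model."""
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def retrieve_prototypes(text: str, top_k: int = 2) -> np.ndarray:
    scores = prototypes @ embed_text(text)  # cosine similarity
    return prototypes[np.argsort(scores)[-top_k:]]

condition = retrieve_prototypes("hourly electricity demand with a daily cycle")
# `condition` would be passed to the diffusion model alongside the text.
```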
- RAGulator: Lightweight Out-of-Context Detectors for Grounded Text Generation [3.2234520582876467]
We train lightweight models to discriminate LLM-generated text that is semantically out-of-context from retrieved text documents.
We find that DeBERTa is not only the best-performing model under this pipeline but also fast, requiring no additional text preprocessing or feature engineering.
arXiv Detail & Related papers (2024-11-06T13:51:42Z)
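The entry names DeBERTa as a fast sequence-pair discriminator. A minimal inference sketch with Hugging Face transformers follows; the base checkpoint stands in for the paper's fine-tuned weights, and the label order is an assumption.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Base checkpoint as a stand-in; the paper's fine-tuned weights are not assumed.
name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
model.eval()

generated = "The company was founded in 1987 in Oslo."
retrieved = "Acme Corp, established in Bergen in 1992, manufactures sensors."

inputs = tokenizer(generated, retrieved, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(-1)
print(f"P(out-of-context) = {probs[0, 1]:.3f}")  # assumed label order: 0=grounded, 1=OOC
```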
- Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation [51.20656279478878]
MATRIX is a multi-agent simulator that automatically generates diverse text-based scenarios. We introduce MATRIX-Gen for controllable and highly realistic data synthesis. On the AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs, outperforms Meta's Llama-3-8B-Instruct model.
arXiv Detail & Related papers (2024-10-18T08:01:39Z)
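A toy sketch of the simulation-then-distillation idea: role-conditioned agents produce a transcript that is reduced to an instruction-response pair. Roles, templates, and the distillation rule are invented for illustration.

```python
roles = ("customer", "support_agent")
templates = {
    "customer": "My order {oid} arrived damaged.",
    "support_agent": "I can help with order {oid}; could you share a photo?",
}

def simulate(order_id: str, turns: int = 2) -> list[dict]:
    """Role-conditioned agents take alternating turns in a scenario."""
    return [{"role": roles[i % 2],
             "content": templates[roles[i % 2]].format(oid=order_id)}
            for i in range(turns)]

transcript = simulate("A-1042")
pair = {"instruction": transcript[0]["content"],  # distill the transcript into an
        "response": transcript[1]["content"]}     # instruction-response pair
```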
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
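A sketch of the two steps the entry lists: map scalar ratings and pairwise preferences into one record format, then keep a high-quality subset. Field names and thresholds are assumptions, not the paper's schema.

```python
def from_rating(prompt: str, response: str, rating: int) -> dict:
    """Scalar 1-5 rating -> unified record with a normalized quality score."""
    return {"prompt": prompt, "chosen": response, "rejected": None,
            "quality": rating / 5.0}

def from_preference(prompt: str, winner: str, loser: str, margin: float) -> dict:
    """Pairwise preference -> unified record; the margin plays the quality role."""
    return {"prompt": prompt, "chosen": winner, "rejected": loser,
            "quality": margin}

unified = [
    from_rating("Explain SFT.", "Supervised fine-tuning trains on labeled pairs.", 5),
    from_preference("Explain RLHF.", "RLHF optimizes a learned reward model.",
                    "It is magic.", margin=0.9),
    from_rating("Explain DPO.", "idk", 1),
]
high_quality = [r for r in unified if r["quality"] >= 0.7]  # subset extraction step
```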
- Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction [36.915250638481986]
We introduce LiveSum, a new benchmark dataset for generating summary tables of competitions based on real-time commentary texts. We evaluate the performance of state-of-the-art Large Language Models on this task in both fine-tuning and zero-shot settings. We additionally propose a novel pipeline called $T^3$ (Text-Tuple-Table) to improve their performance.
arXiv Detail & Related papers (2024-04-22T14:31:28Z)
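A sketch of the tuple-as-intermediate idea: once (entity, attribute, value) tuples are extracted, the table is a simple pivot. The extraction step is stubbed with hand-written tuples; a real pipeline would prompt an LLM for them.

```python
from collections import defaultdict

tuples = [  # stand-in for LLM-extracted (entity, attribute, value) tuples
    ("Team A", "goals", 2), ("Team B", "goals", 1),
    ("Team A", "shots", 9), ("Team B", "shots", 4),
]

table = defaultdict(dict)
for entity, attribute, value in tuples:
    table[entity][attribute] = value  # pivot tuples into rows

columns = sorted({attr for row in table.values() for attr in row})
print("team".ljust(8), *(c.ljust(6) for c in columns))
for team, row in table.items():
    print(team.ljust(8), *(str(row.get(c, "-")).ljust(6) for c in columns))
```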
- Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution. It then undergoes finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
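A sketch of what a constraint-based objective can look like: a generation loss plus a penalized controllability violation. The paper describes a constraint-optimization learning objective; the fixed-multiplier penalty below is a simplification, not its actual scheme.

```python
import torch

def constrained_loss(gen_loss: torch.Tensor,
                     control_error: torch.Tensor,
                     tolerance: float = 0.05,
                     lam: float = 10.0) -> torch.Tensor:
    """Generation loss plus a penalty whenever the controllability
    constraint (control_error <= tolerance) is violated."""
    violation = torch.clamp(control_error - tolerance, min=0.0)
    return gen_loss + lam * violation

loss = constrained_loss(torch.tensor(1.30), torch.tensor(0.12))  # -> 1.30 + 10 * 0.07
```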
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
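A sketch of the contrastive step typically used when training embedding models on synthetic (query, positive) pairs: in-batch negatives with temperature-scaled cosine similarity. The linear "encoder" is a placeholder, and nothing here is specific to the paper's recipe.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, in_dim, out_dim = 8, 128, 64
encoder = torch.nn.Linear(in_dim, out_dim)  # placeholder for a text encoder

# Synthetic (query, positive) pairs would come from LLM-generated task data;
# random features stand in for tokenized text here.
queries = F.normalize(encoder(torch.randn(batch, in_dim)), dim=-1)
positives = F.normalize(encoder(torch.randn(batch, in_dim)), dim=-1)

logits = queries @ positives.T / 0.05   # temperature-scaled cosine similarity
labels = torch.arange(batch)            # i-th positive matches the i-th query
loss = F.cross_entropy(logits, labels)  # in-batch negatives (InfoNCE)
```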