MMKG-RDS: Reasoning Data Synthesis via Deep Mining of Multimodal Knowledge Graphs
- URL: http://arxiv.org/abs/2602.23632v1
- Date: Fri, 27 Feb 2026 03:10:51 GMT
- Title: MMKG-RDS: Reasoning Data Synthesis via Deep Mining of Multimodal Knowledge Graphs
- Authors: Lun Zhan, Feng Xiong, Huanyong Liu, Feng Zhang, Yuhui Yin,
- Abstract summary: We propose MMKG-RDS, a flexible framework for reasoning data synthesis.<n>It supports fine-grained knowledge extraction, customizable path sampling, and multidimensional data quality scoring.<n>We validate MMKG-RDS with the MMKG-RDS-Bench dataset, covering five domains, 17 task types, and 14,950 samples.
- Score: 9.723204200825583
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Synthesizing high-quality training data is crucial for enhancing domain models' reasoning abilities. Existing methods face limitations in long-tail knowledge coverage, effectiveness verification, and interpretability. Knowledge-graph-based approaches still fall short in functionality, granularity, customizability, and evaluation. To address these issues, we propose MMKG-RDS, a flexible framework for reasoning data synthesis that leverages multimodal knowledge graphs. It supports fine-grained knowledge extraction, customizable path sampling, and multidimensional data quality scoring. We validate MMKG-RDS with the MMKG-RDS-Bench dataset, covering five domains, 17 task types, and 14,950 samples. Experimental results show fine-tuning Qwen3 models (0.6B/8B/32B) on a small number of synthesized samples improves reasoning accuracy by 9.2%. The framework also generates distinct data, challenging existing models on tasks involving tables and formulas, useful for complex benchmark construction. The dataset and code are available at https://github.com/360AILAB-NLP/MMKG-RDS
Related papers
- Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs [3.222543736797976]
SynthKGQA is a framework for generating high-quality synthetic Knowledge Graph Question Answering datasets from any Knowledge Graph.<n>We show how, in addition to enabling more informative benchmarking of KG retrievers, the data produced with SynthKGQA also allows us to train better models.
arXiv Detail & Related papers (2025-11-06T15:45:18Z) - Scaling Generalist Data-Analytic Agents [95.05161133349242]
DataMind is a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents.<n>DataMind tackles three key challenges in building open-source data-analytic agents.
arXiv Detail & Related papers (2025-09-29T17:23:08Z) - FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering [15.594694018908188]
Multimodal multihop question answering (MMQA) requires reasoning over images and text from multiple sources.<n>Existing methods focus on single-hop, single-modality, or short texts.<n>We introduce FM2DS, the first framework for creating a high-quality dataset for MMQA.
arXiv Detail & Related papers (2024-12-09T22:35:44Z) - AIDE: Attribute-Guided MultI-Hop Data Expansion for Data Scarcity in Task-Specific Fine-tuning [15.916595953695603]
Fine-tuning large language models (LLMs) for specific tasks requires diverse, high-quality training data.<n>Existing data synthesis methods either depend on extensive seed datasets or struggle to balance task relevance and data diversity.<n>We propose Attribute-guided multI-hop Data Expansion (AIDE), a novel data synthesis framework that uses a multi-hop process to expand very few seed data points.
arXiv Detail & Related papers (2024-12-09T01:39:16Z) - MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks.<n>Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales.<n>We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z) - Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency [59.6772484292295]
Knowledge graphs (KGs) generated by large language models (LLMs) are increasingly valuable for Retrieval-Augmented Generation (RAG) applications.
Existing KG extraction methods rely on prompt-based approaches, which are inefficient for processing large-scale corpora.
We propose SynthKG, a multi-step, document-level synthesis KG workflow based on LLMs.
We also design a novel graph-based retrieval framework for RAG.
arXiv Detail & Related papers (2024-10-22T00:47:54Z) - LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models [55.903148392998965]
We introduce LOKI, a novel benchmark designed to evaluate the ability of LMMs to detect synthetic data across multiple modalities.<n>The benchmark includes coarse-grained judgment and multiple-choice questions, as well as fine-grained anomaly selection and explanation tasks.<n>We evaluate 22 open-source LMMs and 6 closed-source models on LOKI, highlighting their potential as synthetic data detectors and also revealing some limitations in the development of LMM capabilities.
arXiv Detail & Related papers (2024-10-13T05:26:36Z) - What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Long language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios.<n>Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement.<n>We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent.<n>Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z) - Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning [110.80663974060624]
Key-Point-Driven Data Synthesis (KPDDS) is a novel data synthesis framework that synthesizes question-answer pairs.
KPDDS ensures the generation of novel questions with rigorous quality control and substantial scalability.
We present KPMath, an extensive synthetic dataset tailored for mathematical reasoning, comprising over 800K question-answer pairs.
arXiv Detail & Related papers (2024-03-04T18:58:30Z) - Integrating Transformer and Autoencoder Techniques with Spectral Graph
Algorithms for the Prediction of Scarcely Labeled Molecular Data [2.8360662552057323]
This work introduces three graph-based models incorporating Merriman-Bence-Osher (MBO) techniques to tackle this challenge.
Specifically, graph-based modifications of the MBO scheme is integrated with state-of-the-art techniques, including a home-made transformer and an autoencoder.
The proposed models are validated using five benchmark data sets.
arXiv Detail & Related papers (2022-11-12T22:45:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.