Related papers: Spatial Knowledge Graph-Guided Multimodal Synthesis

URL: http://arxiv.org/abs/2505.22633v2
Date: Sun, 02 Nov 2025 21:42:54 GMT
Title: Spatial Knowledge Graph-Guided Multimodal Synthesis
Authors: Yida Xue, Zhen Bi, Jinnan Yang, Jungang Lou, Kehai Chen, Min Zhang, Huajun Chen, Ningyu Zhang
Abstract summary: We introduce a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. In experiments, data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.
Score: 78.11669780958657
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing a Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially consistent images while an MLLM produces corresponding textual descriptions. The automated construction of SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
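As a rough illustration of the knowledge-to-data idea described in the abstract, the sketch below stores a spatial knowledge graph as (subject, relation, object) triples and renders it as a text description that could, in principle, guide an image generator or serve as a caption. This is not the SKG2DATA implementation: the `SpatialTriple` class, the relation vocabulary, and the `to_caption` helper are all hypothetical names chosen for this example; the abstract states only that the graph captures directional and distance relationships.

```python
from dataclasses import dataclass

# Hypothetical relation vocabulary (an assumption for this sketch).
# The abstract only says the graph covers direction and distance.
DIRECTIONS = {"to the left of", "to the right of", "above", "below"}
DISTANCES = {"near", "far from"}


@dataclass(frozen=True)
class SpatialTriple:
    """One edge of a toy spatial knowledge graph."""
    subject: str
    relation: str
    obj: str

    def __post_init__(self):
        # Reject relations outside the illustrative vocabulary.
        if self.relation not in DIRECTIONS | DISTANCES:
            raise ValueError(f"unknown spatial relation: {self.relation!r}")


def to_caption(triples):
    """Render triples as a plain-text scene description (illustrative format)."""
    return " ".join(f"The {t.subject} is {t.relation} the {t.obj}." for t in triples)


skg = [
    SpatialTriple("cup", "to the left of", "laptop"),
    SpatialTriple("lamp", "near", "laptop"),
]
caption = to_caption(skg)
print(caption)
```

Keeping the graph as explicit triples means the same structure can be checked for consistency before any image or text is generated, which is the general appeal of guiding synthesis with structured knowledge rather than free-form prompts.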

Related papers

Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis [8.60591720958037]
Vision-Language Models (VLMs) are scalable but structurally rigid, while manual annotation is linguistically diverse but unscalable. We introduce SP-RITE, a novel framework that overcomes this dilemma leveraging simulators and large models. We have curated a dataset encompassing 3 simulators, 11k+ scenes, and 300k+ image/video instruction-tuning pairs. We demonstrate that a VLM trained on our data achieves significant performance gains on multiple spatial benchmarks.
arXiv Detail & Related papers (2025-12-18T06:30:08Z)
SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding [64.86119288520419]
Multimodal language models struggle with spatial reasoning across time and space. We present SIMS-V, a systematic data-generation framework that leverages the privileged information of 3D simulators. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
arXiv Detail & Related papers (2025-11-06T18:53:31Z)
Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities. Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark. We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z)
UrbanInsight: A Distributed Edge Computing Framework with LLM-Powered Data Filtering for Smart City Digital Twins [5.477477311297089]
Cities generate enormous streams of data from sensors, cameras, and connected infrastructure. Most existing systems struggle with scale, latency, and fragmented insights. This work introduces a framework that blends physics-informed machine learning, multimodal data fusion, and knowledge graph representation.
arXiv Detail & Related papers (2025-08-31T17:10:31Z)
Can LLMs Learn to Map the World from Local Descriptions? [50.490593949836146]
This study investigates whether Large Language Models (LLMs) can construct coherent global spatial cognition. Experiments conducted in a simulated urban environment demonstrate that LLMs exhibit latent representations aligned with real-world spatial distributions.
arXiv Detail & Related papers (2025-05-27T08:22:58Z)
OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence [51.0456395687016]
Multimodal large language models (LLMs) have opened new frontiers in artificial intelligence. We propose an MLLM (OmniGeo) tailored to geospatial applications. By combining the strengths of natural language understanding and spatial reasoning, our model enhances the ability of instruction following and the accuracy of GeoAI systems.
arXiv Detail & Related papers (2025-03-20T16:45:48Z)
Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data [14.104497777255137]
We introduce Low-rank Efficient Spatial-Spectral Vision Transformer with three key innovations. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method achieves competitive performance against state-of-the-art multi-modal geospatial foundation models.
arXiv Detail & Related papers (2025-03-17T05:42:19Z)
EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks [24.41705039390567]
EmbodiedVSR (Embodied Visual Spatial Reasoning) is a novel framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning. Our method enables zero-shot spatial reasoning without task-specific fine-tuning. Experiments demonstrate that our framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence.
arXiv Detail & Related papers (2025-03-14T05:06:07Z)
Conservation-informed Graph Learning for Spatiotemporal Dynamics Prediction [84.26340606752763]
In this paper, we introduce the conservation-informed GNN (CiGNN), an end-to-end explainable learning framework. The network is designed to conform to the general symmetry conservation law via symmetry where conservative and non-conservative information passes over a multiscale space by a latent temporal marching strategy. Results demonstrate that CiGNN exhibits remarkable baseline accuracy and generalizability, and is readily applicable to learning for prediction of various spatiotemporal dynamics.
arXiv Detail & Related papers (2024-12-30T13:55:59Z)
Spherinator and HiPSter: Representation Learning for Unbiased Knowledge Discovery from Simulations [0.0]
We describe a new, unbiased, and machine learning based approach to obtain useful scientific insights from a broad range of simulations. Our concept is based on applying nonlinear dimensionality reduction to learn compact representations of the data in a low-dimensional space. We present a prototype using a rotational invariant hyperspherical variational convolutional autoencoder, utilizing a power distribution in the latent space, and trained on galaxies from IllustrisTNG simulation.
arXiv Detail & Related papers (2024-06-06T07:34:58Z)
Deep Learning for Spatiotemporal Big Data: A Vision on Opportunities and Challenges [4.497634148674422]
Spatiotemporal big data can foster new opportunities to solve problems that have not been possible before. The distinctive characteristics of big data pose new challenges for deep learning technologies.
arXiv Detail & Related papers (2023-10-30T19:12:51Z)
Approach to Data Science with Multiscale Information Theory [0.0]
Data Science is a multidisciplinary field that plays a crucial role in extracting valuable insights from large and intricate datasets. Within the realm of Data Science, two fundamental components are Information Theory (IT) and Statistical Mechanics (SM). In this paper, we apply this data science framework to a large and intricate mechanical system composed of particles.
arXiv Detail & Related papers (2023-05-23T01:08:50Z)
Semantic Segmentation of Vegetation in Remote Sensing Imagery Using Deep Learning [77.34726150561087]
We propose an approach for creating a multi-modal and large-temporal dataset comprised of publicly available Remote Sensing data. We use Convolutional Neural Networks (CNN) models that are capable of separating different classes of vegetation.
arXiv Detail & Related papers (2022-09-28T18:51:59Z)
Dominant motion identification of multi-particle system using deep learning from video [0.0]
In this work, we provide a deep-learning framework that extracts relevant information from real-world videos of highly systems. We demonstrate this approach on videos of confined multi-agent/particle systems of ants, termites, and fishes. Furthermore, we explore how these seemingly diverse systems have predictable underlying behavior.
arXiv Detail & Related papers (2021-04-26T17:10:56Z)
Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods [96.92041573661407]
Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data. Many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available to the machine learning community. Generating synthetic data with privacy guarantees provides one such solution.
arXiv Detail & Related papers (2020-12-08T17:26:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.