Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems
- URL: http://arxiv.org/abs/2512.20387v2
- Date: Fri, 26 Dec 2025 07:42:26 GMT
- Title: Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems
- Authors: YuChe Hsu, AnJui Wang, TsaiChing Ni, YuanFu Yang
- Abstract summary: We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript. To support this new paradigm, the study constructs the first large-scale dataset for generative digital twins.
- Score: 1.0499611180329804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support this new paradigm, the study constructs the first large-scale dataset for generative digital twins, comprising over 120,000 prompt-sketch-code triplets that enable multimodal learning between textual descriptions, spatial structures, and simulation logic. In parallel, three novel evaluation metrics, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), are proposed specifically for this task to comprehensively evaluate structural integrity, parameter fidelity, and simulator executability. Through systematic ablation across vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness. This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.
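The three metrics suggest a straightforward evaluation harness. The Python sketch below is a minimal illustration, assuming SVR counts generated scripts that pass a syntax check, PMR counts expected parameter values recovered from the generated code, and ESR counts scripts that run in the simulator without error; the checker callables are hypothetical stand-ins, and the paper's exact definitions may differ.

```python
# Hypothetical evaluation harness for SVR / PMR / ESR.
# `parses_ok`, `extract_params`, and `runs_in_flexsim` are assumed
# stand-ins for the paper's actual checks, which may differ.
from typing import Callable

def evaluate(samples: list[dict],
             parses_ok: Callable[[str], bool],
             extract_params: Callable[[str], dict],
             runs_in_flexsim: Callable[[str], bool]) -> dict:
    svr = pmr_hits = pmr_total = esr = 0
    for s in samples:                      # each: {"code": str, "params": dict}
        code = s["code"]
        if parses_ok(code):                # structural validity
            svr += 1
        found = extract_params(code)       # parameter fidelity
        for k, v in s["params"].items():
            pmr_total += 1
            pmr_hits += int(found.get(k) == v)
        if runs_in_flexsim(code):          # simulator executability
            esr += 1
    n = max(len(samples), 1)
    return {"SVR": svr / n,
            "PMR": pmr_hits / max(pmr_total, 1),
            "ESR": esr / n}
```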
Related papers
- Beyond Language Modeling: An Exploration of Multimodal Pretraining [125.34714978184638]
We provide empirical clarity through controlled, from-scratch pretraining experiments. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language.
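As a rough sketch of the Transfusion-style recipe summarized above (next-token prediction for language, diffusion for vision), the PyTorch snippet below combines a cross-entropy language loss with a DDPM-style noise-prediction loss; the function name, tensor shapes, and the weighting `lam` are assumptions, not the paper's implementation.

```python
import torch.nn.functional as F

# Hypothetical sketch of a Transfusion-style joint objective:
# next-token prediction for language, denoising diffusion for vision.
def joint_loss(text_logits, text_targets, eps_pred, eps_true, lam=1.0):
    # Language: standard next-token cross-entropy over flattened tokens.
    lm = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                         text_targets.reshape(-1))
    # Vision: predict the noise added to image latents (DDPM-style MSE).
    diff = F.mse_loss(eps_pred, eps_true)
    return lm + lam * diff  # lam balances the two modalities (assumed)
```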
arXiv Detail & Related papers (2026-03-03T18:58:00Z) - URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model [76.08429266631823]
We propose an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches.
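Whatever URDF-Anything predicts internally, its output must eventually be serialized into URDF's XML schema. The sketch below shows only that last step with a hypothetical prediction dict; element and attribute names follow the public URDF format, everything else is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical serialization step: turn predicted kinematic parameters
# into a URDF <joint> element. Field names mirror the URDF spec; the
# prediction dict and effort/velocity limits are invented examples.
def joint_to_urdf(pred: dict) -> ET.Element:
    j = ET.Element("joint", name=pred["name"], type=pred["type"])  # e.g. "revolute"
    ET.SubElement(j, "parent", link=pred["parent"])
    ET.SubElement(j, "child", link=pred["child"])
    ET.SubElement(j, "axis", xyz=" ".join(map(str, pred["axis"])))
    ET.SubElement(j, "limit", lower=str(pred["lower"]), upper=str(pred["upper"]),
                  effort="10", velocity="1")
    return j
```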
arXiv Detail & Related papers (2025-11-02T13:45:51Z) - Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents [99.62178668680578]
We propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images. To capture complex cross-modal relationships in web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments.
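The snippet-level objective described above resembles a standard InfoNCE loss in which embeddings of consecutive rendered snippets act as positive pairs. The sketch below is one plausible formulation, not the paper's code; the temperature `tau` and in-batch negatives scheme are assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical snippet-level contrastive objective: embeddings of
# consecutive rendered snippets (z[i], z_next[i]) are positives,
# all other pairs in the batch are negatives (InfoNCE).
def snippet_contrastive_loss(z: torch.Tensor, z_next: torch.Tensor,
                             tau: float = 0.07) -> torch.Tensor:
    z = F.normalize(z, dim=-1)
    z_next = F.normalize(z_next, dim=-1)
    logits = z @ z_next.t() / tau                    # (B, B) similarities
    targets = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, targets)          # diagonal = positives
```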
arXiv Detail & Related papers (2025-10-21T14:59:29Z) - Vision Language Action Models in Robotic Manipulation: A Systematic Review [1.1767330101986737]
Vision Language Action (VLA) models represent a transformative shift in robotics. This review presents a comprehensive and forward-looking synthesis of the VLA paradigm. We analyze 102 VLA models, 26 foundational datasets, and 12 simulation platforms.
arXiv Detail & Related papers (2025-07-14T18:00:34Z) - A Neurosymbolic Agent System for Compositional Visual Reasoning [31.649454833851863]
Existing vision-language models (VLMs) remain challenged by compositional visual reasoning. This paper presents a neuro-symbolic approach to developing a Vision-Language Agent system for efficient compositional visual reasoning.
arXiv Detail & Related papers (2025-06-09T13:55:55Z) - Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [89.77871049500546]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
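A structured prompting strategy of this kind can be pictured as a fixed decomposition template. The sketch below is an invented approximation of such a template, not SpatialMind's actual wording.

```python
# Hypothetical structured prompt in the spirit of SpatialMind: decompose
# a spatial question into explicit reasoning steps before answering.
# The step wording is an assumption, not the paper's template.
def spatial_prompt(question: str) -> str:
    return (
        "Answer the spatial question by reasoning step by step.\n"
        "Step 1: List the objects mentioned and visible in the scene.\n"
        "Step 2: Estimate each object's position and extent.\n"
        "Step 3: Derive the spatial relation the question asks about.\n"
        "Step 4: State the final answer.\n"
        f"Question: {question}"
    )
```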
arXiv Detail & Related papers (2025-06-04T07:36:33Z) - OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding. Existing solutions often rely on task-specific architectures and objectives for individual tasks. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z) - LLM experiments with simulation: Large Language Model Multi-Agent System for Simulation Model Parametrization in Digital Twins [4.773175285216063]
This paper presents a novel framework that applies large language models (LLMs) to automate the parametrization of simulation models in digital twins.
The proposed approach enhances the usability of simulation models by infusing them with knowledge from LLMs.
The system has the potential to increase user-friendliness and reduce the cognitive load on human users.
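A minimal version of this parametrization loop can be pictured as: describe the process, ask the LLM for parameters as JSON, and push them into the model. In the sketch below, `call_llm` and `model.set` are invented stand-ins for whatever APIs the framework actually wraps.

```python
import json

# Hypothetical flow: ask an LLM for simulation parameters as JSON,
# then apply them to a simulation model in the digital twin.
PROMPT = (
    "Given this process description, return JSON with keys "
    "'arrival_rate', 'service_time', 'num_servers':\n{description}"
)

def parametrize(model, description: str, call_llm) -> None:
    raw = call_llm(PROMPT.format(description=description))
    params = json.loads(raw)          # assumes the LLM returns valid JSON
    for name, value in params.items():
        model.set(name, value)        # push each parameter into the twin
```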
arXiv Detail & Related papers (2024-05-28T11:59:40Z) - Generation of Asset Administration Shell with Large Language Model Agents: Toward Semantic Interoperability in Digital Twins in the Context of Industry 4.0 [0.6749750044497732]
This research introduces a novel approach for achieving semantic interoperability in digital twins.
It assists in creating the Asset Administration Shell (AAS) as a digital twin model within the context of Industry 4.0.
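For orientation, the snippet below shows a minimal AAS-style fragment shaped loosely after the AAS JSON metamodel (shell, submodel, properties); all identifiers and values are invented, and the shells the paper generates will be far richer.

```python
# Hypothetical minimal Asset Administration Shell fragment, shaped
# loosely after the AAS metamodel (shell -> submodel -> properties).
# Identifiers and values are invented for illustration.
aas = {
    "assetAdministrationShells": [{
        "id": "urn:example:aas:pump-01",
        "assetInformation": {"assetKind": "Instance"},
        "submodels": [{"keys": [{"type": "Submodel",
                                 "value": "urn:example:sm:nameplate"}]}],
    }],
    "submodels": [{
        "id": "urn:example:sm:nameplate",
        "idShort": "Nameplate",
        "submodelElements": [
            {"modelType": "Property", "idShort": "Manufacturer",
             "valueType": "xs:string", "value": "ExampleCorp"},
        ],
    }],
}
```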
arXiv Detail & Related papers (2024-03-25T21:37:30Z) - i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data [101.52821120195975]
i-Code V2 is the first model capable of generating natural language from any combination of Vision, Language, and Speech data.
The system is pretrained end-to-end on a large collection of dual- and single-modality datasets.
arXiv Detail & Related papers (2023-05-21T01:25:44Z) - From Natural Language to Simulations: Applying GPT-3 Codex to Automate Simulation Modeling of Logistics Systems [0.0]
This work is the first attempt to apply Natural Language Processing to automate the development of simulation models for systems vital to logistics.
We demonstrated that a framework built on top of fine-tuned GPT-3 Codex, a Transformer-based language model, could produce functionally valid simulations of queuing and inventory control systems given a verbal description.
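A queuing model of the kind such a framework generates can be as small as the sketch below: a single-server FIFO queue with exponential interarrival and service times. The parameters and structure are illustrative, not the paper's actual Codex output.

```python
import random

# Toy M/M/1 queue of the kind such a framework might generate from
# "customers arrive ~every 12 min, one server, service ~10 min".
def mm1(arrival_rate=5/60, service_rate=1/10, n_customers=10_000, seed=0):
    random.seed(seed)
    t = server_free = total_wait = 0.0
    for _ in range(n_customers):
        t += random.expovariate(arrival_rate)   # next arrival time
        start = max(t, server_free)             # wait if server is busy
        total_wait += start - t
        server_free = start + random.expovariate(service_rate)
    return total_wait / n_customers             # mean wait (minutes)

print(f"mean wait: {mm1():.1f} min")
```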
arXiv Detail & Related papers (2022-02-24T14:01:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.