Text2VR: Automated Instruction Generation in Virtual Reality Using Large Language Models for Assembly Tasks
- URL: http://arxiv.org/abs/2508.03699v1
- Date: Sat, 19 Jul 2025 07:37:48 GMT
- Title: Text2VR: Automated Instruction Generation in Virtual Reality Using Large Language Models for Assembly Tasks
- Authors: Subin Raj Peter
- Abstract summary: This paper proposes a novel approach that leverages Large Language Models (LLMs) to automate the generation of virtual instructions from textual input. The system comprises two core components: an LLM module that extracts task-relevant information from the text, and an intelligent module that transforms this information into animated demonstrations and visual cues within a VR environment. This approach enhances training effectiveness and reduces development overhead, making VR-based training more scalable and adaptable to evolving industrial needs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Virtual Reality (VR) has emerged as a powerful tool for workforce training, offering immersive, interactive, and risk-free environments that enhance skill acquisition, decision-making, and confidence. Despite these advantages, developing VR training applications remains a significant challenge because of the time, expertise, and resources required to create accurate and engaging instructional content. To address these limitations, this paper proposes a novel approach that leverages Large Language Models (LLMs) to automate the generation of virtual instructions from textual input. The system comprises two core components: an LLM module that extracts task-relevant information from the text, and an intelligent module that transforms this information into animated demonstrations and visual cues within a VR environment. The intelligent module interprets the information extracted by the LLM module, and an instruction generator then assembles the training content from relevant data in a database, conveying each instruction by changing the color of virtual objects and creating animations that illustrate the task. This approach enhances training effectiveness and reduces development overhead, making VR-based training more scalable and adaptable to evolving industrial needs.
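The abstract describes a two-stage pipeline: an LLM module that extracts task-relevant information from free-form text, and an instruction generator that turns that information into VR cues such as object highlighting and animations. The sketch below is a minimal illustration of that flow in Python, under assumed conventions; the prompt, the AssemblyStep record, and the set_color/play_animation command names are illustrative assumptions, not the paper's actual implementation or API.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical structured record the LLM module is assumed to return per step.
@dataclass
class AssemblyStep:
    target_object: str     # virtual object involved in the step
    action: str            # e.g. "attach", "align", "tighten"
    reference_object: str  # object the target is assembled onto, if any

# Assumed extraction prompt asking the LLM for structured JSON output.
EXTRACTION_PROMPT = (
    "Extract the assembly steps from the following instructions. "
    "Return a JSON list of objects with keys "
    "'target_object', 'action', and 'reference_object'.\n\n{text}"
)

def extract_steps(text: str, llm: Callable[[str], str]) -> List[AssemblyStep]:
    """LLM module: turn free-form instructions into task-relevant records."""
    raw = llm(EXTRACTION_PROMPT.format(text=text))
    return [AssemblyStep(**item) for item in json.loads(raw)]

def generate_vr_instructions(steps: List[AssemblyStep]) -> List[dict]:
    """Instruction generator: map each step to engine-agnostic VR cues
    (a color highlight plus an animation), as described in the abstract."""
    commands = []
    for i, step in enumerate(steps, start=1):
        commands.append({"step": i, "command": "set_color",
                         "object": step.target_object, "color": "yellow"})
        commands.append({"step": i, "command": "play_animation",
                         "object": step.target_object,
                         "animation": f"{step.action}_to_{step.reference_object}"})
    return commands

if __name__ == "__main__":
    # Stand-in for a real LLM call so the sketch runs offline.
    def fake_llm(prompt: str) -> str:
        return json.dumps([{"target_object": "bolt_A", "action": "tighten",
                            "reference_object": "bracket_1"}])

    steps = extract_steps("Tighten bolt A onto bracket 1.", fake_llm)
    for cmd in generate_vr_instructions(steps):
        print(cmd)
```

In the described system, commands of this kind would be consumed by the intelligent module inside the VR environment, which looks up the corresponding virtual objects and assets in its database and plays the highlights and animations for the trainee.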
Related papers
- VirtualEnv: A Platform for Embodied AI Research [26.527818430035534]
We present VirtualEnv, a next-generation simulation platform built on Unreal Engine 5. It enables fine-grained benchmarking of large language models (LLMs) in embodied and interactive scenarios. We provide a user-friendly API built on top of Unreal Engine, allowing researchers to deploy and control LLM-driven agents.
arXiv Detail & Related papers (2026-01-12T14:04:38Z)
- Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval [53.54695034420311]
In practice, videos are typically untrimmed in long durations with much more complicated background content. We propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model. Experiment results demonstrate that our proposed model achieves state-of-the-art performance on TVR, ActivityNet, and Charades-STA datasets.
arXiv Detail & Related papers (2025-10-14T08:38:20Z)
- BLAZER: Bootstrapping LLM-based Manipulation Agents with Zero-Shot Data Generation [59.70634559248202]
BLAZER is a framework that learns manipulation policies from automatically generated training data. We show BLAZER to significantly improve zero-shot manipulation in both simulated and real environments. Our code and data will be made publicly available on the project page.
arXiv Detail & Related papers (2025-10-09T17:59:58Z)
- Mano Technical Report [29.551514304095296]
Mano is a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld.
arXiv Detail & Related papers (2025-09-22T03:13:58Z)
- VisuCraft: Enhancing Large Vision-Language Models for Complex Visual-Guided Creative Content Generation via Structured Information Extraction [1.8880253210887832]
VisuCraft is a novel framework designed to enhance the capabilities of Large Vision-Language Models (LVLMs) in complex visual-guided creative content generation. Our results demonstrate remarkable improvements, particularly in creativity and instruction adherence, validating VisuCraft's effectiveness in producing imaginative, visually grounded, and user-aligned long-form creative text.
arXiv Detail & Related papers (2025-08-04T20:36:55Z)
- LuciBot: Automated Robot Policy Learning from Generated Videos [45.04449337744593]
Large language models (LLMs) or vision-language models (VLMs) are largely limited to simple tasks with well-defined rewards, such as pick-and-place. We leverage the imagination capability of general-purpose video generation models to generate training supervision for embodied tasks. Our approach significantly improves supervision quality for complex embodied tasks, enabling large-scale training in simulators.
arXiv Detail & Related papers (2025-03-12T22:07:36Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications [2.5022287664959446]
This study introduces a pioneering approach utilizing Visual Language Models within VR environments to enhance user interaction and task efficiency.
Our system facilitates real-time, intuitive user interactions through natural language processing, without relying on visual text instructions.
arXiv Detail & Related papers (2024-05-19T12:56:00Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)