Text2VR: Automated Instruction Generation in Virtual Reality Using Large Language Models for Assembly Tasks
- URL: http://arxiv.org/abs/2508.03699v1
- Date: Sat, 19 Jul 2025 07:37:48 GMT
- Title: Text2VR: Automated Instruction Generation in Virtual Reality Using Large Language Models for Assembly Tasks
- Authors: Subin Raj Peter
- Abstract summary: This paper proposes a novel approach that leverages Large Language Models (LLMs) to automate the generation of virtual instructions from textual input. The system comprises two core components: an LLM module that extracts task-relevant information from the text, and an intelligent module that transforms this information into animated demonstrations and visual cues within a VR environment. This approach enhances training effectiveness and reduces development overhead, making VR-based training more scalable and adaptable to evolving industrial needs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Virtual Reality (VR) has emerged as a powerful tool for workforce training, offering immersive, interactive, and risk-free environments that enhance skill acquisition, decision-making, and confidence. Despite these advantages, developing VR training applications remains a significant challenge because of the time, expertise, and resources required to create accurate and engaging instructional content. To address these limitations, this paper proposes a novel approach that leverages Large Language Models (LLMs) to automate the generation of virtual instructions from textual input. The system comprises two core components: an LLM module that extracts task-relevant information from the text, and an intelligent module that transforms this information into animated demonstrations and visual cues within a VR environment. The intelligent module interprets the information extracted by the LLM module, and an instruction generator then assembles the training content from relevant data in a database, conveying each instruction by changing the color of virtual objects and creating animations that illustrate the task. This approach enhances training effectiveness and reduces development overhead, making VR-based training more scalable and adaptable to evolving industrial needs.
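The abstract describes a two-stage pipeline: an LLM module that extracts task-relevant information from free-form text, and an instruction generator that turns that information into VR cues such as object highlighting and animations. The sketch below is a minimal illustration of that flow in Python, under assumed conventions; the prompt, the AssemblyStep record, and the set_color/play_animation command names are illustrative assumptions, not the paper's actual implementation or API.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical structured record the LLM module is assumed to return per step.
@dataclass
class AssemblyStep:
    target_object: str     # virtual object involved in the step
    action: str            # e.g. "attach", "align", "tighten"
    reference_object: str  # object the target is assembled onto, if any

# Assumed extraction prompt asking the LLM for structured JSON output.
EXTRACTION_PROMPT = (
    "Extract the assembly steps from the following instructions. "
    "Return a JSON list of objects with keys "
    "'target_object', 'action', and 'reference_object'.\n\n{text}"
)

def extract_steps(text: str, llm: Callable[[str], str]) -> List[AssemblyStep]:
    """LLM module: turn free-form instructions into task-relevant records."""
    raw = llm(EXTRACTION_PROMPT.format(text=text))
    return [AssemblyStep(**item) for item in json.loads(raw)]

def generate_vr_instructions(steps: List[AssemblyStep]) -> List[dict]:
    """Instruction generator: map each step to engine-agnostic VR cues
    (a color highlight plus an animation), as described in the abstract."""
    commands = []
    for i, step in enumerate(steps, start=1):
        commands.append({"step": i, "command": "set_color",
                         "object": step.target_object, "color": "yellow"})
        commands.append({"step": i, "command": "play_animation",
                         "object": step.target_object,
                         "animation": f"{step.action}_to_{step.reference_object}"})
    return commands

if __name__ == "__main__":
    # Stand-in for a real LLM call so the sketch runs offline.
    def fake_llm(prompt: str) -> str:
        return json.dumps([{"target_object": "bolt_A", "action": "tighten",
                            "reference_object": "bracket_1"}])

    steps = extract_steps("Tighten bolt A onto bracket 1.", fake_llm)
    for cmd in generate_vr_instructions(steps):
        print(cmd)
```

In the described system, commands of this kind would be consumed by the intelligent module inside the VR environment, which looks up the corresponding virtual objects and assets in its database and plays the highlights and animations for the trainee.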
Related papers
- VirtualEnv: A Platform for Embodied AI Research [26.527818430035534]
We present VirtualEnv, a next-generation simulation platform built on Unreal Engine 5. It enables fine-grained benchmarking of large language models (LLMs) in embodied and interactive scenarios. We provide a user-friendly API built on top of Unreal Engine, allowing researchers to deploy and control LLM-driven agents.
arXiv Detail & Related papers (2026-01-12T14:04:38Z)
- Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval [53.54695034420311]
In practice, videos are typically untrimmed in long durations with much more complicated background content. We propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model. Experiment results demonstrate that our proposed model achieves state-of-the-art performance on TVR, ActivityNet, and Charades-STA datasets.
arXiv Detail & Related papers (2025-10-14T08:38:20Z)
- BLAZER: Bootstrapping LLM-based Manipulation Agents with Zero-Shot Data Generation [59.70634559248202]
BLAZER is a framework that learns manipulation policies from automatically generated training data. We show BLAZER to significantly improve zero-shot manipulation in both simulated and real environments. Our code and data will be made publicly available on the project page.
arXiv Detail & Related papers (2025-10-09T17:59:58Z)
- Mano Technical Report [29.551514304095296]
Mano is a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld.
arXiv Detail & Related papers (2025-09-22T03:13:58Z)
- VisuCraft: Enhancing Large Vision-Language Models for Complex Visual-Guided Creative Content Generation via Structured Information Extraction [1.8880253210887832]
VisuCraft is a novel framework designed to enhance the capabilities of Large Vision-Language Models (LVLMs) in complex visual-guided creative content generation. Our results demonstrate remarkable improvements, particularly in creativity and instruction adherence, validating VisuCraft's effectiveness in producing imaginative, visually grounded, and user-aligned long-form creative text.
arXiv Detail & Related papers (2025-08-04T20:36:55Z)
- LuciBot: Automated Robot Policy Learning from Generated Videos [45.04449337744593]
Large language models (LLMs) or vision-language models (VLMs) are largely limited to simple tasks with well-defined rewards, such as pick-and-place. We leverage the imagination capability of general-purpose video generation models to generate training supervision for embodied tasks. Our approach significantly improves supervision quality for complex embodied tasks, enabling large-scale training in simulators.
arXiv Detail & Related papers (2025-03-12T22:07:36Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications [2.5022287664959446]
This study introduces a pioneering approach utilizing Visual Language Models within VR environments to enhance user interaction and task efficiency.
Our system facilitates real-time, intuitive user interactions through natural language processing, without relying on visual text instructions.
arXiv Detail & Related papers (2024-05-19T12:56:00Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)