Voice2Action: Language Models as Agent for Efficient Real-Time
Interaction in Virtual Reality
- URL: http://arxiv.org/abs/2310.00092v1
- Date: Fri, 29 Sep 2023 19:06:52 GMT
- Title: Voice2Action: Language Models as Agent for Efficient Real-Time
Interaction in Virtual Reality
- Authors: Yang Su
- Abstract summary: Large Language Models (LLMs) are trained to follow natural language instructions with only a handful of examples.
We propose Voice2Action, a framework that hierarchically analyzes customized voice signals and textual commands through action and entity extraction.
Experiment results in an urban engineering VR environment with synthetic instruction data show that Voice2Action can perform more efficiently and accurately than approaches without optimizations.
- Score: 1.160324357508053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are trained and aligned to follow natural
language instructions with only a handful of examples, and they are prompted as
task-driven autonomous agents to adapt to various sources of execution
environments. However, deploying agent LLMs in virtual reality (VR) has been
challenging due to the lack of efficiency in online interactions and the
complex manipulation categories in 3D environments. In this work, we propose
Voice2Action, a framework that hierarchically analyzes customized voice signals
and textual commands through action and entity extraction and divides the
execution tasks into canonical interaction subsets in real-time with error
prevention from environment feedback. Experiment results in an urban
engineering VR environment with synthetic instruction data show that
Voice2Action can perform more efficiently and accurately than approaches
without optimizations.
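To make the pipeline described in the abstract concrete, here is a minimal, hypothetical sketch in Python of hierarchical action and entity extraction followed by dispatch into a small set of canonical interactions, with environment feedback used for error prevention. The names (`llm_extract`, `CANONICAL_ACTIONS`, `Environment`) and the stubbed LLM call are illustrative assumptions, not the paper's actual implementation.

```python
import json
from dataclasses import dataclass

# Canonical interaction subsets: the abstract describes dividing execution
# tasks into a small set of primitive VR interactions (names are illustrative).
CANONICAL_ACTIONS = {"select", "move", "rotate", "scale", "delete"}

@dataclass
class Command:
    action: str
    entities: list[str]
    parameters: dict

def llm_extract(transcript: str) -> Command:
    """Hypothetical LLM call performing action and entity extraction.

    In practice this would prompt an instruction-following LLM with a few
    examples and parse its JSON output; here it is stubbed for illustration.
    """
    response = '{"action": "move", "entities": ["red car"], "parameters": {"dx": 2.0}}'
    data = json.loads(response)
    return Command(data["action"], data["entities"], data["parameters"])

class Environment:
    """Stand-in for the VR scene, queried for error prevention via feedback."""
    def __init__(self, objects: set[str]):
        self.objects = objects
    def exists(self, name: str) -> bool:
        return name in self.objects
    def execute(self, cmd: Command) -> str:
        return f"executed {cmd.action} on {cmd.entities} with {cmd.parameters}"

def dispatch(transcript: str, env: Environment) -> str:
    cmd = llm_extract(transcript)
    if cmd.action not in CANONICAL_ACTIONS:
        return f"rejected: unknown action '{cmd.action}'"
    missing = [e for e in cmd.entities if not env.exists(e)]
    if missing:  # environment feedback prevents execution errors
        return f"rejected: entities not found in scene: {missing}"
    return env.execute(cmd)

if __name__ == "__main__":
    env = Environment({"red car", "traffic light"})
    print(dispatch("move the red car two meters forward", env))
```

Restricting dispatch to a small canonical action set plausibly keeps the extraction output space, and hence per-command latency, small enough for real-time VR use, mirroring the framing in the abstract; the code above is only a sketch of that idea.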
Related papers
- 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning [2.6670748466660523]
Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks.
However, VLMs lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations.
We propose a novel framework that integrates a 2D prompt synthesis module by mapping 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs.
arXiv Detail & Related papers (2025-02-13T02:40:19Z)
- SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions [48.02083833667388]
We present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions.
We employ low-rank adaptation (LoRA) modules for parameter-efficient training of both the audio encoder and the Large Language Model (a minimal LoRA sketch follows this list).
arXiv Detail & Related papers (2025-01-31T18:30:36Z)
- VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications [2.5022287664959446]
This study introduces a pioneering approach utilizing Visual Language Models within VR environments to enhance user interaction and task efficiency.
Our system facilitates real-time, intuitive user interactions through natural language processing, without relying on visual text instructions.
arXiv Detail & Related papers (2024-05-19T12:56:00Z)
- LEGENT: Open Platform for Embodied Agents [60.71847900126832]
We introduce LEGENT, an open, scalable platform for developing embodied agents using Large Language Models (LLMs) and Large Multimodal Models (LMMs).
LEGENT offers a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface.
In experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks.
arXiv Detail & Related papers (2024-04-28T16:50:12Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation [73.78984332354636]
CorNav is a novel zero-shot framework for vision-and-language navigation.
It incorporates environmental feedback for refining future plans and adjusting its actions.
It consistently outperforms all baselines in a zero-shot multi-task setting.
arXiv Detail & Related papers (2023-06-17T11:44:04Z)
- Chat with the Environment: Interactive Multimodal Perception Using Large Language Models [19.623070762485494]
Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning.
Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
arXiv Detail & Related papers (2023-03-14T23:01:27Z)
- VIRT: Improving Representation-based Models for Text Matching through Virtual Interaction [50.986371459817256]
We propose a novel Virtual InteRacTion mechanism, termed VIRT, to enable full and deep interaction modeling in representation-based models.
VIRT asks representation-based encoders to conduct virtual interactions that mimic the behaviors of interaction-based models.
arXiv Detail & Related papers (2021-12-08T09:49:28Z)
- iGibson, a Simulation Environment for Interactive Tasks in Large Realistic Scenes [54.04456391489063]
iGibson is a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes.
Our environment contains fifteen fully interactive home-sized scenes populated with rigid and articulated objects.
iGibson's features enable the generalization of navigation agents, and the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of simple human-demonstrated behaviors.
arXiv Detail & Related papers (2020-12-05T02:14:17Z)
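As noted in the SELMA entry above, low-rank adaptation (LoRA) is used there for parameter-efficient training of both the audio encoder and the LLM. The following is a generic, minimal LoRA sketch in PyTorch, not SELMA's actual code (which this listing does not include); the wrapped layer size, rank, and scaling factor are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update.

    Computes base(x) + (alpha / r) * B(A(x)), where only the low-rank
    factors A and B are trained. Rank r and alpha are illustrative defaults.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as a zero (identity) update
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# Usage: wrap selected projection layers of a pretrained audio encoder or LLM,
# then train only the (small) LoRA parameters.
layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")   # 2 * 512 * 8 = 8192
```

Training only the low-rank factors leaves the pretrained weights untouched, which is what makes this kind of adaptation parameter-efficient.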