Related papers: CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR

URL: http://arxiv.org/abs/2411.04671v3
Date: Mon, 03 Mar 2025 13:41:33 GMT
Title: CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR
Authors: Kadir Burak Buldu, Süleyman Özdel, Ka Hei Carrie Lau, Mengdi Wang, Daniel Saad, Sofie Schönborn, Auxane Boch, Enkelejda Kasneci, Efe Bozkir
Abstract summary: Large language model (LLM)-powered non-player characters (NPCs) with speech-to-text (STT) and text-to-speech (TTS) models bring significant advantages over conventional or pre-scripted NPCs for facilitating more natural conversational user interfaces (CUIs) in XR. This paper provides the community with an open-source, customizable, extendable, and privacy-aware Unity package, CUIfy, that facilitates speech-based NPC-user interaction with widely used LLMs, STT, and TTS models.
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent developments in computer graphics, machine learning, and sensor technologies enable numerous opportunities for extended reality (XR) setups for everyday life, from skills training to entertainment. With large corporations offering affordable consumer-grade head-mounted displays (HMDs), XR will likely become pervasive, and HMDs will develop as personal devices like smartphones and tablets. However, having intelligent spaces and naturalistic interactions in XR is as important as technological advances so that users grow their engagement in virtual and augmented spaces. To this end, large language model (LLM)-powered non-player characters (NPCs) with speech-to-text (STT) and text-to-speech (TTS) models bring significant advantages over conventional or pre-scripted NPCs for facilitating more natural conversational user interfaces (CUIs) in XR. This paper provides the community with an open-source, customizable, extendable, and privacy-aware Unity package, CUIfy, that facilitates speech-based NPC-user interaction with widely used LLMs, STT, and TTS models. Our package also supports multiple LLM-powered NPCs per environment and minimizes latency between different computational models through streaming to achieve usable interactions between users and NPCs. We publish our source code in the following repository: https://gitlab.lrz.de/hctl/cuify
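
Illustrative sketch (not taken from the CUIfy package): the abstract says latency between models is reduced through streaming. One common way to do this, assumed here and not confirmed as the authors' method, is to cut the LLM's token stream into sentences and hand each sentence to TTS as soon as it is complete, instead of waiting for the full reply. The names `sentences_from_tokens`, `fake_llm_stream`, and `fake_tts` are hypothetical stand-ins, and Python is used only for brevity; the actual package targets Unity.

```python
from typing import Iterable, Iterator

# Characters treated as sentence boundaries (a simplifying assumption).
SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete in the token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never reached a sentence boundary.
    if buffer.strip():
        yield buffer.strip()


def fake_llm_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    yield from ["Hello", " there", ".", " How", " can", " I", " help", "?"]


def fake_tts(sentence: str) -> str:
    """Hypothetical stand-in for a TTS call; returns a placeholder label."""
    return f"<audio:{sentence}>"


def speak_streaming(tokens: Iterable[str]) -> Iterator[str]:
    """Synthesize sentence by sentence so playback can start early."""
    for sentence in sentences_from_tokens(tokens):
        yield fake_tts(sentence)


if __name__ == "__main__":
    for clip in speak_streaming(fake_llm_stream()):
        print(clip)
```

Because every stage is a generator, the first audio clip is available after the first sentence rather than after the whole response, which is the general idea behind streaming between STT, LLM, and TTS stages.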

Related papers

SARAH: Spatially Aware Real-time Agentic Humans [58.32612596034656]
We present the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment.
arXiv Detail & Related papers (2026-02-20T18:59:35Z)
Fixed-Persona SLMs with Modular Memory: Scalable NPC Dialogue on Consumer Hardware [0.0]
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like text, yet their applicability to dialogue systems in computer games remains limited. In this paper, we propose a modular NPC dialogue system that leverages Small Language Models (SLMs), fine-tuned to encode specific NPC personas and integrated with runtime-swappable memory modules. While our approach is motivated by applications in gaming, its modular design and persona-driven memory architecture hold significant potential for broader adoption in domains requiring expressive, scalable, and memory-rich conversational agents, such as virtual assistants, customer support bots, or interactive educational systems.
arXiv Detail & Related papers (2025-11-13T13:03:37Z)
XR Blocks: Accelerating Human-centered AI + XR Innovation [15.103185935604323]
XR Blocks is a cross-platform framework designed to accelerate human-centered AI + XR innovation. It provides a modular architecture with plug-and-play components for core abstraction in AI + XR: user, world, peers; interface, context, and agents.
arXiv Detail & Related papers (2025-09-29T21:00:53Z)
OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model [47.84522683404745]
We present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S employs a streaming interleaved decoding architecture to achieve low-latency speech generation. By leveraging large language models to generate empathetic content and controllable text-to-speech systems, we construct a scalable training corpus with rich paralinguistic diversity.
arXiv Detail & Related papers (2025-07-07T16:31:37Z)
SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning [53.16179295245888]
We introduce SIV-Bench, a novel video benchmark for evaluating the capabilities of Multimodal Large Language Models (MLLMs) across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). SIV-Bench features 2,792 video clips and 8,792 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline. It also includes a dedicated setup for analyzing the impact of different textual cues: original on-screen text, added dialogue, or no text.
arXiv Detail & Related papers (2025-06-05T05:51:35Z)
Recent Advances and Future Directions in Extended Reality (XR): Exploring AI-Powered Spatial Intelligence [0.0]
Extended Reality (XR), encompassing Augmented Reality (AR), Virtual Reality (VR) and Mixed Reality (MR), is a transformative technology bridging the physical and virtual world. This review examines XR's evolution through foundational framework - hardware ranging from monitors to sensors and software ranging from visual tasks to user interface. For future directions, attention should be given to the integration of multi-modal AI and IoT-driven digital twins to enable adaptive XR systems.
arXiv Detail & Related papers (2025-04-22T15:11:55Z)
Towards Anthropomorphic Conversational AI Part I: A Practical Framework [49.62013440962072]
We introduce a multi-module framework designed to replicate the key aspects of human intelligence involved in conversations. In the second stage of our approach, these conversational data, after filtering and labeling, can serve as training and testing data for reinforcement learning.
arXiv Detail & Related papers (2025-02-28T03:18:39Z)
Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments. We use WavLM and Whisper encoder to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. Experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents [11.928422245125985]
OpenOmni is an open-source, end-to-end pipeline benchmarking tool. It integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval Augmented Generation, and Large Language Models. It supports local and cloud deployment, ensuring data privacy and supporting latency and accuracy benchmarking.
arXiv Detail & Related papers (2024-08-06T09:02:53Z)
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning [74.58666091522198]
We present a framework for intuitive robot programming by non-experts. We leverage natural language prompts and contextual information from the Robot Operating System (ROS). Our system integrates large language models (LLMs), enabling non-experts to articulate task requirements to the system through a chat interface.
arXiv Detail & Related papers (2024-06-28T08:28:38Z)
Autonomous Workflow for Multimodal Fine-Grained Training Assistants Towards Mixed Reality [28.27036270001756]
This work designs an autonomous workflow tailored for integrating AI agents seamlessly into extended reality (XR) applications for fine-grained training. We present a demonstration of a multimodal fine-grained training assistant for LEGO brick assembly in a pilot XR environment.
arXiv Detail & Related papers (2024-05-16T14:20:30Z)
LEGENT: Open Platform for Embodied Agents [60.71847900126832]
We introduce LEGENT, an open, scalable platform for developing embodied agents using Large Language Models (LLMs) and Large Multimodal Models (LMMs). LEGENT offers a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface. In experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks.
arXiv Detail & Related papers (2024-04-28T16:50:12Z)
ChatTracer: Large Language Model Powered Real-time Bluetooth Device Tracking System [7.21848268647674]
We present ChatTracer, an LLM-powered real-time Bluetooth device tracking system. ChatTracer comprises an array of Bluetooth sniffing nodes, a database, and a fine-tuned LLM. We have built a prototype of ChatTracer with four sniffing nodes.
arXiv Detail & Related papers (2024-03-28T21:04:11Z)
Embedding Large Language Models into Extended Reality: Opportunities and Challenges for Inclusion, Engagement, and Privacy [37.061999275101904]
We argue for using large language models in XR by embedding them in avatars or as narratives to facilitate inclusion. We speculate that combining the information provided to LLM-powered spaces by users and the biometric data obtained might lead to novel privacy invasions.
arXiv Detail & Related papers (2024-02-06T11:19:40Z)
SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems [53.94772445896213]
Large Language Model (LLM)-based multi-agent systems have demonstrated promising performance in simulating human society. We propose SpeechAgents, a multi-modal LLM based multi-agent system designed for simulating human communication.
arXiv Detail & Related papers (2024-01-08T15:01:08Z)
Agents: An Open-source Framework for Autonomous Language Agents [98.91085725608917]
We consider language agents as a promising direction towards artificial general intelligence. We release Agents, an open-source library with the goal of opening up these advances to a wider non-specialist audience.
arXiv Detail & Related papers (2023-09-14T17:18:25Z)
GPT Models Meet Robotic Applications: Co-Speech Gesturing Chat System [8.660929270060146]
This technical paper introduces a chatting robot system that utilizes recent advancements in large-scale language models (LLMs). The system is integrated with a co-speech gesture generation system, which selects appropriate gestures based on the conceptual meaning of speech.
arXiv Detail & Related papers (2023-05-10T10:14:16Z)
Unmasking Communication Partners: A Low-Cost AI Solution for Digitally Removing Head-Mounted Displays in VR-Based Telepresence [62.997667081978825]
Face-to-face conversation in Virtual Reality (VR) is a challenge when participants wear head-mounted displays (HMDs). Past research has shown that high-fidelity face reconstruction with personal avatars in VR is possible under laboratory conditions with high-cost hardware. We propose one of the first low-cost systems for this task which uses only open source, free software and affordable hardware.
arXiv Detail & Related papers (2020-11-06T23:17:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.