CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR
- URL: http://arxiv.org/abs/2411.04671v1
- Date: Thu, 07 Nov 2024 12:55:17 GMT
- Title: CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR
- Authors: Kadir Burak Buldu, Süleyman Özdel, Ka Hei Carrie Lau, Mengdi Wang, Daniel Saad, Sofie Schönborn, Auxane Boch, Enkelejda Kasneci, Efe Bozkir
- Abstract summary: Large language model (LLM)-powered non-player characters (NPCs) with speech-to-text (STT) and text-to-speech (TTS) models bring significant advantages over conventional or pre-scripted NPCs for facilitating more natural conversational user interfaces (CUIs) in XR.
We provide the community with an open-source, customizable, and privacy-aware Unity package, CUIfy, that facilitates speech-based NPC-user interaction with various LLMs, STT, and TTS models.
- Score: 31.49021749468963
- Abstract: Recent developments in computer graphics, machine learning, and sensor technologies enable numerous opportunities for extended reality (XR) setups for everyday life, from skills training to entertainment. With large corporations offering consumer-grade head-mounted displays (HMDs) in an affordable way, it is likely that XR will become pervasive, and HMDs will develop as personal devices like smartphones and tablets. However, having intelligent spaces and naturalistic interactions in XR is as important as technological advances so that users grow their engagement in virtual and augmented spaces. To this end, large language model (LLM)-powered non-player characters (NPCs) with speech-to-text (STT) and text-to-speech (TTS) models bring significant advantages over conventional or pre-scripted NPCs for facilitating more natural conversational user interfaces (CUIs) in XR. In this paper, we provide the community with an open-source, customizable, extensible, and privacy-aware Unity package, CUIfy, that facilitates speech-based NPC-user interaction with various LLMs, STT, and TTS models. Our package also supports multiple LLM-powered NPCs per environment and minimizes the latency between different computational models through streaming to achieve usable interactions between users and NPCs. We publish our source code in the following repository: https://gitlab.lrz.de/hctl/cuify
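The abstract's latency point rests on streaming between the STT, LLM, and TTS stages rather than waiting for each stage to finish. The minimal Python sketch below illustrates that idea only; it is not CUIfy's actual Unity/C# API, and every function in it is a hypothetical stub standing in for a real STT, LLM, or TTS backend. The key step is flushing the LLM's partial output to TTS at sentence boundaries so the NPC can start speaking before the full reply exists.

```python
# Illustrative sketch of a streaming STT -> LLM -> TTS loop.
# All functions are hypothetical stubs, not the CUIfy package API.
import asyncio
import re
from typing import AsyncIterator


async def transcribe(audio: bytes) -> str:
    """Stub STT call; a real setup would send audio to an STT model."""
    await asyncio.sleep(0.1)  # simulate model/network latency
    return "Hello, who are you?"


async def stream_llm_reply(prompt: str) -> AsyncIterator[str]:
    """Stub LLM call that yields the reply token by token (streaming)."""
    reply = "I am an NPC living in this virtual museum. Ask me anything."
    for token in reply.split():
        await asyncio.sleep(0.02)  # simulate per-token generation time
        yield token + " "


async def synthesize_and_play(sentence: str) -> None:
    """Stub TTS call; a real setup would synthesize and play audio."""
    await asyncio.sleep(0.1)
    print(f"[NPC speaks] {sentence.strip()}")


async def handle_user_utterance(audio: bytes) -> None:
    text = await transcribe(audio)
    buffer = ""
    async for token in stream_llm_reply(text):
        buffer += token
        # Flush to TTS at sentence boundaries instead of waiting for the
        # complete reply; this is what keeps perceived latency low.
        if re.search(r"[.!?]\s*$", buffer):
            await synthesize_and_play(buffer)
            buffer = ""
    if buffer.strip():
        await synthesize_and_play(buffer)


if __name__ == "__main__":
    asyncio.run(handle_user_utterance(b"<captured microphone audio>"))
```

With this structure, the first spoken sentence becomes audible after roughly one sentence of generation time rather than after the whole reply, which is the kind of usable turn-taking the package aims for.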
Related papers
- OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents [11.928422245125985]
OpenOmni is an open-source, end-to-end pipeline benchmarking tool.
It integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval-Augmented Generation, and Large Language Models.
It supports local and cloud deployment, ensuring data privacy and supporting latency and accuracy benchmarking.
arXiv Detail & Related papers (2024-08-06T09:02:53Z)
- ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning [74.58666091522198]
We present a framework for intuitive robot programming by non-experts.
We leverage natural language prompts and contextual information from the Robot Operating System (ROS).
Our system integrates large language models (LLMs), enabling non-experts to articulate task requirements to the system through a chat interface.
arXiv Detail & Related papers (2024-06-28T08:28:38Z)
- Autonomous Workflow for Multimodal Fine-Grained Training Assistants Towards Mixed Reality [28.27036270001756]
This work designs an autonomous workflow tailored for integrating AI agents seamlessly into extended reality (XR) applications for fine-grained training.
We present a demonstration of a multimodal fine-grained training assistant for LEGO brick assembly in a pilot XR environment.
arXiv Detail & Related papers (2024-05-16T14:20:30Z)
- LEGENT: Open Platform for Embodied Agents [60.71847900126832]
We introduce LEGENT, an open, scalable platform for developing embodied agents using Large Language Models (LLMs) and Large Multimodal Models (LMMs).
LEGENT offers a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface.
In experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks.
arXiv Detail & Related papers (2024-04-28T16:50:12Z)
- ChatTracer: Large Language Model Powered Real-time Bluetooth Device Tracking System [7.21848268647674]
We present ChatTracer, an LLM-powered real-time Bluetooth device tracking system.
ChatTracer comprises an array of Bluetooth sniffing nodes, a database, and a fine-tuned LLM.
We have built a prototype of ChatTracer with four sniffing nodes.
arXiv Detail & Related papers (2024-03-28T21:04:11Z)
- Embedding Large Language Models into Extended Reality: Opportunities and Challenges for Inclusion, Engagement, and Privacy [37.061999275101904]
We argue for using large language models in XR by embedding them in avatars or as narratives to facilitate inclusion.
We speculate that combining the information provided to LLM-powered spaces by users and the biometric data obtained might lead to novel privacy invasions.
arXiv Detail & Related papers (2024-02-06T11:19:40Z)
- SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems [53.94772445896213]
Large Language Model (LLM)-based multi-agent systems have demonstrated promising performance in simulating human society.
We propose SpeechAgents, a multi-modal LLM based multi-agent system designed for simulating human communication.
arXiv Detail & Related papers (2024-01-08T15:01:08Z)
- Agents: An Open-source Framework for Autonomous Language Agents [98.91085725608917]
We consider language agents as a promising direction towards artificial general intelligence.
We release Agents, an open-source library with the goal of opening up these advances to a wider non-specialist audience.
arXiv Detail & Related papers (2023-09-14T17:18:25Z)
- GPT Models Meet Robotic Applications: Co-Speech Gesturing Chat System [8.660929270060146]
This technical paper introduces a chatting robot system that utilizes recent advancements in large-scale language models (LLMs).
The system is integrated with a co-speech gesture generation system, which selects appropriate gestures based on the conceptual meaning of speech.
arXiv Detail & Related papers (2023-05-10T10:14:16Z)
- Unmasking Communication Partners: A Low-Cost AI Solution for Digitally Removing Head-Mounted Displays in VR-Based Telepresence [62.997667081978825]
Face-to-face conversation in Virtual Reality (VR) is a challenge when participants wear head-mounted displays (HMDs).
Past research has shown that high-fidelity face reconstruction with personal avatars in VR is possible under laboratory conditions with high-cost hardware.
We propose one of the first low-cost systems for this task which uses only open source, free software and affordable hardware.
arXiv Detail & Related papers (2020-11-06T23:17:12Z)