A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment
- URL: http://arxiv.org/abs/2503.06241v1
- Date: Sat, 08 Mar 2025 14:53:20 GMT
- Title: A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment
- Authors: Koji Inoue, Yuki Okafuji, Jun Baba, Yoshiki Ohira, Katsuya Hyodo, Tatsuya Kawahara
- Abstract summary: We propose a noise-robust voice activity projection (VAP) model to enhance real-time turn-taking in dialogue robots. We conducted a field experiment in a shopping mall, comparing the VAP system with a conventional cloud-based speech recognition system. The results showed that the proposed system significantly reduced response latency, leading to a more natural conversation.
- Score: 18.814181652728486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Turn-taking is a crucial aspect of human-robot interaction, directly influencing conversational fluidity and user engagement. While previous research has explored turn-taking models in controlled environments, their robustness in real-world settings remains underexplored. In this study, we propose a noise-robust voice activity projection (VAP) model, based on a Transformer architecture, to enhance real-time turn-taking in dialogue robots. To evaluate the effectiveness of the proposed system, we conducted a field experiment in a shopping mall, comparing the VAP system with a conventional cloud-based speech recognition system. Our analysis covered both subjective user evaluations and objective behavioral analysis. The results showed that the proposed system significantly reduced response latency, leading to a more natural conversation where both the robot and users responded faster. The subjective evaluations suggested that faster responses contribute to a better interaction experience.
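For readers unfamiliar with voice activity projection, the sketch below illustrates the general idea behind a VAP-style predictor: a causal Transformer consumes per-frame acoustic features from both speakers and, for every frame, outputs the probability that each speaker is active in several future time bins, which a robot can threshold to decide when to respond. This is a minimal illustrative sketch in PyTorch; the feature extraction, model sizes, output layout, and the 0.5 threshold are all assumptions, not the authors' noise-robust implementation.

```python
# Minimal sketch of a VAP-style turn-taking predictor (illustrative;
# not the paper's noise-robust implementation).
import torch
import torch.nn as nn

class VAPSketch(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_heads=4,
                 n_layers=4, horizon_bins=4):
        super().__init__()
        # Fuse per-frame acoustic features (e.g. log-mel) of the two
        # speakers by concatenation, then project to the model width.
        self.proj = nn.Linear(2 * feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Per frame, predict each speaker's activity probability in
        # each of `horizon_bins` future time bins.
        self.head = nn.Linear(d_model, 2 * horizon_bins)

    def forward(self, user_feats, robot_feats):
        # user_feats, robot_feats: (batch, frames, feat_dim)
        x = self.proj(torch.cat([user_feats, robot_feats], dim=-1))
        # Causal mask so each frame attends only to the past, as
        # required for streaming, real-time use.
        frames = x.size(1)
        mask = torch.triu(
            torch.full((frames, frames), float("-inf")), diagonal=1)
        h = self.encoder(x, mask=mask)
        return torch.sigmoid(self.head(h))  # (batch, frames, 2 * bins)

# Toy usage: decide whether the robot should take the turn now.
model = VAPSketch()
user = torch.randn(1, 100, 80)   # 100 frames of user features
robot = torch.randn(1, 100, 80)  # 100 frames of robot features
latest = model(user, robot)[0, -1]  # predictions at the newest frame
# Assumed layout: first 4 bins = user, last 4 bins = robot.
user_keeps_talking = latest[:4].mean()
if user_keeps_talking < 0.5:  # hypothetical decision threshold
    print("User is likely done speaking: respond now.")
```

In a deployment of this kind, the forward pass would run every frame over a sliding window of recent audio, so the turn-taking decision is available immediately rather than after a cloud ASR endpoint detects the end of an utterance, which is consistent with the latency reduction the abstract reports.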
Related papers
- Exploring the Impact of Personality Traits on Conversational Recommender Systems: A Simulation with Large Language Models [70.180385882195]
This paper introduces a personality-aware user simulation for Conversational Recommender Systems (CRSs).
The user agent induces customizable personality traits and preferences, while the system agent possesses the persuasion capability to simulate realistic interaction in CRSs.
Experimental results demonstrate that state-of-the-art LLMs can effectively generate diverse user responses aligned with specified personality traits.
arXiv Detail & Related papers (2025-04-09T13:21:17Z) - Applying General Turn-taking Models to Conversational Human-Robot Interaction [3.8673630752805446]
This paper investigates the application of general turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to improve conversational dynamics in HRI. We propose methods for using these models in tandem to predict when a robot should begin preparing responses, take turns, and handle potential interruptions.
arXiv Detail & Related papers (2025-01-15T16:49:22Z) - Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis [3.210706100833053]
We propose and implement a fully integrated system that replaces conventional audio feature extraction (AFE) models with OpenAI's Whisper.
We show that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions.
arXiv Detail & Related papers (2024-11-20T11:18:05Z) - Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents [23.960719833886984]
M-CoDAL is a multimodal dialogue system specifically designed for embodied agents to better understand and communicate in safety-critical situations. Our approach is evaluated using a newly created multimodal dataset comprising 1K safety violations extracted from 2K Reddit images. Results on this dataset demonstrate that our approach improves the resolution of safety situations, user sentiment, and the safety of the conversation.
arXiv Detail & Related papers (2024-10-18T03:26:06Z) - Analysis and Detection of Differences in Spoken User Behaviors between Autonomous and Wizard-of-Oz Systems [21.938414385824903]
We analyzed user spoken behaviors in both attentive listening and job interview dialogue scenarios.
Results revealed significant differences in metrics such as speech length, speaking rate, fillers, backchannels, disfluencies, and laughter.
We developed predictive models to distinguish between operator and autonomous system conditions.
arXiv Detail & Related papers (2024-10-04T05:07:55Z) - WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model [92.90127398282209]
This paper investigates the potential of integrating the most recent Large Language Models (LLMs) with existing visual grounding and robotic grasping systems.
We introduce WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration.
We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
arXiv Detail & Related papers (2023-08-30T11:35:21Z) - The Effects of Interactive AI Design on User Behavior: An Eye-tracking Study of Fact-checking COVID-19 Claims [12.00747200817161]
We conducted a lab-based eye-tracking study to investigate how the interactivity of an AI-powered fact-checking system affects user interactions.
We found that the ability to interactively manipulate the AI system's prediction parameters affected users' dwell times and eye fixations on areas of interest (AOIs), but not their mental workload.
arXiv Detail & Related papers (2022-02-17T21:08:57Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - Nonprehensile Riemannian Motion Predictive Control [57.295751294224765]
We introduce a novel Real-to-Sim reward analysis technique to reliably imagine and predict the outcome of taking possible actions for a real robotic platform.
We produce a closed-loop controller to reactively push objects in a continuous action space.
We observe that RMPC is robust in cluttered as well as occluded environments and outperforms the baselines.
arXiv Detail & Related papers (2021-11-15T18:50:04Z) - Smoothing Dialogue States for Open Conversational Machine Reading [70.83783364292438]
We propose an effective gating strategy that smooths the two dialogue states in a single decoder and bridges decision making and question generation.
Experiments on the OR-ShARC dataset show the effectiveness of our method, which achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-08-28T08:04:28Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
Research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)