Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction
- URL: http://arxiv.org/abs/2505.21043v1
- Date: Tue, 27 May 2025 11:24:38 GMT
- Title: Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction
- Authors: Sam O'Connor Russell, Naomi Harte
- Abstract summary: Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only PTTM in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy).
- Score: 7.412918099791407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only PTTM in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work, which aggregates all holds and shifts, we group holds and shifts by the duration of silence between turns. This reveals that, through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another, visual cues are vital for turn-taking and must therefore be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
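The abstract describes two concrete ideas that a short sketch can make tangible: fusing speech features with visual features (facial expression, head pose, gaze) to predict hold vs. shift, and evaluating accuracy grouped by the duration of silence between turns rather than pooled over all events. The sketch below is an illustration only, written in PyTorch with hypothetical feature names and dimensions; it is not the authors' MM-VAP implementation, which they state is publicly released.

```python
# A minimal sketch (not the authors' MM-VAP code): late fusion of
# pre-extracted audio and visual features for hold/shift prediction,
# plus evaluation grouped by inter-turn silence duration.
# Feature names and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn


class MultimodalHoldShift(nn.Module):
    """Two-branch fusion classifier over per-window features."""

    def __init__(self, audio_dim: int = 256, visual_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        # Project each modality separately, then fuse by concatenation.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # e.g. face/head-pose/gaze features
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # logits for {hold, shift}
        )

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.audio_proj(audio_feats), self.visual_proj(visual_feats)], dim=-1)
        return self.classifier(fused)


def accuracy_by_silence(preds, labels, silence_s, edges=(0.25, 0.5, 1.0)):
    """Accuracy per silence-duration bucket instead of one pooled number."""
    buckets: dict[str, list[int]] = {}
    for p, y, s in zip(preds, labels, silence_s):
        key = next((f"<{e}s" for e in edges if s < e), f">={edges[-1]}s")
        hit, total = buckets.get(key, [0, 0])
        buckets[key] = [hit + int(p == y), total + 1]
    return {k: hit / total for k, (hit, total) in buckets.items()}


# Usage (random tensors stand in for real features):
# logits = MultimodalHoldShift()(torch.randn(8, 256), torch.randn(8, 64))
```

The silence-duration bucketing mirrors the paper's choice to report hold/shift accuracy per transition-gap group rather than as a single aggregate figure; the bucket edges here are arbitrary placeholders.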
Related papers
- MultiVox: Benchmarking Voice Assistants for Multimodal Interactions [43.55740197419447]
We introduce MultiVox, the first benchmark to evaluate the ability of voice assistants to integrate spoken and visual cues. Our evaluation on 9 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
arXiv Detail & Related papers (2025-07-14T23:20:42Z)
- PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling [78.61911985138795]
We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. We propose the Predictive Future Modeling (PreFM) framework, which infers and integrates beneficial future audio-visual cues through predictive multimodal future modeling. Experiments show PreFM outperforms state-of-the-art methods by a large margin with significantly fewer parameters.
arXiv Detail & Related papers (2025-05-29T06:46:19Z)
- SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering [0.0]
We introduce SViQA, a unified speech-vision model that processes spoken questions without text transcription. Building upon the LLaVA architecture, our framework bridges auditory and visual modalities through two key innovations. Extensive experimental results on the SBVQA benchmark demonstrate the proposed SViQA's state-of-the-art performance.
arXiv Detail & Related papers (2025-04-01T07:15:32Z)
- Vision-Speech Models: Teaching Speech Models to Converse about Images [67.62394024470528]
We introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis.
arXiv Detail & Related papers (2025-03-19T18:40:45Z)
- Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection [24.71649541757314]
Short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue. This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection model.
arXiv Detail & Related papers (2024-10-21T11:57:56Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- MAAS: Multi-modal Assignation for Active Speaker Detection [59.08836580733918]
We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem.
Our experiments show that a small graph data structure built from a single frame allows us to approximate an instantaneous audio-visual assignment problem.
arXiv Detail & Related papers (2021-01-11T02:57:25Z)
- Let's Face It: Probabilistic Multi-modal Interlocutor-aware Generation of Facial Gestures in Dyadic Settings [11.741529272872219]
To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors.
Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synthesizing non-verbal behavior.
We introduce a probabilistic method to synthesize interlocutor-aware facial gestures in dyadic conversations.
arXiv Detail & Related papers (2020-06-11T14:11:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.