VividListener: Expressive and Controllable Listener Dynamics Modeling for Multi-Modal Responsive Interaction
- URL: http://arxiv.org/abs/2504.21718v1
- Date: Wed, 30 Apr 2025 15:05:12 GMT
- Title: VividListener: Expressive and Controllable Listener Dynamics Modeling for Multi-Modal Responsive Interaction
- Authors: Shiying Li, Xingqun Qi, Bingkun Yang, Chen Weile, Zezhao Tian, Muyi Sun, Qifeng Liu, Man Zhang, Zhenan Sun
- Abstract summary: We propose VividListener, a novel framework enabling fine-grained, expressive and controllable listener dynamics modeling. VividListener achieves state-of-the-art performance, realizing expressive and controllable listener dynamics.
- Score: 31.307004436877587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating responsive listener head dynamics with nuanced emotions and expressive reactions is crucial for practical dialogue modeling in various virtual avatar animations. Previous studies mainly focus on the direct short-term production of listener behavior. They overlook fine-grained control over motion variations and emotional intensity, especially in long-sequence modeling. Moreover, the lack of long-term and large-scale paired speaker-listener corpora including head dynamics and fine-grained multi-modality annotations (e.g., text-based expression descriptions, emotional intensity) also limits the application of dialogue modeling. Therefore, we first collect a large-scale multi-turn dataset of 3D dyadic conversation containing more than 1.4M valid frames for multi-modal responsive interaction, dubbed ListenerX. Additionally, we propose VividListener, a novel framework enabling fine-grained, expressive and controllable listener dynamics modeling. This framework leverages multi-modal conditions as guiding principles for fostering coherent interactions between speakers and listeners. Specifically, we design the Responsive Interaction Module (RIM) to adaptively represent the multi-modal interactive embeddings. RIM ensures the listener dynamics achieve fine-grained semantic coordination with textual descriptions and adjustments, while preserving expressive reaction with speaker behavior. Meanwhile, we design the Emotional Intensity Tags (EIT) for emotion intensity editing with multi-modal information integration, applying to both text descriptions and listener motion amplitude. Extensive experiments conducted on our newly collected ListenerX dataset demonstrate that VividListener achieves state-of-the-art performance, realizing expressive and controllable listener dynamics.
Related papers
- Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
FullDuplexBench is a benchmark that systematically evaluates key conversational behaviors. We aim to advance spoken dialogue modeling and encourage the development of more interactive and natural dialogue systems.
arXiv Detail & Related papers (2025-03-06T18:59:16Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - Dyadic Interaction Modeling for Social Behavior Generation [6.626277726145613]
We present an effective framework for creating 3D facial motions in dyadic interactions.
The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach.
Experiments demonstrate the superiority of our framework in generating listener motions.
arXiv Detail & Related papers (2024-03-14T03:21:33Z) - CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation [9.741109135330262]
Listening head generation aims to synthesize a non-verbal responsive listener head by modeling the correlation between the speaker and the listener in dynamic conversation.
We propose a user-friendly framework called CustomListener to realize the free-form text prior guided listener generation.
arXiv Detail & Related papers (2024-03-01T04:31:56Z) - MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments [82.67236400004826]
We introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
The MEM module enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities.
arXiv Detail & Related papers (2024-02-01T02:43:20Z) - From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z) - Emotional Listener Portrait: Realistic Listener Motion Simulation in
Conversation [50.35367785674921]
Listener head generation centers on generating non-verbal behaviors of a listener in reference to the information delivered by a speaker.
A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation.
We propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords.
Our ELP model can not only automatically generate natural and diverse responses toward a given speaker via sampling from the learned distribution but also generate controllable responses with a predetermined attitude.
arXiv Detail & Related papers (2023-09-29T18:18:32Z) - Speaker-Guided Encoder-Decoder Framework for Emotion Recognition in
Conversation [23.93696773727978]
The emotion recognition in conversation (ERC) task aims to predict the emotion label of an utterance in a conversation.
We design a novel speaker modeling scheme that explores intra- and inter-speaker dependencies jointly in a dynamic manner.
We also propose a Speaker-Guided-Decoder (SGED) framework for ERC, which fully exploits speaker information for the decoding of emotion.
arXiv Detail & Related papers (2022-06-07T10:51:47Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.