Related papers: Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication

Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication

URL: http://arxiv.org/abs/2505.04996v1
Date: Thu, 08 May 2025 07:00:58 GMT
Title: Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication
Authors: Jinhe Huang, Yongkang Cheng, Yuming Hang, Gaoge Han, Jinewei Li, Jing Zhang, Xingjian Gu,
Abstract summary: This paper proposes an Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication.<n>For the first time, we integrate the full-body gestures of listeners into the generation framework.
Score: 4.49451692966442
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Full-body gestures play a pivotal role in natural interactions and are crucial for achieving effective communication. Nevertheless, most existing studies primarily focus on the gesture generation of speakers, overlooking the vital role of listeners in the interaction process and failing to fully explore the dynamic interaction between them. This paper innovatively proposes an Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication. For the first time, we integrate the full-body gestures of listeners into the generation framework. By devising a novel inter-diffusion mechanism, this model can accurately capture the complex interaction patterns between speakers and listeners during communication. In the model construction process, based on the advanced diffusion model architecture, we innovatively introduce interaction conditions and the GAN model to increase the denoising step size. As a result, when generating gesture sequences, the model can not only dynamically generate based on the speaker's speech information but also respond in realtime to the listener's feedback, enabling synergistic interaction between the two. Abundant experimental results demonstrate that compared with the current state-of-the-art gesture generation methods, the model we proposed has achieved remarkable improvements in the naturalness, coherence, and speech-gesture synchronization of the generated gestures. In the subjective evaluation experiments, users highly praised the generated interaction scenarios, believing that they are closer to real life human communication situations. Objective index evaluations also show that our model outperforms the baseline methods in multiple key indicators, providing more powerful support for effective communication.

Related papers

Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning [66.52010873968383]
We introduce a conversational agent that interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through reinforcement learning (RL) training.<n>The experimental results across four widely used conversational benchmarks demonstrate the effectiveness of our methods.
arXiv Detail & Related papers (2026-01-19T14:55:54Z)
Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models [80.28579390566298]
We introduce Interact2Ar, a text-conditioned autoregressive diffusion model for generating full-body, human-human interactions.<n>Hand kinematics are incorporated through dedicated parallel branches, enabling high-fidelity full-body generation.<n>Our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios.
arXiv Detail & Related papers (2025-12-22T18:59:50Z)
InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation [1.7523719472700858]
We introduce InteracTalker, a novel framework that seamlessly integrates prompt-based object-aware interactions with co-speech gesture generation.<n>Our framework utilizes a Generalized Motion Adaptation Module that enables independent training, adapting to the corresponding motion condition.<n>InteracTalker successfully unifies these previously separate tasks, outperforming prior methods in both co-speech gesture generation and object-interaction synthesis.
arXiv Detail & Related papers (2025-12-14T12:29:49Z)
Social Agent: Mastering Dyadic Nonverbal Behavior Generation via Conversational LLM Agents [13.902411927285328]
Social Agent is a novel framework for synthesizing realistic and contextually appropriate co-speech nonverbal behaviors in dyadic conversations.<n>We develop an agentic system driven by a Large Language Model (LLM) to direct the conversation flow and determine appropriate interactive behaviors for both participants.<n>We propose a novel dual-person gesture generation model based on an auto-regressive diffusion model, which synthesizes coordinated motions from speech signals.
arXiv Detail & Related papers (2025-10-06T09:41:37Z)
FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Speech-computer human interaction enables real-time spoken dialogue systems.<n>Modelling and benchmarking these models remains a fundamental challenge.<n>We introduce FLEXI, the first benchmark for full-human spoken interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z)
Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [113.25650486482762]
We introduce the Seamless Interaction dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage.<n>This dataset enables the development of AI technologies that understand dyadic embodied dynamics.<n>We develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech.
arXiv Detail & Related papers (2025-06-27T18:09:49Z)
Aligning Spoken Dialogue Models from User Interactions [55.192134724622235]
We propose a novel preference alignment framework to improve spoken dialogue models on realtime conversations from user interactions.<n>We create a dataset of more than 150,000 preference pairs from raw multi-turn speech conversations annotated with AI feedback.<n>Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
arXiv Detail & Related papers (2025-06-26T16:45:20Z)
AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [65.53676584955686]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans.<n>We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis.<n>AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z)
Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection [24.71649541757314]
Short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue.<n>This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection model.
arXiv Detail & Related papers (2024-10-21T11:57:56Z)
PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, integrating psychology-grounded principles of personality: social practice, consistency, and dynamic development. We incorporate personality traits directly into the model parameters, enhancing the model's resistance to induction, promoting consistency, and supporting the dynamic evolution of personality.
arXiv Detail & Related papers (2024-07-17T08:13:22Z)
Beyond Talking -- Generating Holistic 3D Human Dyadic Motion for Communication [17.294279444027563]
We introduce an innovative task focused on human communication, aiming to generate 3D holistic human motions for both speakers and listeners. We consider the real-time mutual influence between the speaker and the listener and propose a novel chain-like transformer-based auto-regressive model. Our approach demonstrates state-of-the-art performance on two benchmark datasets.
arXiv Detail & Related papers (2024-03-28T14:47:32Z)
Dyadic Interaction Modeling for Social Behavior Generation [6.626277726145613]
We present an effective framework for creating 3D facial motions in dyadic interactions. The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach. Experiments demonstrate the superiority of our framework in generating listener motions.
arXiv Detail & Related papers (2024-03-14T03:21:33Z)
InterGen: Diffusion-based Multi-human Motion Generation under Complex Interactions [49.097973114627344]
We present InterGen, an effective diffusion-based approach that incorporates human-to-human interactions into the motion diffusion process. We first contribute a multimodal dataset, named InterHuman. It consists of about 107M frames for diverse two-person interactions, with accurate skeletal motions and 23,337 natural language descriptions. We propose a novel representation for motion input in our interaction diffusion model, which explicitly formulates the global relations between the two performers in the world frame.
arXiv Detail & Related papers (2023-04-12T08:12:29Z)
A Probabilistic Model Of Interaction Dynamics for Dyadic Face-to-Face Settings [1.9544213396776275]
We develop a probabilistic model to capture the interaction dynamics between pairs of participants in a face-to-face setting. This interaction encoding is then used to influence the generation when predicting one agent's future dynamics. We show that our model successfully delineates between the modes, based on their interacting dynamics.
arXiv Detail & Related papers (2022-07-10T23:31:27Z)
Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations. We autoregressively output multiple possibilities of corresponding listener motion. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
VIRT: Improving Representation-based Models for Text Matching through Virtual Interaction [50.986371459817256]
We propose a novel textitVirtual InteRacTion mechanism, termed as VIRT, to enable full and deep interaction modeling in representation-based models. VIRT asks representation-based encoders to conduct virtual interactions to mimic the behaviors as interaction-based models do.
arXiv Detail & Related papers (2021-12-08T09:49:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.