Related papers: Beyond Talking -- Generating Holistic 3D Human Dyadic Motion for Communication

Beyond Talking -- Generating Holistic 3D Human Dyadic Motion for Communication

URL: http://arxiv.org/abs/2403.19467v1
Date: Thu, 28 Mar 2024 14:47:32 GMT
Title: Beyond Talking -- Generating Holistic 3D Human Dyadic Motion for Communication
Authors: Mingze Sun, Chao Xu, Xinyu Jiang, Yang Liu, Baigui Sun, Ruqi Huang,
Abstract summary: We introduce an innovative task focused on human communication, aiming to generate 3D holistic human motions for both speakers and listeners. We consider the real-time mutual influence between the speaker and the listener and propose a novel chain-like transformer-based auto-regressive model. Our approach demonstrates state-of-the-art performance on two benchmark datasets.
Score: 17.294279444027563
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In this paper, we introduce an innovative task focused on human communication, aiming to generate 3D holistic human motions for both speakers and listeners. Central to our approach is the incorporation of factorization to decouple audio features and the combination of textual semantic information, thereby facilitating the creation of more realistic and coordinated movements. We separately train VQ-VAEs with respect to the holistic motions of both speaker and listener. We consider the real-time mutual influence between the speaker and the listener and propose a novel chain-like transformer-based auto-regressive model specifically designed to characterize real-world communication scenarios effectively which can generate the motions of both the speaker and the listener simultaneously. These designs ensure that the results we generate are both coordinated and diverse. Our approach demonstrates state-of-the-art performance on two benchmark datasets. Furthermore, we introduce the HoCo holistic communication dataset, which is a valuable resource for future research. Our HoCo dataset and code will be released for research purposes upon acceptance.

Related papers

Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
We introduce textbfNexus-O, an industry-level textbfomni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data. We address three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding and reasoning capabilities across multiple modalities? Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios? Third, what strategies can be employed to curate and obtain high-quality, real-life scenario
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
It Takes Two: Real-time Co-Speech Two-person's Interaction Generation via Reactive Auto-regressive Diffusion Model [34.94330722832987]
We introduce an audio-driven, auto-regressive system designed to synthesize dynamic movements for two characters during a conversation. To the best of our knowledge, this is the first system capable of generating interactive full-body motions for two characters from speech in an online manner.
arXiv Detail & Related papers (2024-12-03T12:31:44Z)
Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection [24.71649541757314]
Short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue. This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection model.
arXiv Detail & Related papers (2024-10-21T11:57:56Z)
Combo: Co-speech holistic 3D human motion generation and efficient customizable adaptation in harmony [55.26315526382004]
We propose a novel framework, Combo, for co-speech holistic 3D human motion generation. In particular, we identify that one fundamental challenge as the multiple-input-multiple-output nature of the generative model of interest. Combo is highly effective in generating high-quality motions but also efficient in transferring identity and emotion.
arXiv Detail & Related papers (2024-08-18T07:48:49Z)
Agent-driven Generative Semantic Communication with Cross-Modality and Prediction [57.335922373309074]
We propose a novel agent-driven generative semantic communication framework based on reinforcement learning. In this work, we develop an agent-assisted semantic encoder with cross-modality capability, which can track the semantic changes, channel condition, to perform adaptive semantic extraction and sampling. The effectiveness of the designed models has been verified using the UA-DETRAC dataset, demonstrating the performance gains of the overall A-GSC framework.
arXiv Detail & Related papers (2024-04-10T13:24:27Z)
Towards Variable and Coordinated Holistic Co-Speech Motion Generation [21.244719987338243]
This paper addresses the problem of generating lifelike holistic co-speech motions for 3D avatars. We aim to achieve both with ProbTalk, a unified probabilistic framework designed to jointly model facial, hand, and body movements in speech.
arXiv Detail & Related papers (2024-03-30T13:41:57Z)
FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from audio signal. To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models [42.62823339416957]
We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts. We first develop a dual-branch diffusion model (HOI-DM) to generate both human and object motions conditioned on the input text. We also develop an affordance prediction diffusion model (APDM) to predict the contacting area between the human and object.
arXiv Detail & Related papers (2023-12-11T17:41:17Z)
Interactive Conversational Head Generation [68.76774230274076]
We introduce a new conversation head generation benchmark for synthesizing behaviors of a single interlocutor in a face-to-face conversation. The capability to automatically synthesize interlocutors which can participate in long and multi-turn conversations is vital and offer benefits for various applications.
arXiv Detail & Related papers (2023-07-05T08:06:26Z)
Curriculum Learning for Goal-Oriented Semantic Communications with a Common Language [60.85719227557608]
A holistic goal-oriented semantic communication framework is proposed to enable a speaker and a listener to cooperatively execute a set of sequential tasks. A common language based on a hierarchical belief set is proposed to enable semantic communications between speaker and listener. An optimization problem is defined to determine the perfect and abstract description of the events.
arXiv Detail & Related papers (2022-04-21T22:36:06Z)
Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations. We autoregressively output multiple possibilities of corresponding listener motion. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.