MCM: Multi-condition Motion Synthesis Framework for Multi-scenario
- URL: http://arxiv.org/abs/2309.03031v1
- Date: Wed, 6 Sep 2023 14:17:49 GMT
- Title: MCM: Multi-condition Motion Synthesis Framework for Multi-scenario
- Authors: Zeyu Ling, Bo Han, Yongkang Wong, Mohan Kankanhalli, Weidong Geng
- Abstract summary: We introduce MCM, a novel paradigm for motion synthesis that spans multiple scenarios under diverse conditions.
The MCM framework can be integrated with any DDPM-like diffusion model to accommodate multi-conditional input.
Our approach achieves SoTA results in text-to-motion and competitive results in music-to-dance, comparable to task-specific methods.
- Score: 28.33039094451924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of the multi-condition human motion synthesis task is to
incorporate diverse conditional inputs, encompassing various forms like text,
music, speech, and more. This endows the task with the capability to adapt
across multiple scenarios, ranging from text-to-motion and music-to-dance,
among others. While existing research has primarily focused on single
conditions, multi-condition human motion generation remains underexplored.
In this paper, we address this gap by introducing MCM, a novel paradigm
for motion synthesis that spans multiple scenarios under diverse conditions.
The MCM framework can be integrated with any DDPM-like diffusion model to
accommodate multi-conditional input while preserving its generative
capabilities. Specifically, MCM employs a two-branch architecture consisting of a
main branch and a control branch. The control branch shares the same structure
as the main branch and is initialized with the parameters of the main branch,
effectively maintaining the generation ability of the main branch and
supporting multi-condition input. We also introduce MWNet, a Transformer-based,
DDPM-like diffusion model, as our main branch; it captures the spatial
complexity and inter-joint correlations in motion sequences through a
channel-dimension self-attention module. Quantitative comparisons demonstrate
that our approach achieves SoTA results in text-to-motion and competitive
results in music-to-dance, comparable to task-specific methods.
Furthermore, the qualitative evaluation shows that MCM not only streamlines the
adaptation of methodologies originally designed for text-to-motion tasks to
domains like music-to-dance and speech-to-gesture, eliminating the need for
extensive network re-configuration, but also enables effective multi-condition
modal control, realizing "once trained is motion need".
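To make the two architectural ideas above concrete, the following is a minimal, hypothetical PyTorch sketch of (i) a channel-dimension self-attention block, in which feature channels attend to each other as one way to model inter-joint correlations, and (ii) a ControlNet-style two-branch layout in which the control branch is a copy of the main branch initialized from its weights. The class names, the zero-initialized injection layers, and the `return_hidden`/`layer_residuals` hooks are illustrative assumptions, not the authors' released code.

```python
# Minimal, hypothetical sketch of the two components described in the abstract.
# Names and injection details are illustrative assumptions, not MCM's code.
import copy

import torch
import torch.nn as nn


class ChannelSelfAttention(nn.Module):
    """Self-attention over the channel axis of a motion feature map.

    Each channel attends to every other channel, which is one way to model
    inter-joint correlations in a motion sequence.
    """

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions act as per-channel linear maps along the time axis.
        self.q = nn.Conv1d(channels, channels, kernel_size=1)
        self.k = nn.Conv1d(channels, channels, kernel_size=1)
        self.v = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scale = q.shape[-1] ** -0.5                       # dot products run over time
        attn = torch.softmax(q @ k.transpose(1, 2) * scale, dim=-1)  # (B, C, C)
        return x + attn @ v                               # residual connection


class TwoBranchDenoiser(nn.Module):
    """ControlNet-style main/control pair for a DDPM-like motion denoiser.

    The control branch is a copy of the main branch and starts from the same
    weights; its features are injected back through zero-initialized linear
    layers (an assumption borrowed from ControlNet-style designs), so the main
    branch's behaviour is untouched at the start of multi-condition training.
    """

    def __init__(self, main_branch: nn.Module, hidden_dim: int, num_layers: int):
        super().__init__()
        self.main = main_branch
        self.control = copy.deepcopy(main_branch)         # initialized from main
        self.inject = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)]
        )
        for layer in self.inject:                         # zero init => no-op at start
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, x_t, t, main_cond, extra_cond):
        # Assumes the branch module exposes per-layer hidden states and accepts
        # per-layer residuals; these hooks are hypothetical.
        hidden = self.control(x_t, t, extra_cond, return_hidden=True)
        residuals = [proj(h) for proj, h in zip(self.inject, hidden)]
        return self.main(x_t, t, main_cond, layer_residuals=residuals)
```

Under this reading, the primary condition (e.g. text) drives the main branch while an additional condition (e.g. music or speech) drives the control branch, consistent with the abstract's claim that the main branch's generative ability is preserved.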
Related papers
- Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) is proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z)
- MCM: Multi-condition Motion Synthesis Framework [15.726843047963664]
Conditional human motion synthesis (HMS) aims to generate human motion sequences that conform to specific conditions.
We propose a multi-condition HMS framework, termed MCM, based on a dual-branch structure composed of a main branch and a control branch.
arXiv Detail & Related papers (2024-04-19T13:40:25Z)
- Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z)
- A Unified Framework for Multimodal, Multi-Part Human Motion Synthesis [17.45562922442149]
We introduce a cohesive and scalable approach that consolidates multimodal (text, music, speech) and multi-part (hand, torso) human motion generation.
Our method frames the multimodal motion generation challenge as a token prediction task, drawing from specialized codebooks based on the modality of the control signal.
arXiv Detail & Related papers (2023-11-28T04:13:49Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
- Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition [81.2011058113579]
We argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps.
We propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration.
Our system achieves new state-of-the-art performance consistently.
arXiv Detail & Related papers (2023-08-08T18:11:27Z)
- Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generality of the proposed mhsf model under both pre-training+fine-tuning and fresh-training strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Multi-modal Fusion for Single-Stage Continuous Gesture Recognition [45.19890687786009]
We introduce a single-stage continuous gesture recognition framework, called Temporal Multi-Modal Fusion (TMMF).
TMMF can detect and classify multiple gestures in a video via a single model.
This approach learns the natural transitions between gestures and non-gestures without the need for a pre-processing segmentation step.
arXiv Detail & Related papers (2020-11-10T07:09:35Z)