InterMask: 3D Human Interaction Generation via Collaborative Masked Modelling
- URL: http://arxiv.org/abs/2410.10010v2
- Date: Wed, 16 Oct 2024 23:22:41 GMT
- Title: InterMask: 3D Human Interaction Generation via Collaborative Masked Modelling
- Authors: Muhammad Gohar Javed, Chuan Guo, Li Cheng, Xingyu Li
- Abstract summary: We introduce InterMask, a novel framework for generating human interactions using masked modeling in discrete space.
InterMask utilizes a generative masked modeling framework to collaboratively model the tokens of two interacting individuals.
With its enhanced motion representation, dedicated architecture, and effective learning strategy, InterMask achieves high-fidelity and diverse human interactions.
- Score: 27.544827331337178
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating realistic 3D human-human interactions from textual descriptions remains a challenging task. Existing approaches, typically based on diffusion models, often generate unnatural and unrealistic results. In this work, we introduce InterMask, a novel framework for generating human interactions using collaborative masked modeling in discrete space. InterMask first employs a VQ-VAE to transform each motion sequence into a 2D discrete motion token map. Unlike traditional 1D VQ token maps, it better preserves fine-grained spatio-temporal details and promotes spatial awareness within each token. Building on this representation, InterMask utilizes a generative masked modeling framework to collaboratively model the tokens of two interacting individuals. This is achieved by employing a transformer architecture specifically designed to capture complex spatio-temporal interdependencies. During training, it randomly masks the motion tokens of both individuals and learns to predict them. In inference, starting from fully masked sequences, it progressively fills in the tokens for both individuals. With its enhanced motion representation, dedicated architecture, and effective learning strategy, InterMask achieves state-of-the-art results, producing high-fidelity and diverse human interactions. It outperforms previous methods, achieving an FID of $5.154$ (vs $5.535$ for in2IN) on the InterHuman dataset and $0.399$ (vs $5.207$ for InterGen) on the InterX dataset. Additionally, InterMask seamlessly supports reaction generation without the need for model redesign or fine-tuning.
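The abstract describes a two-stage pipeline: a VQ-VAE tokenizes each person's motion into a 2D spatio-temporal token map, and a transformer then fills in masked tokens for both individuals over several inference steps. The sketch below is a minimal, hypothetical rendering of that iterative unmasking loop, not the authors' code; the tensor shapes, the cosine unmasking schedule, and the `transformer(tok_a, tok_b, text_emb)` interface are all assumptions made for illustration.

```python
import math
import torch

# Hypothetical sizes: T temporal x J spatial tokens per person, codebook size V.
T, J, V = 50, 5, 512
MASK_ID = V  # reserved id marking a masked position

@torch.no_grad()
def generate_interaction(transformer, text_emb, steps=10, device="cpu"):
    """Progressively unmask the 2D token maps of two interacting people.

    `transformer` stands in for the paper's collaborative inter-person
    model: it maps (tokens_a, tokens_b, text_emb) to per-position logits
    over the codebook, shape (1, T, J, V) for each person.
    """
    # Inference starts from fully masked token maps for both individuals.
    tok_a = torch.full((1, T, J), MASK_ID, dtype=torch.long, device=device)
    tok_b = torch.full((1, T, J), MASK_ID, dtype=torch.long, device=device)

    n_total = T * J
    for step in range(steps):
        # Cosine schedule: fraction of positions left masked after this step.
        ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        n_remask = int(n_total * ratio)

        logits_a, logits_b = transformer(tok_a, tok_b, text_emb)
        for tok, logits in ((tok_a, logits_a), (tok_b, logits_b)):
            probs = logits.softmax(-1)            # (1, T, J, V)
            conf, pred = probs.max(-1)            # confidence and token ids
            # Already-committed positions must never be re-masked.
            conf = conf.masked_fill(tok != MASK_ID, float("inf"))
            flat_conf = conf.view(1, -1)
            remask = torch.zeros_like(flat_conf, dtype=torch.bool)
            if n_remask > 0:
                # Re-mask the least-confident predictions, keep the rest.
                idx = flat_conf.topk(n_remask, largest=False).indices
                remask.scatter_(1, idx, True)
            remask = remask.view(1, T, J)
            filled = torch.where(remask, torch.full_like(pred, MASK_ID), pred)
            # Commit predictions only at positions that were still masked.
            tok.copy_(torch.where(tok == MASK_ID, filled, tok))
    return tok_a, tok_b  # feed to the VQ-VAE decoder to recover motions
```

Decoding the final token maps through the VQ-VAE decoder would yield the two motion sequences. Because generation starts from fully masked maps, fixing one person's real tokens and unmasking only the other's would give the reaction-generation mode the abstract mentions, with no model redesign.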
Related papers
- Text-driven Human Motion Generation with Motion Masked Diffusion Model [23.637853270123045]
Text-driven human motion generation is the task of synthesizing human motion sequences conditioned on natural language.
Current diffusion model-based approaches have outstanding performance in the diversity and multimodality of generation.
We propose the Motion Masked Diffusion Model (MMDM), a novel masking mechanism for diffusion-based human motion generation.
arXiv Detail & Related papers (2024-09-29T12:26:24Z) - Combo: Co-speech holistic 3D human motion generation and efficient customizable adaptation in harmony [55.26315526382004]
We propose a novel framework, Combo, for co-speech holistic 3D human motion generation.
In particular, we identify one fundamental challenge as the multiple-input-multiple-output nature of the generative model of interest.
Combo is not only highly effective in generating high-quality motions but also efficient in transferring identity and emotion.
arXiv Detail & Related papers (2024-08-18T07:48:49Z) - HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects [86.86284624825356]
HIMO is a dataset of full-body humans interacting with multiple objects.
HIMO contains 3.3K 4D HOI sequences and 4.08M 3D HOI frames.
arXiv Detail & Related papers (2024-07-17T07:47:34Z) - InterAct: Capture and Modelling of Realistic, Expressive and Interactive Activities between Two Persons in Daily Scenarios [12.300105542672163]
We capture 241 motion sequences in which two persons perform a realistic scenario over the whole sequence.
The audio, body motions, and facial expressions of both persons are all captured in our dataset.
We also demonstrate the first diffusion-model-based approach that directly estimates the interactive motions between two persons from their audio alone.
arXiv Detail & Related papers (2024-05-19T22:35:02Z) - in2IN: Leveraging individual Information to Generate Human INteractions [29.495166514135295]
We introduce in2IN, a novel diffusion model for human-human motion generation conditioned on individual descriptions.
We also propose DualMDM, a model composition technique that combines the motions generated with in2IN and the motions generated by a single-person motion prior pre-trained on HumanML3D.
arXiv Detail & Related papers (2024-04-15T17:59:04Z) - Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior.
Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z) - MoMask: Generative Masked Modeling of 3D Human Motions [25.168781728071046]
MoMask is a novel framework for text-driven 3D human motion generation.
A hierarchical quantization scheme is employed to represent human motion as discrete motion tokens (see the residual-quantization sketch after this list).
MoMask outperforms state-of-the-art methods on the text-to-motion generation task.
arXiv Detail & Related papers (2023-11-29T19:04:10Z) - Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions [0.0]
We develop a model that comprehends a natural language instruction and generates a segmentation mask for the target everyday object.
We build a new dataset based on the well-known Matterport3D and REVERIE datasets.
The performance of MDSM surpassed that of the baseline method by a large margin of +10.13 points in mean IoU.
arXiv Detail & Related papers (2023-07-17T16:07:07Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Interaction Replica: Tracking Human-Object Interaction and Scene Changes From Human Motion [48.982957332374866]
Modeling changes caused by humans is essential for building digital twins.
Our method combines visual localization of humans in the scene with contact-based reasoning about human-scene interactions from IMU data.
Our code, data and model are available on our project page at http://virtualhumans.mpi-inf.mpg.de/ireplica/.
arXiv Detail & Related papers (2022-05-05T17:58:06Z) - BEHAVE: Dataset and Method for Tracking Human Object Interactions [105.77368488612704]
We present the first full-body human-object interaction dataset with multi-view RGBD frames and corresponding 3D SMPL and object fits, along with the annotated contacts between them.
We use this data to learn a model that can jointly track humans and objects in natural environments with an easy-to-use portable multi-camera setup.
arXiv Detail & Related papers (2022-04-14T13:21:19Z)
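One technique referenced above deserves a brief illustration: the MoMask entry mentions a hierarchical quantization scheme, which in that line of work takes the form of residual vector quantization, where each codebook layer encodes the residual left by the layers before it. The toy sketch below shows only that core idea; the function name, layer count, and dimensions are invented for the example and do not reproduce the paper's implementation.

```python
import torch

def residual_quantize(z, codebooks):
    """Hedged sketch of hierarchical (residual) quantization: each layer
    quantizes what the previous layers failed to capture.

    z:          (N, D) latent motion features
    codebooks:  list of (K, D) tensors, one per quantization layer (assumed)
    Returns per-layer token ids and the cumulative reconstruction.
    """
    residual = z
    recon = torch.zeros_like(z)
    ids = []
    for cb in codebooks:
        # Nearest codebook entry for the current residual.
        dists = torch.cdist(residual, cb)          # (N, K)
        idx = dists.argmin(dim=1)                  # (N,)
        quantized = cb[idx]                        # (N, D)
        ids.append(idx)
        recon = recon + quantized
        residual = residual - quantized            # leftover goes to next layer
    return ids, recon

# Toy usage: 3 layers, 512-entry codebooks, 128-d latents (made-up sizes).
z = torch.randn(100, 128)
codebooks = [torch.randn(512, 128) for _ in range(3)]
token_layers, z_hat = residual_quantize(z, codebooks)
```

In such schemes the first layer captures the coarse motion while later layers add progressively finer detail, which is what makes the resulting token representation hierarchical.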