Auto-Regressive Diffusion for Generating 3D Human-Object Interactions
- URL: http://arxiv.org/abs/2503.16801v1
- Date: Fri, 21 Mar 2025 02:25:59 GMT
- Title: Auto-Regressive Diffusion for Generating 3D Human-Object Interactions
- Authors: Zichen Geng, Zeeshan Hayder, Wei Liu, Ajmal Saeed Mian
- Abstract summary: A key challenge in HOI generation is maintaining interaction consistency in long sequences. We propose an autoregressive diffusion model (ARDHOI) that predicts the next continuous token. Our model has been evaluated on the OMOMO and BEHAVE datasets.
- Score: 5.587507490937267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-driven Human-Object Interaction (Text-to-HOI) generation is an emerging field with applications in animation, video games, virtual reality, and robotics. A key challenge in HOI generation is maintaining interaction consistency in long sequences. Existing Text-to-Motion-based approaches, such as discrete motion tokenization, cannot be directly applied to HOI generation due to limited data in this domain and the complexity of the modality. To address the problem of interaction consistency in long sequences, we propose an autoregressive diffusion model (ARDHOI) that predicts the next continuous token. Specifically, we introduce a Contrastive Variational Autoencoder (cVAE) to learn a physically plausible space of continuous HOI tokens, thereby ensuring that generated human-object motions are realistic and natural. For generating sequences autoregressively, we develop a Mamba-based context encoder to capture and maintain consistent sequential actions. Additionally, we implement an MLP-based denoiser to generate the subsequent token conditioned on the encoded context. Our model has been evaluated on the OMOMO and BEHAVE datasets, where it outperforms existing state-of-the-art methods in terms of both performance and inference speed. This makes ARDHOI a robust and efficient solution for text-driven HOI tasks.
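The abstract outlines a three-part pipeline: a contrastive VAE that maps short human-object motion clips to continuous HOI tokens, a Mamba-based context encoder over previously generated tokens, and an MLP denoiser that produces the next token conditioned on that context and the text prompt. The PyTorch sketch below is only a minimal illustration under stated assumptions: the module sizes, the 263+6 motion dimensionality, the CLIP-style text embedding, the toy denoising loop, and the GRU standing in for the Mamba encoder are all placeholders rather than the authors' implementation.

```python
# Minimal, illustrative sketch of an ARDHOI-style autoregressive diffusion pipeline.
# All dimensions and the diffusion schedule are assumptions; a GRU stands in for the
# Mamba-based context encoder so the example stays dependency-free.
import torch
import torch.nn as nn

class HOITokenEncoder(nn.Module):
    """Stand-in for the contrastive VAE: maps a human+object motion clip
    to a continuous HOI token (sample of a diagonal Gaussian)."""
    def __init__(self, motion_dim=263 + 6, token_dim=64):   # 263 body + 6-DoF object (assumed)
        super().__init__()
        self.net = nn.Sequential(nn.Linear(motion_dim, 256), nn.SiLU(),
                                 nn.Linear(256, 2 * token_dim))
    def forward(self, clip):                                 # clip: (B, motion_dim)
        mu, logvar = self.net(clip).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, mu, logvar

class ContextEncoder(nn.Module):
    """GRU stand-in for the Mamba-based encoder over past HOI tokens."""
    def __init__(self, token_dim=64, ctx_dim=128):
        super().__init__()
        self.rnn = nn.GRU(token_dim, ctx_dim, batch_first=True)
    def forward(self, tokens):                               # tokens: (B, T, token_dim)
        _, h = self.rnn(tokens)
        return h[-1]                                         # (B, ctx_dim)

class MLPDenoiser(nn.Module):
    """Predicts the clean next token from a noisy token, timestep, context, and text."""
    def __init__(self, token_dim=64, ctx_dim=128, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + 1 + ctx_dim + text_dim, 512), nn.SiLU(),
            nn.Linear(512, token_dim))
    def forward(self, x_t, t, ctx, text):
        t = t.float().unsqueeze(-1) / 1000.0                 # crude timestep embedding
        return self.net(torch.cat([x_t, t, ctx, text], dim=-1))

@torch.no_grad()
def generate_next_token(denoiser, ctx, text, steps=50, token_dim=64):
    """Toy denoising loop: turn Gaussian noise into the next continuous HOI token."""
    x = torch.randn(ctx.size(0), token_dim)
    for t in reversed(range(steps)):
        x0_hat = denoiser(x, torch.full((ctx.size(0),), t), ctx, text)
        alpha = t / steps
        x = alpha * x + (1 - alpha) * x0_hat                 # move toward the current estimate
    return x

# Usage with random placeholders for previously generated tokens and a text embedding.
enc, ctx_enc, den = HOITokenEncoder(), ContextEncoder(), MLPDenoiser()
tokens = torch.randn(2, 8, 64)        # 8 previously generated HOI tokens
text = torch.randn(2, 512)            # e.g. a CLIP-style text embedding (assumption)
next_tok = generate_next_token(den, ctx_enc(tokens), text)
```

At inference, this loop would be repeated token by token, re-encoding the growing context at each step and finally decoding the token sequence back to human-object motion with the cVAE decoder (omitted here).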
Related papers
- Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation [25.770855154106453]
We introduce an Efficient Explicit Joint-level Interaction Model (EJIM) for generating text-guided human-object interactions.
EJIM features a Dual-branch HOI Mamba that separately and efficiently models human and object motions.
We show that EJIM surpasses previous works by a large margin while using only 5% of the inference time.
arXiv Detail & Related papers (2025-03-29T15:23:21Z) - Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression [23.99292102237088]
We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics.
After post-training, this model can be used as a video simulator for evaluating policies and generating synthetic data.
arXiv Detail & Related papers (2025-02-06T18:38:26Z) - InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.
Our key insight is that large video generation models can act as both neural renderers and implicit physics simulators, having learned interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z) - HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects [86.86284624825356]
HIMO is a dataset of full-body humans interacting with multiple objects.
HIMO contains 3.3K 4D HOI sequences and 4.08M 3D HOI frames.
arXiv Detail & Related papers (2024-07-17T07:47:34Z) - Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer [62.29951737214263]
Existing algorithms directly generate the full sequence, which is expensive and prone to errors.
We propose KeyMotion, which generates plausible human motion sequences corresponding to input text.
We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the keyframes into a latent space.
For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the design latents and text condition.
arXiv Detail & Related papers (2024-05-24T11:12:37Z) - HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models [42.62823339416957]
We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts.
We first develop a dual-branch diffusion model (HOI-DM) to generate both human and object motions conditioned on the input text.
We also develop an affordance prediction diffusion model (APDM) to predict the contacting area between the human and object.
arXiv Detail & Related papers (2023-12-11T17:41:17Z) - Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models [71.64318025625833]
This paper presents a novel approach to generating the 3D motion of a human interacting with a target object.
Our framework first generates a set of milestones and then synthesizes the motion along them.
The experiments on the NSM, COUCH, and SAMP datasets show that our approach outperforms previous methods by a large margin in both quality and diversity.
arXiv Detail & Related papers (2023-10-03T17:50:23Z) - UDE: A Unified Driving Engine for Human Motion Generation [16.32286289924454]
UDE is the first unified driving engine that enables generating human motion sequences from natural language or audio sequences.
We evaluate our method on the HumanML3D and AIST++ benchmarks.
arXiv Detail & Related papers (2022-11-29T08:30:52Z) - End-to-end Contextual Perception and Prediction with Interaction Transformer [79.14001602890417]
We tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving.
To capture their spatial-temporal dependencies, we propose a recurrent neural network with a novel Transformer architecture.
Our model can be trained end-to-end, and runs in real-time.
arXiv Detail & Related papers (2020-08-13T14:30:12Z) - Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs tend to ignore latent variables when paired with a strong auto-regressive decoder.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)