Related papers: Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation

URL: http://arxiv.org/abs/2602.10659v1
Date: Wed, 11 Feb 2026 09:04:28 GMT
Title: Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation
Authors: Yin Wang, Ziyao Zhang, Zhiying Leng, Haitian Liu, Frederick W. B. Li, Mu Li, Xiaohui Liang
Abstract summary: We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping. We propose MP-HOI, a novel framework grounded in four core insights.
Score: 26.16137102387553
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, enabling expressive object representations, which tackles Q2 in data representation. (3) Modality-Aware Mixture-of-Experts (MoE) Model: We propose a modality-aware MoE model for effective multimodal feature fusion, which tackles Q1 and Q2 in feature fusion. (4) Cascaded Diffusion with Interaction Supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, which tackles Q3 in interaction refinement. Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained HOI motions.

Related papers

Interaction-aware Representation Modeling with Co-occurrence Consistency for Egocentric Hand-Object Parsing [20.40288070674112]
We propose an end-to-end Interaction-aware Transformer (InterFormer). It integrates three key components, i.e., a Dynamic Query Generator (DQG), a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss. Our model achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets.
arXiv Detail & Related papers (2026-02-24T06:39:18Z)
UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework [54.337290937468175]
We propose UniMo, an autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework. We show that our method simultaneously generates corresponding videos and motions while performing accurate motion capture.
arXiv Detail & Related papers (2025-12-03T16:03:18Z)
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z)
DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data [67.99373622902827]
DIPO is a framework for controllable generation of articulated 3D objects from a pair of images. We propose a dual-image diffusion model that captures relationships between the image pair to generate part layouts and joint parameters. We propose PM-X, a large-scale dataset of complex articulated 3D objects, accompanied by rendered images, URDF annotations, and textual descriptions.
arXiv Detail & Related papers (2025-05-26T18:55:14Z)
MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation [28.75149480374178]
MEgoHand is a framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. It achieves substantial reductions in wrist translation error and joint rotation error, highlighting its capacity to accurately model fine-grained hand joint structures.
arXiv Detail & Related papers (2025-05-22T12:37:47Z)
HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects [86.86284624825356]
HIMO is a dataset of full-body humans interacting with multiple objects. HIMO contains 3.3K 4D HOI sequences and 4.08M 3D HOI frames.
arXiv Detail & Related papers (2024-07-17T07:47:34Z)
HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models [42.62823339416957]
We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts. We first develop a dual-branch diffusion model (HOI-DM) to generate both human and object motions conditioned on the input text. We also develop an affordance prediction diffusion model (APDM) to predict the contacting area between the human and object.
arXiv Detail & Related papers (2023-12-11T17:41:17Z)
DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions. We show that DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
InterDiff: Generating 3D Human-Object Interactions with Physics-Informed Diffusion [29.25063155767897]
This paper addresses a novel task of anticipating 3D human-object interactions (HOIs). Our task is significantly more challenging, as it requires modeling dynamic objects with various shapes, capturing whole-body motion, and ensuring physically valid interactions. Experiments on multiple human-object interaction datasets demonstrate the effectiveness of our method for this task, capable of producing realistic, vivid, and remarkably long-term 3D HOI predictions.
arXiv Detail & Related papers (2023-08-31T17:59:08Z)
Stochastic Multi-Person 3D Motion Forecasting [21.915057426589744]
We address real-world complexities ignored in prior work on human motion forecasting. Our framework is general; we instantiate it with different generative models. Our approach produces diverse and accurate multi-person predictions, significantly outperforming the state of the art.
arXiv Detail & Related papers (2023-06-08T17:59:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.