HINT: Hierarchical Interaction Modeling for Autoregressive Multi-Human Motion Generation
- URL: http://arxiv.org/abs/2601.20383v1
- Date: Wed, 28 Jan 2026 08:47:23 GMT
- Title: HINT: Hierarchical Interaction Modeling for Autoregressive Multi-Human Motion Generation
- Authors: Mengge Liu, Yan Di, Gu Wang, Yun Qu, Dekai Zhu, Yanyan Li, Xiangyang Ji
- Abstract summary: We introduce HINT, the first autoregressive framework for multi-human motion generation with Hierarchical INTeraction modeling in diffusion. First, HINT leverages a disentangled motion representation within a canonicalized latent space, decoupling local motion semantics from inter-person interactions. Second, HINT adopts a sliding-window strategy for efficient online generation, and aggregates local within-window and global cross-window conditions to capture past human history and inter-person dependencies and to align with text guidance.
- Score: 55.73037290387896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-driven multi-human motion generation with complex interactions remains a challenging problem. Despite progress in performance, existing offline methods, which generate fixed-length motions for a fixed number of agents, are inherently limited in handling long or variable text and varying agent counts. These limitations naturally encourage autoregressive formulations, which predict future motions step by step conditioned on all past trajectories and current text guidance. In this work, we introduce HINT, the first autoregressive framework for multi-human motion generation with Hierarchical INTeraction modeling in diffusion. First, HINT leverages a disentangled motion representation within a canonicalized latent space, decoupling local motion semantics from inter-person interactions. This design facilitates direct adaptation to varying numbers of human participants without requiring additional refinement. Second, HINT adopts a sliding-window strategy for efficient online generation, and aggregates local within-window and global cross-window conditions to capture past human history and inter-person dependencies and to align with text guidance. This strategy not only enables fine-grained interaction modeling within each window but also preserves long-horizon coherence across the entire sequence. Extensive experiments on public benchmarks demonstrate that HINT matches the performance of strong offline models and surpasses autoregressive baselines. Notably, on InterHuman, HINT achieves an FID of 3.100, significantly improving over the previous state-of-the-art score of 5.154.
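The sliding-window autoregressive idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_sampler`, `generate`, and all parameters (`window`, `context`, `dim`) are hypothetical names, the diffusion model is replaced by a random-walk stand-in, and text guidance and global cross-window conditioning are omitted. It only shows how a rollout can proceed window by window, conditioned on recent history, for any number of agents.

```python
import numpy as np

def toy_sampler(history, window, rng):
    """Hypothetical stand-in for a diffusion sampler.

    Continues every agent's motion from the last history frame
    by accumulating small random steps (a random walk).
    history: (frames, n_agents, dim) -> returns (window, n_agents, dim)
    """
    last = history[-1]
    steps = rng.normal(scale=0.01, size=(window,) + last.shape)
    return last + np.cumsum(steps, axis=0)

def generate(n_frames, n_agents, dim=4, window=8, context=16, seed=0):
    """Roll out n_frames of motion, one window at a time."""
    rng = np.random.default_rng(seed)
    motion = np.zeros((1, n_agents, dim))  # assumed initial pose frame
    while motion.shape[0] - 1 < n_frames:
        history = motion[-context:]  # local condition: most recent frames only
        next_window = toy_sampler(history, window, rng)
        motion = np.concatenate([motion, next_window], axis=0)
    return motion[1:n_frames + 1]  # drop the initial frame, trim the overshoot

if __name__ == "__main__":
    print(generate(20, n_agents=3).shape)
```

Because the agent count only appears as an array axis, the same loop runs unchanged for two, three, or more agents, and the sequence length is bounded only by how many windows are rolled out.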
Related papers
- Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models [80.28579390566298]
We introduce Interact2Ar, a text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. Hand kinematics are incorporated through dedicated parallel branches, enabling high-fidelity full-body generation. Our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios.
arXiv Detail & Related papers (2025-12-22T18:59:50Z)
- Diffusion Forcing for Multi-Agent Interaction Sequence Modeling [52.769202433667125]
MAGNet is a unified autoregressive diffusion framework for multi-agent motion generation. It supports a wide range of interaction tasks through flexible conditioning and sampling. It captures both tightly synchronized activities and loosely structured social interactions.
arXiv Detail & Related papers (2025-12-19T18:59:02Z)
- InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs [72.5651722107621]
InterAgent is an end-to-end framework for text-driven physics-based multi-agent humanoid control. We introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference. We also propose a novel interaction graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies.
arXiv Detail & Related papers (2025-12-08T10:46:01Z)
- Fine-grained text-driven dual-human motion generation via dynamic hierarchical interaction [31.055662466004254]
We propose a fine-grained dual-human motion generation method, namely FineDual, to model dynamic hierarchical interaction. The first stage, Self-Learning Stage, divides the dual-human overall text into individual texts. The second stage, Adaptive Adjustment Stage, predicts interaction distance using an interaction distance predictor. The last stage, Teacher-Guided Refinement Stage, utilizes overall text features as guidance to refine motion features at the overall level.
arXiv Detail & Related papers (2025-10-09T14:18:53Z)
- Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation [39.67266918328847]
We propose Text2Interact, a framework designed to generate realistic, text-aligned human-human interactions. We present InterCompose, a synthesis-by-composition pipeline that aligns interaction descriptions with strong single-person motion priors. We also propose InterActor, a text-to-interaction model with word-level conditioning that preserves token-level cues.
arXiv Detail & Related papers (2025-10-07T22:41:23Z)
- MoReact: Generating Reactive Motion from Textual Descriptions [57.642436102978245]
MoReact is a diffusion-based method designed to disentangle the generation of global trajectories and local motions sequentially. Our experiments, utilizing data adapted from a two-person motion dataset, demonstrate the efficacy of our approach.
arXiv Detail & Related papers (2025-09-28T14:31:41Z)
- ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction [84.90394416593624]
Agentic task-solving with Large Language Models (LLMs) requires multi-turn, multi-step interactions. Existing simulation-based data generation methods rely heavily on costly autoregressive interactions between multiple agents. We propose a novel Non-Autoregressive Iterative Generation framework, called ToolACE-MT, for constructing high-quality multi-turn agentic dialogues.
arXiv Detail & Related papers (2025-08-18T07:38:23Z)
- Auto-Regressive Diffusion for Generating 3D Human-Object Interactions [5.587507490937267]
A key challenge in HOI generation is maintaining interaction consistency in long sequences. We propose an autoregressive diffusion model (ARDHOI) that predicts the next continuous token. Our model has been evaluated on the OMOMO and BEHAVE datasets.
arXiv Detail & Related papers (2025-03-21T02:25:59Z)
- Two-in-One: Unified Multi-Person Interactive Motion Generation by Latent Diffusion Transformer [24.166147954731652]
Multi-person interactive motion generation is a critical yet under-explored domain in computer character animation. Current research often employs separate module branches for individual motions, leading to a loss of interaction information. We propose a novel, unified approach that models multi-person motions and their interactions within a single latent space.
arXiv Detail & Related papers (2024-12-21T15:35:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.