Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars
- URL: http://arxiv.org/abs/2602.01538v1
- Date: Mon, 02 Feb 2026 02:12:09 GMT
- Title: Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars
- Authors: Youliang Zhang, Zhengguang Zhou, Zhentao Yu, Ziyao Huang, Teng Hu, Sen Liang, Guozhen Zhang, Ziqiao Peng, Shunkai Li, Yi Chen, Zixiang Zhou, Yuan Zhou, Qinglin Lu, Xiu Li
- Abstract summary: Existing methods can generate full-body talking avatars with simple human motion, but extending this to grounded human-object interaction (GHOI) remains challenging. The challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. We propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction.
- Score: 32.76524805419984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: https://interactavatar.github.io
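The abstract describes the dual-stream design only at a high level. As a rough illustration, here is a minimal sketch of what parallel co-generation of motion and video through a motion-to-video aligner could look like; all module names, shapes, and the joint denoising loop are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of InteractAvatar-style dual-stream co-generation.
# Module names, dimensions, and the loop below are illustrative assumptions
# only; the paper's actual architecture is not reproduced here.
import torch
import torch.nn as nn

class PIM(nn.Module):
    """Stand-in for the Perception and Interaction Module: refines motion
    latents conditioned on text and detection-based scene embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 3, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, motion_latent, text_emb, det_emb):
        return self.net(torch.cat([motion_latent, text_emb, det_emb], dim=-1))

class AIM(nn.Module):
    """Stand-in for the Audio-Interaction Aware Generation Module: refines
    video latents conditioned on audio and the aligned motion stream."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 3, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, video_latent, audio_emb, aligned_motion):
        return self.net(torch.cat([video_latent, audio_emb, aligned_motion], dim=-1))

# Motion-to-video aligner: projects the motion stream into the video
# stream's latent space so both streams can be refined in lockstep.
aligner = nn.Linear(256, 256)
pim, aim = PIM(), AIM()

text_emb, det_emb, audio_emb = (torch.randn(1, 256) for _ in range(3))
motion_latent, video_latent = torch.randn(1, 256), torch.randn(1, 256)

for step in range(50):  # illustrative joint refinement loop
    motion_latent = pim(motion_latent, text_emb, det_emb)
    video_latent = aim(video_latent, audio_emb, aligner(motion_latent))
```

The point of the structure is that the motion stream (PIM) feeds the video stream (AIM) through the aligner at every step, so interaction control and video quality are generated together rather than planned once and then rendered, which is how the abstract frames the mitigation of the control-quality dilemma.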
Related papers
- JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning [18.72712280434528]
JoyAvatar is a framework capable of generating long-duration avatar videos. We introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text controllability. During training, we dynamically modulate the strength of multi-modal conditions.
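The summary does not say how the condition strengths are modulated. One speculative way to realize the idea is per-batch random scaling of the text and audio condition embeddings, sketched below (not JoyAvatar's actual scheme; all names are hypothetical):

```python
# Speculative illustration of dynamically modulating multi-modal condition
# strength during training (NOT JoyAvatar's actual algorithm): each batch
# gets independent random scales for the text and audio embeddings.
import torch

def modulate_conditions(text_emb: torch.Tensor, audio_emb: torch.Tensor):
    # Sample strengths in [0, 1]; a value near 0 amounts to dropping a
    # modality, similar in spirit to classifier-free-guidance dropout.
    w_text = torch.rand(()).item()
    w_audio = torch.rand(()).item()
    return w_text * text_emb, w_audio * audio_emb

text, audio = torch.randn(4, 256), torch.randn(4, 256)
text_c, audio_c = modulate_conditions(text, audio)
```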
arXiv Detail & Related papers (2026-01-31T13:00:57Z)
- Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation [71.38488610271247]
Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. Current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing.
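Diffusion forcing assigns each frame its own noise level, which is what enables streaming, real-time generation. A generic sketch of that idea with placeholder shapes (not Avatar Forcing's implementation):

```python
# Minimal sketch of the per-frame noise levels behind diffusion forcing.
# Shapes and the schedule are placeholder assumptions, not the paper's code.
import torch

T, D = 16, 64               # frames in the window, latent dim (assumed)
frames = torch.randn(T, D)  # clean avatar latents (placeholder data)

# Each frame gets its OWN noise level: older frames are nearly clean while
# the newest frames are still noisy, so the model can keep refining the
# tail of the sequence as new user input arrives.
noise_levels = torch.linspace(0.05, 1.0, T)          # per-frame sigma
noisy = frames + noise_levels[:, None] * torch.randn(T, D)

# A denoiser for this setup is conditioned on the per-frame levels, e.g.
#   pred = model(noisy, noise_levels, user_signal)
# refining the noisy tail while the clean history stays fixed.
```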
arXiv Detail & Related papers (2026-01-02T11:58:48Z) - VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification [65.15340059997273]
VHOI is a framework for creating realistic human-object interactions in video. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation.
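A hypothetical rendering of such a color-encoded motion map, with an assumed hue per motion source (the paper's actual encoding may differ):

```python
# Illustrative color-encoded HOI motion map (hypothetical, not VHOI's actual
# encoding): each body part and the object get a fixed color, and trajectory
# points are splatted into an RGB frame a video model can condition on.
import numpy as np

H, W = 64, 64
COLORS = {                        # assumed color assignment per motion source
    "left_hand": (255, 0, 0),
    "right_hand": (0, 255, 0),
    "object": (0, 0, 255),
}

def render_motion_map(trajectories: dict) -> np.ndarray:
    """trajectories: source name -> list of (x, y) pixel positions."""
    img = np.zeros((H, W, 3), dtype=np.uint8)
    for name, points in trajectories.items():
        for x, y in points:
            img[y, x] = COLORS[name]   # splat this source's color
    return img

demo = {"left_hand": [(10, 10), (12, 14)], "object": [(30, 30)]}
motion_map = render_motion_map(demo)   # (64, 64, 3) conditioning image
```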
arXiv Detail & Related papers (2025-12-10T13:40:24Z) - EAI-Avatar: Emotion-Aware Interactive Talking Head Generation [35.56554951482687]
We propose EAI-Avatar, a novel emotion-aware talking head generation framework for dyadic interactions. Our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states.
arXiv Detail & Related papers (2025-08-25T13:07:03Z) - SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents [91.26239311240873]
SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars. A key innovation is an autonomous verification loop, in which the agent renders and verifies draft avatars. The generated avatars are fully rigged and support pose manipulation with consistent identity and appearance.
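A hypothetical shape of such a render-and-verify loop; every function below is a placeholder stub, since the summary does not expose SmartAvatar's interfaces:

```python
# Hypothetical render-verify-revise loop in the spirit of the summary above.
# All functions are placeholder stubs, not SmartAvatar's real API.
def propose_parameters(prompt):        # stub: VLM drafts avatar parameters
    return {"height": 1.7, "revision": 0}

def render_draft(params):              # stub: would call a real renderer
    return f"render(rev={params['revision']})"

def vlm_critique(render, prompt):      # stub: VLM checks render vs. prompt
    return {"acceptable": "tall" not in prompt, "feedback": "increase height"}

def revise_parameters(params, feedback):
    return {**params, "height": params["height"] + 0.1,
            "revision": params["revision"] + 1}

def generate_avatar(prompt, max_rounds=3):
    params = propose_parameters(prompt)
    for _ in range(max_rounds):         # autonomous verification loop
        critique = vlm_critique(render_draft(params), prompt)
        if critique["acceptable"]:
            break
        params = revise_parameters(params, critique["feedback"])
    return params

print(generate_avatar("a tall rigged avatar"))
```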
arXiv Detail & Related papers (2025-06-05T03:49:01Z) - HUMOTO: A 4D Dataset of Mocap Human Object Interactions [41.19475872353592]
HUMOTO (Human Motions with Objects) is a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts. Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations.
arXiv Detail & Related papers (2025-04-14T16:59:29Z) - Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy [30.43930233035367]
We introduce the first unified physics-based HOI framework that leverages Vision-Language Models (VLMs). We introduce VLM-Guided Relative Movement Dynamics (RMD), a fine-grained temporal bipartite motion representation that automatically constructs goal states and reward functions for reinforcement learning. To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans.
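The summary says rewards are constructed automatically from goal states over relative motion. A speculative NumPy sketch of such a reward, with all quantities and the exponential shaping assumed for illustration (not the paper's actual formulation):

```python
# Speculative goal-state reward in the spirit of Relative Movement Dynamics
# (NOT the paper's actual reward): score how closely the current relative
# hand-object state matches a VLM-specified goal offset.
import numpy as np

def rmd_reward(hand_pos, obj_pos, goal_offset, scale=5.0):
    """Reward peaks at 1.0 when the hand-to-object offset matches the goal."""
    rel = hand_pos - obj_pos                  # relative movement state
    err = np.linalg.norm(rel - goal_offset)  # deviation from goal state
    return float(np.exp(-scale * err))       # shaped reward in (0, 1]

hand, obj = np.array([0.3, 0.9, 0.4]), np.array([0.3, 0.8, 0.4])
goal = np.array([0.0, 0.1, 0.0])   # e.g. "hand 10 cm above the object"
print(rmd_reward(hand, obj, goal))  # ~1.0: goal state satisfied
```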
arXiv Detail & Related papers (2025-03-24T05:18:04Z)
- Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration [28.825612240280822]
We propose a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control. Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions. We then incorporate egocentric visual context through parameter-efficient video-conditioned fine-tuning, enabling context-aware motion generation.
arXiv Detail & Related papers (2025-02-20T18:17:11Z)
- AnchorCrafter: Animate Cyber-Anchors Selling Your Products via Human-Object Interacting Video Generation [40.81246588724407]
The generation of anchor-style product promotion videos presents promising opportunities in e-commerce, advertising, and consumer engagement. We introduce AnchorCrafter, a novel diffusion-based system designed to generate 2D videos featuring a target human and a customized object. We propose two key innovations: HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives, and HOI-motion injection, which enables complex human-object interactions.
arXiv Detail & Related papers (2024-11-26T12:42:13Z)
- AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation [60.5897687447003]
AvatarGO is a novel framework designed to generate realistic 4D HOI scenes from textual inputs.
Our framework not only generates coherent compositional motions, but also exhibits greater robustness in handling penetration issues.
As the first attempt to synthesize 4D avatars with object interactions, we hope AvatarGO could open new doors for human-centric 4D content creation.
arXiv Detail & Related papers (2024-10-09T17:58:56Z)
- Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations [61.659439423703155]
TOHO: Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations.
Our method generates continuous motions that are parameterized only by the temporal coordinate.
This work takes a step further toward general human-scene interaction simulation.
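Motion parameterized only by the temporal coordinate is the standard implicit-neural-representation setup: a network maps continuous time t to a pose, so the motion can be sampled at any frame rate. A generic sketch with assumed sizes (not TOHO's architecture):

```python
# Generic implicit neural representation of motion: pose = f(t).
# Layer sizes and the pose dimensionality are assumptions for illustration.
import torch
import torch.nn as nn

pose_dim = 63                     # e.g. 21 joints x 3 rotation params (assumed)
motion_field = nn.Sequential(
    nn.Linear(1, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, pose_dim),
)

# Because the representation is continuous in t, the same trained network
# can be queried at any temporal resolution without re-training.
t = torch.linspace(0, 1, 30).unsqueeze(-1)  # 30 query timestamps in [0, 1]
poses = motion_field(t)                     # (30, pose_dim) pose sequence
```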
arXiv Detail & Related papers (2023-03-23T09:31:56Z)
- DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focus on single-person talking head generation.
We propose a novel unified framework based on neural radiance fields (NeRF) for face-to-face conversation video generation.
arXiv Detail & Related papers (2022-03-15T14:16:49Z)