Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration
- URL: http://arxiv.org/abs/2502.14795v2
- Date: Fri, 21 Feb 2025 08:09:14 GMT
- Title: Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration
- Authors: Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, Yuefan Wang, Huaicheng Zhou, Wenshuo Feng, Jiacheng Liu, Siteng Huang, Donglin Wang
- Abstract summary: We propose a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control. Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions. We then incorporate egocentric visual context through parameter-efficient video-conditioned fine-tuning, enabling context-aware motion generation.
- Score: 28.825612240280822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the limitations of current humanoid robot control frameworks, which primarily rely on reactive mechanisms and lack autonomous interaction capabilities due to data scarcity. We propose Humanoid-VLA, a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control. Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions, allowing the model to learn universal motion patterns and action semantics. We then incorporate egocentric visual context through parameter-efficient video-conditioned fine-tuning, enabling context-aware motion generation. Furthermore, we introduce a self-supervised data augmentation strategy that automatically generates pseudo-annotations derived directly from motion data. This process converts raw motion sequences into informative question-answer pairs, facilitating the effective use of large-scale unlabeled video data. Extensive experiments show that Humanoid-VLA, built upon whole-body control architectures, accomplishes object interaction and environment exploration tasks with enhanced contextual awareness, demonstrating a more human-like capacity for adaptive and intelligent engagement.
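The abstract describes turning raw motion sequences into question-answer pairs without manual labels. One way such a step could look is sketched below, assuming motion has already been quantized into integer tokens. The function name `make_infill_pair`, the `<mask>` placeholder, and the prompt wording are hypothetical illustrations, not the authors' implementation.

```python
import random
from typing import Dict, List, Optional

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def make_infill_pair(
    motion_tokens: List[int],
    span: int,
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Hide a contiguous span of a motion-token sequence and ask for it back.

    The hidden tokens serve as the answer, so the pair needs no human label.
    """
    rng = rng or random.Random()
    if not 0 < span <= len(motion_tokens):
        raise ValueError("span must be between 1 and len(motion_tokens)")

    # Pick where the hidden span starts, keeping it inside the sequence.
    start = rng.randrange(len(motion_tokens) - span + 1)

    # Replace the chosen span with mask placeholders in the prompt.
    masked = [str(t) for t in motion_tokens]
    masked[start:start + span] = [MASK] * span

    question = "Fill in the masked part of this motion: " + " ".join(masked)
    answer = " ".join(str(t) for t in motion_tokens[start:start + span])
    return {"question": question, "answer": answer, "start": start}


# Example with made-up token ids.
pair = make_infill_pair([12, 7, 7, 31, 4, 19], span=2, rng=random.Random(0))
print(pair["question"])
print(pair["answer"])
```

Because the answer is cut from the sequence itself, any unlabeled motion clip can yield training pairs, which is the property the abstract attributes to its augmentation strategy.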
Related papers
- Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions [110.43343503158306]
This paper embeds the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. Under this setting, we accomplish InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data. We establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis.
arXiv Detail & Related papers (2025-08-06T17:46:23Z) - DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge [56.3802428957899]
We propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling. DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. Experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks.
arXiv Detail & Related papers (2025-07-06T16:14:29Z) - HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation [26.23483219159567]
HunyuanVideo-HOMA is a weakly conditioned multimodal-driven framework. It encodes appearance and motion signals into the dual input space of a multimodal diffusion transformer. It synthesizes anatomically temporally consistent and physically plausible interactions.
arXiv Detail & Related papers (2025-06-10T13:45:00Z) - ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow [4.2766838326810355]
We present ViSA-Flow, a framework that learns pre-labeled representation from unsupervised large-scale video data. First, a generative-trained semantic action flow is automatically extracted from large-scale human-object interaction video data. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations processed through the same semantic abstraction pipeline.
arXiv Detail & Related papers (2025-05-02T14:03:06Z) - Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy [30.43930233035367]
We introduce the first unified physics-based HOI framework that leverages Vision-Language Models (VLMs). We introduce VLM-Guided Relative Movement Dynamics (RMD), a fine-temporal bipartite motion representation that automatically constructs goal states and reward functions for reinforcement learning. To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans.
arXiv Detail & Related papers (2025-03-24T05:18:04Z) - HA-VLN: A Benchmark for Human-Aware Navigation in Discrete-Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard [63.54109142085327]
Vision-and-Language Navigation (VLN) systems often focus on either discrete (panoramic) or continuous (free-motion) paradigms alone.
We introduce a unified Human-Aware VLN benchmark that merges these paradigms under explicit social-awareness constraints.
arXiv Detail & Related papers (2025-03-18T13:05:55Z) - InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video foundation models can act as both neural renderers and implicit physics simulators by learning interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z) - doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation [0.0]
doScenes is a novel dataset designed to facilitate research on human-vehicle instruction interactions. doScenes bridges the gap between instruction and driving response, enabling context-aware and adaptive planning.
arXiv Detail & Related papers (2024-12-08T11:16:47Z) - IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI [28.160367249993318]
Image-GOal Representations (IGOR) learns a unified, semantically consistent action space across humans and various robots.
IGOR enables knowledge transfer among large-scale robot and human activity data.
We believe IGOR opens new possibilities for human-to-robot knowledge transfer and control.
arXiv Detail & Related papers (2024-10-17T13:41:16Z) - InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint [67.6297384588837]
We introduce a novel controllable motion generation method, InterControl, to encourage synthesized motions to maintain the desired distance between joint pairs.
We demonstrate that the distance between joint pairs for human-wise interactions can be generated using an off-the-shelf Large Language Model.
arXiv Detail & Related papers (2023-11-27T14:32:33Z) - Object Motion Guided Human Motion Synthesis [22.08240141115053]
We study the problem of full-body human motion synthesis for the manipulation of large-sized objects.
We propose Object MOtion guided human MOtion synthesis (OMOMO), a conditional diffusion framework.
We develop a novel system that captures full-body human manipulation motions by simply attaching a smartphone to the object being manipulated.
arXiv Detail & Related papers (2023-09-28T08:22:00Z) - Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations [61.659439423703155]
TOHO: Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations.
Our method generates continuous motions that are parameterized only by the temporal coordinate.
This work takes a step further toward general human-scene interaction simulation.
arXiv Detail & Related papers (2023-03-23T09:31:56Z) - Narrator: Towards Natural Control of Human-Scene Interaction Generation via Relationship Reasoning [34.00107506891627]
We focus on naturally and controllably generating realistic and diverse HSIs from textual descriptions.
We propose Narrator, a novel relationship reasoning-based generative approach.
Our experiments and perceptual studies show that Narrator can controllably generate diverse interactions and significantly outperform existing works.
arXiv Detail & Related papers (2023-03-16T15:44:15Z) - GIMO: Gaze-Informed Human Motion Prediction in Context [75.52839760700833]
We propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, and ego-centric views with eye gaze.
Our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects.
To realize the full potential of gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches.
arXiv Detail & Related papers (2022-04-20T13:17:39Z) - Model Predictive Control for Fluid Human-to-Robot Handovers [50.72520769938633]
Planning motions that take human comfort into account is not a part of the human-robot handover process.
We propose to generate smooth motions via an efficient model-predictive control framework.
We conduct human-to-robot handover experiments on a diverse set of objects with several users.
arXiv Detail & Related papers (2022-03-31T23:08:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.