HumanoidVerse: A Versatile Humanoid for Vision-Language Guided Multi-Object Rearrangement
- URL: http://arxiv.org/abs/2508.16943v1
- Date: Sat, 23 Aug 2025 08:23:14 GMT
- Title: HumanoidVerse: A Versatile Humanoid for Vision-Language Guided Multi-Object Rearrangement
- Authors: Haozhuo Zhang, Jingkai Sun, Michele Caprio, Jian Tang, Shanghang Zhang, Qiang Zhang, Wei Pan
- Abstract summary: We introduce HumanoidVerse, a novel framework for vision-language guided humanoid control. HumanoidVerse supports consecutive manipulation of multiple objects, guided only by natural language instructions and egocentric camera RGB observations. Our work represents a key step toward robust, general-purpose humanoid agents capable of executing complex, sequential tasks under real-world sensory constraints.
- Score: 51.16740261131198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce HumanoidVerse, a novel framework for vision-language guided humanoid control that enables a single physically simulated robot to perform long-horizon, multi-object rearrangement tasks across diverse scenes. Unlike prior methods that operate in fixed settings with single-object interactions, our approach supports consecutive manipulation of multiple objects, guided only by natural language instructions and egocentric camera RGB observations. HumanoidVerse is trained via a multi-stage curriculum using a dual-teacher distillation pipeline, enabling fluid transitions between sub-tasks without requiring environment resets. To support this, we construct a large-scale dataset comprising 350 multi-object tasks spanning four room layouts. Extensive experiments in the Isaac Gym simulator demonstrate that our method significantly outperforms prior state-of-the-art in both task success rate and spatial precision, and generalizes well to unseen environments and instructions. Our work represents a key step toward robust, general-purpose humanoid agents capable of executing complex, sequential tasks under real-world sensory constraints. The video visualization results can be found on the project page: https://haozhuo-zhang.github.io/HumanoidVerse-project-page/.
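The abstract mentions a multi-stage curriculum with a dual-teacher distillation pipeline but gives no implementation details. As a minimal sketch, assuming two privileged teacher policies (one per sub-task phase) and a vision-language student that only sees egocentric RGB and the instruction, a single distillation step might look like the following; all names, tensor shapes, and the MSE imitation loss are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of a dual-teacher distillation step (names are illustrative,
# not from the HumanoidVerse paper). Two privileged teachers, each responsible for
# one sub-task phase, supervise a single student that only sees egocentric RGB
# frames and the encoded language instruction.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher_a, teacher_b, batch, optimizer, phase_mask):
    """One gradient step of dual-teacher distillation.

    batch["rgb"]   : egocentric camera frames, (B, C, H, W)
    batch["text"]  : encoded language instructions, (B, D)
    batch["state"] : privileged simulator state seen only by the teachers, (B, S)
    phase_mask     : (B,) boolean, True where teacher A supervises the sample
    """
    with torch.no_grad():
        target_a = teacher_a(batch["state"])          # (B, action_dim)
        target_b = teacher_b(batch["state"])          # (B, action_dim)
        # Pick the teacher responsible for the current sub-task of each sample.
        target = torch.where(phase_mask.unsqueeze(-1), target_a, target_b)

    pred = student(batch["rgb"], batch["text"])       # (B, action_dim)
    loss = F.mse_loss(pred, target)                   # behaviour-cloning style loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```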
Related papers
- EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models [31.768426199719816]
We propose the EgoActing task, which requires directly grounding high-level instructions into diverse, precise, spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that can predict locomotion primitives. We leverage broad supervision over egocentric RGB-only data from real-world demonstrations, spatial-reasoning question answering, and simulated environment demonstrations.
arXiv Detail & Related papers (2026-02-04T13:04:56Z) - Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation [90.90219129619344]
This paper presents R-prior-S, a novel Recurrent Geometric-prior Policy with Spiking features. To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases. To address the data-efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network.
arXiv Detail & Related papers (2026-01-13T23:36:30Z) - SceneFoundry: Generating Interactive Infinite 3D Worlds [22.60801815197924]
SceneFoundry is a language-guided diffusion framework that generates apartment-scale 3D worlds with functionally articulated furniture. Our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions.
arXiv Detail & Related papers (2026-01-09T14:33:10Z) - Is an object-centric representation beneficial for robotic manipulation ? [45.75998994869714]
Object-centric representation (OCR) has recently become a subject of interest in the computer vision community for learning structured representations of images and videos. We evaluate one classical object-centric method across several generalization scenarios and compare its results against several state-of-the-art holistic representations. Our results show that existing methods are prone to failure in difficult scenarios involving complex scene structures, whereas object-centric methods help overcome these challenges.
arXiv Detail & Related papers (2025-06-24T08:23:55Z) - You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations [38.835807227433335]
Bimanual robotic manipulation is a long-standing challenge in embodied intelligence. We propose YOTO, which can extract and then inject patterns of bimanual actions from just a single binocular observation. YOTO achieves impressive performance in mimicking 5 intricate long-horizon bimanual tasks.
arXiv Detail & Related papers (2025-01-24T03:26:41Z) - Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z) - InstructDiffusion: A Generalist Modeling Interface for Vision Tasks [52.981128371910266]
We present InstructDiffusion, a framework for aligning computer vision tasks with human instructions.
InstructDiffusion could handle a variety of vision tasks, including understanding tasks and generative tasks.
It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets.
arXiv Detail & Related papers (2023-09-07T17:56:57Z) - CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation [73.78984332354636]
CorNav is a novel zero-shot framework for vision-and-language navigation.
It incorporates environmental feedback for refining future plans and adjusting its actions.
It consistently outperforms all baselines in a zero-shot multi-task setting.
arXiv Detail & Related papers (2023-06-17T11:44:04Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
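As a hedged illustration of the goal-distance reward described above (not the authors' code), assuming a time-contrastively trained embedding network `phi`, the reward for an observation could simply be the negative embedding distance to a goal image:

```python
# Illustrative sketch only: reward as negative distance to the goal in a learned
# embedding space. `phi` is assumed to be an image encoder trained with a
# time-contrastive objective; names and shapes are hypothetical.
import torch

def goal_distance_reward(phi, obs_image, goal_image):
    """Reward = -||phi(obs) - phi(goal)||_2, computed without gradients."""
    with torch.no_grad():
        z_obs = phi(obs_image.unsqueeze(0))    # (1, D)
        z_goal = phi(goal_image.unsqueeze(0))  # (1, D)
        return -torch.norm(z_obs - z_goal, dim=-1).item()
```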
arXiv Detail & Related papers (2022-11-16T16:26:48Z) - Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of learning from a small amount of robot data together with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
arXiv Detail & Related papers (2021-03-31T05:25:05Z)
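As a minimal, hypothetical sketch of the same-task discriminator idea behind DVD (the architecture, feature dimensions, and names below are assumptions, not the released implementation): given features of two video clips, the discriminator predicts whether they show the same task, and at test time its same-task score against a human demonstration can serve as a reward for visual model predictive control.

```python
# Hypothetical sketch of a same-task video discriminator in the spirit of DVD.
# Names, dimensions, and architecture are assumptions, not the released code.
import torch
import torch.nn as nn

class SameTaskDiscriminator(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit for "same task" vs "different task"
        )

    def forward(self, video_feat_a: torch.Tensor, video_feat_b: torch.Tensor) -> torch.Tensor:
        # Concatenate the two clip features and score whether they depict the same task.
        return self.mlp(torch.cat([video_feat_a, video_feat_b], dim=-1)).squeeze(-1)

# Training would use a binary cross-entropy loss over same-task / different-task pairs:
# loss = nn.functional.binary_cross_entropy_with_logits(logits, same_task_labels)
```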