In-N-On: Scaling Egocentric Manipulation with in-the-wild and on-task Data
- URL: http://arxiv.org/abs/2511.15704v1
- Date: Wed, 19 Nov 2025 18:59:04 GMT
- Title: In-N-On: Scaling Egocentric Manipulation with in-the-wild and on-task Data
- Authors: Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu, Tianshu Huang, Xuxin Cheng, Xiaolong Wang
- Abstract summary: Egocentric videos are a valuable and scalable data source for learning manipulation policies. This paper first provides a scalable recipe for collecting and using egocentric data by categorizing human data into two categories: in-the-wild and on-task. We show that Human0 achieves several novel properties from scaling human data, including instruction following learned from human data alone.
- Score: 33.674143801589956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Egocentric videos are a valuable and scalable data source for learning manipulation policies. However, due to significant data heterogeneity, most existing approaches use human data only for simple pre-training, which does not unlock its full potential. This paper first provides a scalable recipe for collecting and using egocentric data by categorizing human data into two categories, in-the-wild and on-task, alongside a systematic analysis of how to use the data. We first curate a dataset, PHSD, which contains over 1,000 hours of diverse in-the-wild egocentric data and over 20 hours of on-task data directly aligned to the target manipulation tasks. This enables learning a large egocentric language-conditioned flow matching policy, Human0. With domain adaptation techniques, Human0 minimizes the gap between humans and humanoids. Empirically, we show that Human0 achieves several novel properties from scaling human data, including instruction following learned from human data alone, few-shot learning, and improved robustness using on-task data. Project website: https://xiongyicai.github.io/In-N-On/
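The abstract describes Human0 as a language-conditioned flow matching policy. As a rough illustration of what that training objective looks like, here is a minimal sketch, assuming a rectified-flow interpolation path and illustrative encoder dimensions; the module names and sizes are assumptions, not the paper's actual architecture:

```python
# Minimal sketch of a language-conditioned flow-matching action head in the
# spirit of Human0. All module names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FlowMatchingPolicy(nn.Module):
    def __init__(self, obs_dim=512, lang_dim=512, act_dim=32, hidden=1024):
        super().__init__()
        # Velocity network: predicts the flow from noise toward expert actions,
        # conditioned on visual features, language features, and flow time t.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + lang_dim + act_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, lang, noisy_act, t):
        return self.net(torch.cat([obs, lang, noisy_act, t], dim=-1))

def flow_matching_loss(policy, obs, lang, expert_act):
    # Linear interpolation between Gaussian noise and expert actions;
    # the regression target is their difference (rectified-flow form).
    noise = torch.randn_like(expert_act)
    t = torch.rand(expert_act.shape[0], 1)
    x_t = (1 - t) * noise + t * expert_act
    target_v = expert_act - noise
    pred_v = policy(obs, lang, x_t, t)
    return ((pred_v - target_v) ** 2).mean()
```

At inference time, such a policy would integrate the learned velocity field from noise to an action chunk over a few steps, conditioned on the current observation and instruction.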
Related papers
- Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation [16.701354625940308]
Humanoid Everyday is a large-scale and diverse humanoid manipulation dataset. It aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations. We conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations.
arXiv Detail & Related papers (2025-10-09T20:43:27Z)
- Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions [110.43343503158306]
This paper embeds the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor guided by egocentric vision and commands. Under this setting, we introduce InterVLA, the first large-scale human-object-human interaction dataset, with 11.4 hours and 1.2M frames of multimodal data. We establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis.
arXiv Detail & Related papers (2025-08-06T17:46:23Z)
- EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video [7.1221123957033905]
EgoDex is the largest and most diverse dataset of dexterous human manipulation to date. It has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks.
arXiv Detail & Related papers (2025-05-16T21:34:47Z)
- Humanoid Policy ~ Human Policy [41.34186233320398]
We train a human-humanoid behavior policy, which we term Human Action Transformer (HAT). The state-action space of HAT is unified for both humans and humanoid robots and can be differentiably retargeted to robot actions. We show that human data improves both generalization and robustness of HAT with significantly better data collection efficiency.
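HAT's unified, differentiably retargetable state-action space can be pictured with a small sketch. The mapping below (shared wrist pose for the end-effector, thumb-index aperture for the gripper width) is an illustrative assumption, not the paper's exact scheme:

```python
# Illustrative sketch of differentiable retargeting from a unified
# human/humanoid action space to a robot action. The specific mapping
# is an assumption for exposition.
import torch

def retarget_to_robot(wrist_pose, thumb_tip, index_tip, max_gripper_width=0.08):
    """wrist_pose: (..., 7) position + quaternion shared by human and robot;
    thumb_tip, index_tip: (..., 3) fingertip positions."""
    # End-effector pose is taken directly from the unified wrist pose.
    ee_pose = wrist_pose
    # Gripper width from fingertip aperture, clamped to the robot's range;
    # every op is differentiable, so gradients flow through retargeting.
    aperture = torch.norm(thumb_tip - index_tip, dim=-1, keepdim=True)
    gripper = aperture.clamp(0.0, max_gripper_width)
    return torch.cat([ee_pose, gripper], dim=-1)
```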
arXiv Detail & Related papers (2025-03-17T17:59:09Z)
- EgoMimic: Scaling Imitation Learning via Egocentric Video [22.902881956495765]
We present EgoMimic, a full-stack framework which scales manipulation via human embodiment data.
EgoMimic achieves this through: (1) a system to capture human embodiment data using the ergonomic Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, (3) cross-domain data alignment techniques, and (4) an imitation learning architecture that co-trains on human and robot data.
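In its simplest form, point (4), co-training on human and robot data, reduces to mixing a behavior-cloning loss over both sources. In this sketch the 50/50 weighting and MSE loss are assumptions rather than EgoMimic's reported configuration:

```python
# Hedged sketch of human/robot co-training: each step mixes one batch
# from each data source under a shared observation/action space.
import torch

def cotrain_step(policy, human_batch, robot_batch, optimizer, human_weight=0.5):
    optimizer.zero_grad()
    # Both batches share one observation/action space after retargeting,
    # so a single behavior-cloning loss applies to each.
    loss_h = ((policy(human_batch["obs"]) - human_batch["act"]) ** 2).mean()
    loss_r = ((policy(robot_batch["obs"]) - robot_batch["act"]) ** 2).mean()
    loss = human_weight * loss_h + (1 - human_weight) * loss_r
    loss.backward()
    optimizer.step()
    return loss.item()
```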
arXiv Detail & Related papers (2024-10-31T17:59:55Z)
- The BabyView dataset: High-resolution egocentric videos of infants' and young children's everyday experiences [8.952954042940368]
This dataset includes egocentric videos from children spanning 6 months to 3 years of age in longitudinal, at-home contexts. We train self-supervised language and vision models and evaluate their transfer to out-of-distribution tasks. Our dataset stands as an open challenge for robust, human-like AI systems.
arXiv Detail & Related papers (2024-06-14T23:52:27Z)
- Learning Human Action Recognition Representations Without Real Humans [66.61527869763819]
We present a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model.
We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks.
Our approach outperforms previous baselines by up to 5%.
arXiv Detail & Related papers (2023-11-10T18:38:14Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
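The structure-to-text idea can be sketched as: sample a target structure, then prompt an LLM with step-by-step instructions to verbalize it. The `call_llm` callable, event schema, and prompt wording below are hypothetical stand-ins, not STAR's actual prompts:

```python
# Hedged sketch of structure-to-text data generation in the style of STAR.
# `call_llm` is a hypothetical stand-in for any text-completion API.
import random

EVENT_TYPES = {"Attack": ["attacker", "target", "place"]}

def synthesize_instance(call_llm):
    event, roles = random.choice(list(EVENT_TYPES.items()))
    prompt = (
        f"Step 1: Invent a plausible '{event}' event with arguments "
        f"for roles {roles}.\n"
        "Step 2: Write one news-style sentence expressing that event.\n"
        "Step 3: Return the sentence and the role fillers as JSON."
    )
    return call_llm(prompt)  # parsed downstream into (text, structure) pairs
```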
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- Behavior Retrieval: Few-Shot Imitation Learning by Querying Unlabeled Datasets [73.2096288987301]
We propose a simple approach that uses a small amount of downstream expert data to selectively query relevant behaviors from an offline, unlabeled dataset.
We observe that our method learns to query only the transitions relevant to the task, filtering out sub-optimal or task-irrelevant data.
Our simple querying approach outperforms more complex goal-conditioned methods by 20% across simulated and real robotic manipulation tasks from images.
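The querying mechanism amounts to scoring each unlabeled transition by its similarity to the small expert set and keeping the closest matches. The embedding space and the 10% retention budget in this sketch are assumptions for illustration:

```python
# Minimal sketch of similarity-based behavior retrieval: embed a few expert
# transitions, score every unlabeled transition against them, and keep the
# top fraction for imitation learning.
import numpy as np

def retrieve_relevant(expert_emb, offline_emb, keep_frac=0.1):
    """expert_emb: (E, d), offline_emb: (N, d) L2-normalized embeddings."""
    # Each offline transition is scored by its best cosine similarity
    # to any expert transition.
    sims = offline_emb @ expert_emb.T          # (N, E)
    scores = sims.max(axis=1)                  # (N,)
    k = max(1, int(keep_frac * len(scores)))
    return np.argsort(-scores)[:k]             # indices of retained transitions
```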
arXiv Detail & Related papers (2023-04-18T05:42:53Z)
- Video-based Pose-Estimation Data as Source for Transfer Learning in Human Activity Recognition [71.91734471596433]
Human Activity Recognition (HAR) using on-body devices identifies specific human actions in unconstrained environments.
Previous works demonstrated that transfer learning is a good strategy for addressing scenarios with scarce data.
This paper proposes using datasets intended for human-pose estimation as a source for transfer learning.
arXiv Detail & Related papers (2022-12-02T18:19:36Z)
- What Matters in Learning from Offline Human Demonstrations for Robot Manipulation [64.43440450794495]
We conduct an extensive study of six offline learning algorithms for robot manipulation.
Our study analyzes the most critical challenges when learning from offline human data.
We highlight opportunities for learning from human datasets.
arXiv Detail & Related papers (2021-08-06T20:48:30Z)