EgoMe: A New Dataset and Challenge for Following Me via Egocentric View in Real World
- URL: http://arxiv.org/abs/2501.19061v2
- Date: Sun, 30 Mar 2025 02:44:43 GMT
- Title: EgoMe: A New Dataset and Challenge for Following Me via Egocentric View in Real World
- Authors: Heqian Qiu, Zhaofeng Shi, Lanxiao Wang, Huiyu Xiong, Xiang Li, Hongliang Li,
- Abstract summary: In human imitation learning, the imitator typically takes the egocentric view as a benchmark, naturally transferring behaviors observed from an exocentric view to their own. We introduce EgoMe, a dataset that follows the process of human imitation learning via the imitator's egocentric view in the real world. It includes 7902 paired exo-ego videos spanning diverse daily behaviors in various real-world scenarios.
- Score: 12.699670048897085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In human imitation learning, the imitator typically takes the egocentric view as a benchmark, naturally transferring behaviors observed from an exocentric view to their own, which provides inspiration for researching how robots can more effectively imitate human behavior. However, current research primarily focuses on the basic alignment issues of ego-exo data from different cameras, rather than collecting data from the imitator's perspective, which is inconsistent with this high-level cognitive process. To advance this research, we introduce a novel large-scale egocentric dataset, called EgoMe, which follows the process of human imitation learning via the imitator's egocentric view in the real world. Our dataset includes 7902 paired exo-ego videos (totaling 15,804 videos) spanning diverse daily behaviors in various real-world scenarios. For each video pair, one video captures an exocentric view of the imitator observing the demonstrator's actions, while the other captures an egocentric view of the imitator subsequently following those actions. Notably, EgoMe uniquely incorporates exo-ego eye gaze, additional multi-modal IMU sensor data, and annotations at different levels to assist in establishing correlations between the observing and imitating processes. We further provide a suite of challenging benchmarks for fully leveraging this data resource and promoting robot imitation learning research. Extensive analysis demonstrates significant advantages over existing datasets. Our EgoMe dataset and benchmarks are available at https://huggingface.co/datasets/HeqianQiu/EgoMe.
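The dataset is distributed via the Hugging Face Hub repository named in the abstract. Below is a minimal sketch of how one might download the snapshot and pair exocentric/egocentric clips by a shared clip id; the repository id comes from the paper, but the folder layout ("exo/", "ego/") and the .mp4 naming scheme are assumptions for illustration, not the dataset's documented structure.

```python
# Minimal sketch: fetch the EgoMe dataset snapshot and pair exo/ego videos.
# The repo id is from the abstract; the directory layout and file naming
# convention used here are assumptions for illustration only.
from pathlib import Path

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="HeqianQiu/EgoMe",  # dataset repository named in the abstract
    repo_type="dataset",
)

root = Path(local_dir)
# Hypothetical layout: exo/<clip_id>.mp4 = imitator observing the demonstrator,
#                      ego/<clip_id>.mp4 = imitator subsequently imitating.
exo_videos = {p.stem: p for p in (root / "exo").glob("*.mp4")}
ego_videos = {p.stem: p for p in (root / "ego").glob("*.mp4")}

pairs = [(exo_videos[k], ego_videos[k]) for k in sorted(exo_videos) if k in ego_videos]
print(f"Found {len(pairs)} paired exo-ego videos")
```

In practice the gaze, IMU, and annotation files would be joined on the same clip id, but their exact formats should be taken from the dataset card rather than this sketch.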
Related papers
- Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding [69.96199605596138]
Current MLLMs primarily focus on third-person (exocentric) vision, overlooking the unique aspects of first-person (egocentric) videos.
We propose learning the mapping between exocentric and egocentric domains to enhance egocentric video understanding.
We introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs.
arXiv Detail & Related papers (2025-03-12T08:10:33Z) - EgoMimic: Scaling Imitation Learning via Egocentric Video [22.902881956495765]
We present EgoMimic, a full-stack framework which scales manipulation via human embodiment data.
EgoMimic achieves this through: (1) a system to capture human embodiment data using the ergonomic Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, and (3) an imitation learning architecture that co-trains on human and robot data.
arXiv Detail & Related papers (2024-10-31T17:59:55Z) - Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning [80.37314291927889]
We present EMBED, a method designed to transform exocentric video-language data for egocentric video representation learning.
Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities.
By applying both vision and language style transfer, our framework creates a new egocentric dataset.
arXiv Detail & Related papers (2024-08-07T06:10:45Z) - EgoPet: Egomotion and Interaction Data from an Animal's Perspective [82.7192364237065]
We introduce a dataset of pet egomotion imagery with diverse examples of simultaneous egomotion and multi-agent interaction.
EgoPet offers a radically distinct perspective from existing egocentric datasets of humans or vehicles.
We define two in-domain benchmark tasks that capture animal behavior, and a third benchmark to assess the utility of EgoPet as a pretraining resource to robotic quadruped locomotion.
arXiv Detail & Related papers (2024-04-15T17:59:47Z) - EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World [44.34800426136217]
We introduce EgoExoLearn, a dataset that emulates the human demonstration following process.
EgoExoLearn contains egocentric and demonstration video data spanning 120 hours.
We present benchmarks such as cross-view association, cross-view action planning, and cross-view referenced skill assessment.
arXiv Detail & Related papers (2024-03-24T15:00:44Z) - EgoGen: An Egocentric Synthetic Data Generator [53.32942235801499]
EgoGen is a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks.
At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment.
We demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views.
arXiv Detail & Related papers (2024-01-16T18:55:22Z) - Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions.
arXiv Detail & Related papers (2024-01-01T15:31:06Z) - Ego-Body Pose Estimation via Ego-Head Pose Estimation [22.08240141115053]
Estimating 3D human motion from an egocentric video sequence plays a critical role in human behavior understanding and has various applications in VR/AR.
We propose a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), which decomposes the problem into two stages, connected by the head motion as an intermediate representation.
This disentanglement of head and body pose eliminates the need for training datasets with paired egocentric videos and 3D human motion.
arXiv Detail & Related papers (2022-12-09T02:25:20Z) - Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [92.38049744463149]
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets.
Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties.
Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models.
arXiv Detail & Related papers (2021-04-16T06:10:10Z)