Egocentric Prediction of Action Target in 3D
- URL: http://arxiv.org/abs/2203.13116v1
- Date: Thu, 24 Mar 2022 15:16:05 GMT
- Title: Egocentric Prediction of Action Target in 3D
- Authors: Yiming Li, Ziang Cao, Andrew Liang, Benjamin Liang, Luoyao Chen, Hang Zhao, Chen Feng
- Abstract summary: We propose a large multimodality dataset of more than 1 million frames of RGB-D and IMU streams to stimulate more research on this challenging egocentric vision task.
Our results demonstrate that this new task is worthy of further study by researchers in robotics, vision, and learning communities.
- Score: 17.99025294221712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We are interested in anticipating as early as possible the target location of
a person's object manipulation action in a 3D workspace from egocentric vision.
It is important in fields like human-robot collaboration, but has not yet
received enough attention from vision and learning communities. To stimulate
more research on this challenging egocentric vision task, we propose a large
multimodality dataset of more than 1 million frames of RGB-D and IMU streams,
and provide evaluation metrics based on our high-quality 2D and 3D labels from
semi-automatic annotation. Meanwhile, we design baseline methods using
recurrent neural networks and conduct various ablation studies to validate
their effectiveness. Our results demonstrate that this new task is worthy of
further study by researchers in robotics, vision, and learning communities.
Related papers
- SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories [124.24041272390954]
Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems.
We introduce SIGHT, a novel task focused on generating realistic and physically plausible 3D hand-object interaction trajectories from a single image.
We propose SIGHT-Fusion, a novel diffusion-based image-text conditioned generative model that tackles this task by retrieving the most similar 3D object mesh from a database.
arXiv Detail & Related papers (2025-03-28T20:53:20Z) - A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear.
We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck.
Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
arXiv Detail & Related papers (2025-03-10T06:18:31Z) - Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects [89.95728475983263]
Holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation.
We design the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits.
Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks.
arXiv Detail & Related papers (2024-03-25T05:12:21Z) - Egocentric RGB+Depth Action Recognition in Industry-Like Settings [50.38638300332429]
Our work focuses on recognizing actions from egocentric RGB and Depth modalities in an industry-like environment.
Our framework is based on the 3D Video SWIN Transformer to encode both RGB and Depth modalities effectively.
Our method also secured first place at the multimodal action recognition challenge at ICIAP 2023.
arXiv Detail & Related papers (2023-09-25T08:56:22Z) - Pedestrian Crossing Action Recognition and Trajectory Prediction with 3D Human Keypoints [25.550524178542833]
We propose a novel multi-task learning framework for pedestrian crossing action recognition and trajectory prediction.
We use 3D human keypoints extracted from raw sensor data to capture rich information on human pose and activity.
We show that our approach achieves state-of-the-art performance on a wide range of evaluation metrics.
arXiv Detail & Related papers (2023-06-01T18:27:48Z) - ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding [67.21613160846299]
Embodied Reference Understanding (ERU) is first designed for this concern.
A new dataset called ScanERU is constructed to evaluate the effectiveness of this idea.
arXiv Detail & Related papers (2023-03-23T11:36:14Z) - Surround-View Vision-based 3D Detection for Autonomous Driving: A Survey [0.6091702876917281]
We provide a literature survey of existing vision-based 3D detection methods, focused on autonomous driving.
We highlight how literature and industry trends have moved towards surround-view image-based methods and note which special cases these methods address.
arXiv Detail & Related papers (2023-02-13T19:30:17Z) - Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par with or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z) - UnrealEgo: A New Dataset for Robust Egocentric 3D Human Motion Capture [70.59984501516084]
UnrealEgo is a new large-scale naturalistic dataset for egocentric 3D human pose estimation.
It is based on an advanced concept of eyeglasses equipped with two fisheye cameras that can be used in unconstrained environments.
We propose a new benchmark method with a simple but effective idea of devising a 2D keypoint estimation module for stereo inputs to improve 3D human pose estimation.
arXiv Detail & Related papers (2022-08-02T17:59:54Z) - Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified and learning based approach to the 3D MOT problem.
We employ a Neural Message Passing network for data association that is fully trainable.
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
arXiv Detail & Related papers (2021-04-23T17:59:28Z) - Seeing by haptic glance: reinforcement learning-based 3D object Recognition [31.80213713136647]
Humans are able to perform 3D recognition from a limited number of haptic contacts between the target object and their fingers, without seeing the object.
This capability is defined as 'haptic glance' in cognitive neuroscience.
Most of the existing 3D recognition models were developed based on dense 3D data.
In many real-life use cases, where robots are used to collect 3D data by haptic exploration, only a limited number of 3D points could be collected.
A novel reinforcement learning based framework is proposed, where the haptic exploration procedure is optimized simultaneously with the objective 3D recognition with actively collected 3D
arXiv Detail & Related papers (2021-02-15T15:38:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.