Egocentric Prediction of Action Target in 3D
- URL: http://arxiv.org/abs/2203.13116v1
- Date: Thu, 24 Mar 2022 15:16:05 GMT
- Title: Egocentric Prediction of Action Target in 3D
- Authors: Yiming Li and Ziang Cao and Andrew Liang and Benjamin Liang and Luoyao Chen and Hang Zhao and Chen Feng
- Abstract summary: We propose a large multimodality dataset of more than 1 million frames of RGB-D and IMU streams to stimulate more research on this challenging egocentric vision task.
Our results demonstrate that this new task is worthy of further study by researchers in robotics, vision, and learning communities.
- Score: 17.99025294221712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We are interested in anticipating as early as possible the target location of
a person's object manipulation action in a 3D workspace from egocentric vision.
It is important in fields like human-robot collaboration, but has not yet
received enough attention from vision and learning communities. To stimulate
more research on this challenging egocentric vision task, we propose a large
multimodality dataset of more than 1 million frames of RGB-D and IMU streams,
and provide evaluation metrics based on our high-quality 2D and 3D labels from
semi-automatic annotation. Meanwhile, we design baseline methods using
recurrent neural networks and conduct various ablation studies to validate
their effectiveness. Our results demonstrate that this new task is worthy of
further study by researchers in robotics, vision, and learning communities.
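The baseline described in the abstract (recurrent networks consuming RGB-D and IMU streams to anticipate a 3D target location) can be illustrated with a minimal sketch. All dimensions, the randomly initialized weights, and the fusion-by-concatenation scheme below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: per-frame visual (RGB-D) feature and IMU reading.
FEAT_DIM, IMU_DIM, HIDDEN = 32, 6, 64

# Randomly initialized weights of a vanilla (Elman) RNN regressor.
W_in = rng.normal(scale=0.1, size=(HIDDEN, FEAT_DIM + IMU_DIM))
W_h = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
b_h = np.zeros(HIDDEN)
W_out = rng.normal(scale=0.1, size=(3, HIDDEN))  # regress an (x, y, z) target
b_out = np.zeros(3)

def predict_target(frames):
    """Run the RNN over a sequence of per-frame inputs and emit a 3D
    target-location estimate after every frame (early anticipation)."""
    h = np.zeros(HIDDEN)
    preds = []
    for x in frames:  # x: concatenated RGB-D feature and IMU reading
        h = np.tanh(W_in @ x + W_h @ h + b_h)
        preds.append(W_out @ h + b_out)
    return np.array(preds)

# Toy sequence of 10 frames of fused features.
seq = rng.normal(size=(10, FEAT_DIM + IMU_DIM))
preds = predict_target(seq)
print(preds.shape)  # one 3D estimate per observed frame
```

Emitting a prediction at every time step, rather than only at the end of the sequence, is what allows the target to be anticipated as early as possible and evaluated against the ground-truth 3D label at each frame.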
Related papers
- Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects [88.25603931962071]
A holistic 3D understanding of interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation.
We design the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits.
Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks.
arXiv Detail & Related papers (2024-03-25T05:12:21Z) - Generating Human-Centric Visual Cues for Human-Object Interaction Detection via Large Vision-Language Models [59.611697856666304]
Human-object interaction (HOI) detection aims at detecting human-object pairs and predicting their interactions.
We propose three prompts with VLM to generate human-centric visual cues within an image from multiple perspectives of humans.
We develop a transformer-based multimodal fusion module with multitower architecture to integrate visual cue features into the instance and interaction decoders.
arXiv Detail & Related papers (2023-11-26T09:11:32Z) - Pedestrian Crossing Action Recognition and Trajectory Prediction with 3D Human Keypoints [25.550524178542833]
We propose a novel multi-task learning framework for pedestrian crossing action recognition and trajectory prediction.
We use 3D human keypoints extracted from raw sensor data to capture rich information on human pose and activity.
We show that our approach achieves state-of-the-art performance on a wide range of evaluation metrics.
arXiv Detail & Related papers (2023-06-01T18:27:48Z) - ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding [67.21613160846299]
Embodied Reference Understanding (ERU) is formulated to address this concern.
A new dataset called ScanERU is constructed to evaluate the effectiveness of this idea.
arXiv Detail & Related papers (2023-03-23T11:36:14Z) - Surround-View Vision-based 3D Detection for Autonomous Driving: A Survey [0.6091702876917281]
We provide a literature survey of existing vision-based 3D detection methods, focusing on autonomous driving.
We highlight how literature and industry trends have moved toward surround-view image-based methods, and note which special cases these methods address.
arXiv Detail & Related papers (2023-02-13T19:30:17Z) - Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z) - UnrealEgo: A New Dataset for Robust Egocentric 3D Human Motion Capture [70.59984501516084]
UnrealEgo is a new large-scale naturalistic dataset for egocentric 3D human pose estimation.
It is based on an advanced concept of eyeglasses equipped with two fisheye cameras that can be used in unconstrained environments.
We propose a new benchmark method with a simple but effective idea of devising a 2D keypoint estimation module for stereo inputs to improve 3D human pose estimation.
arXiv Detail & Related papers (2022-08-02T17:59:54Z) - Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified, learning-based approach to the 3D MOT problem.
We employ a Neural Message Passing network for data association that is fully trainable.
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
arXiv Detail & Related papers (2021-04-23T17:59:28Z) - Seeing by Haptic Glance: Reinforcement Learning-based 3D Object Recognition [31.80213713136647]
Humans are able to recognize 3D objects through a limited number of haptic contacts between the target object and their fingers, without seeing the object.
This capability is defined as 'haptic glance' in cognitive neuroscience.
Most existing 3D recognition models were developed based on dense 3D data.
In many real-life use cases, where robots collect 3D data by haptic exploration, only a limited number of 3D points can be collected.
A novel reinforcement-learning-based framework is proposed, in which the haptic exploration procedure is optimized simultaneously with the objective of 3D recognition using actively collected 3D
arXiv Detail & Related papers (2021-02-15T15:38:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.