Egocentric Human Trajectory Forecasting with a Wearable Camera and
Multi-Modal Fusion
- URL: http://arxiv.org/abs/2111.00993v2
- Date: Thu, 4 Nov 2021 13:52:50 GMT
- Title: Egocentric Human Trajectory Forecasting with a Wearable Camera and
Multi-Modal Fusion
- Authors: Jianing Qiu, Lipeng Chen, Xiao Gu, Frank P.-W. Lo, Ya-Yen Tsai,
Jiankai Sun, Jiaqi Liu and Benny Lo
- Abstract summary: We address the problem of forecasting the trajectory of an egocentric camera wearer (ego-person) in crowded spaces.
The trajectory forecasting ability learned from the data of different camera wearers can be transferred to assist visually impaired people in navigation.
A Transformer-based encoder-decoder neural network model, integrated with a novel cascaded cross-attention mechanism has been designed to predict the future trajectory of the camera wearer.
- Score: 24.149925005674145
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we address the problem of forecasting the trajectory of an
egocentric camera wearer (ego-person) in crowded spaces. The trajectory
forecasting ability learned from the data of different camera wearers walking
around in the real world can be transferred to assist visually impaired people
in navigation, as well as to instill human navigation behaviours in mobile
robots, enabling better human-robot interactions. To this end, a novel
egocentric human trajectory forecasting dataset was constructed, containing
real trajectories of people navigating in crowded spaces wearing a camera, as
well as extracted rich contextual data. We extract and utilize three different
modalities to forecast the trajectory of the camera wearer: his/her past
trajectory, the past trajectories of nearby people, and environmental context
such as scene semantics and scene depth. A Transformer-based
encoder-decoder neural network model, integrated with a novel cascaded
cross-attention mechanism that fuses multiple modalities, has been designed to
predict the future trajectory of the camera wearer. Extensive experiments have
been conducted, and the results have shown that our model outperforms the
state-of-the-art methods in egocentric human trajectory forecasting.
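The cascaded cross-attention fusion is described only at a high level in the abstract. The sketch below is a minimal, hypothetical illustration of the idea: ego-trajectory tokens first attend to nearby-people tokens, and the fused result then attends to environment (scene semantics / depth) tokens. Module names, layer sizes, token counts, and the cascade order are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CascadedCrossAttentionFusion(nn.Module):
    """Illustrative two-stage fusion: ego features attend to neighbour
    features, then the result attends to environment features."""

    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn_neighbors = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_environment = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, ego_feats, neighbor_feats, env_feats):
        # Stage 1: ego-trajectory tokens query the nearby people's trajectory tokens.
        x, _ = self.attn_neighbors(ego_feats, neighbor_feats, neighbor_feats)
        x = self.norm1(ego_feats + x)
        # Stage 2: the fused tokens query the environment (semantics/depth) tokens.
        y, _ = self.attn_environment(x, env_feats, env_feats)
        return self.norm2(x + y)

# Example shapes (all hypothetical): the fused features would feed a
# Transformer decoder that predicts the wearer's future positions.
ego = torch.randn(2, 8, 128)     # (batch, past time steps, d_model)
nbrs = torch.randn(2, 20, 128)   # tokens for nearby people's past trajectories
env = torch.randn(2, 196, 128)   # tokens from scene semantic / depth maps
fused = CascadedCrossAttentionFusion()(ego, nbrs, env)
print(fused.shape)               # torch.Size([2, 8, 128])
```

Cascading the attention in two stages lets each modality be fused with its own query-key interaction rather than concatenating all tokens into a single attention pass; the particular stage ordering shown here is an assumption.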
Related papers
- EgoNav: Egocentric Scene-aware Human Trajectory Prediction [15.346096596482857]
Wearable collaborative robots stand to assist human wearers who need fall prevention or who wear exoskeletons.
Such a robot needs to be able to constantly adapt to the surrounding scene based on egocentric vision, and predict the ego motion of the wearer.
In this work, we leveraged body-mounted cameras and sensors to anticipate the trajectory of human wearers through complex surroundings.
arXiv Detail & Related papers (2024-03-27T21:43:12Z) - Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior.
Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z) - Robots That Can See: Leveraging Human Pose for Trajectory Prediction [30.919756497223343]
We present a Transformer based architecture to predict human future trajectories in human-centric environments.
The resulting model captures the inherent uncertainty for future human trajectory prediction.
We identify new agents with limited historical data as a major contributor to error and demonstrate the complementary nature of 3D skeletal poses in reducing prediction error.
arXiv Detail & Related papers (2023-09-29T13:02:56Z) - Action-conditioned Deep Visual Prediction with RoAM, a new Indoor Human
Motion Dataset for Autonomous Robots [1.7778609937758327]
We introduce the Robot Autonomous Motion (RoAM) video dataset.
It is collected with a custom-made turtlebot3 Burger robot in a variety of indoor environments recording various human motions from the robot's ego-vision.
The dataset also includes synchronized records of the LiDAR scan and all control actions taken by the robot as it navigates around static and moving human agents.
arXiv Detail & Related papers (2023-06-28T00:58:44Z) - COPILOT: Human-Environment Collision Prediction and Localization from
Egocentric Videos [62.34712951567793]
The ability to forecast human-environment collisions from egocentric observations is vital to enable collision avoidance in applications such as VR, AR, and wearable assistive robotics.
We introduce the challenging problem of predicting collisions in diverse environments from multi-view egocentric videos captured from body-mounted cameras.
We propose a transformer-based model called COPILOT to perform collision prediction and localization simultaneously.
arXiv Detail & Related papers (2022-10-04T17:49:23Z) - Neural Scene Representation for Locomotion on Structured Terrain [56.48607865960868]
We propose a learning-based method to reconstruct the local terrain for a mobile robot traversing urban environments.
Using a stream of depth measurements from the onboard cameras and the robot's trajectory, the method estimates the topography in the robot's vicinity.
We propose a 3D reconstruction model that faithfully reconstructs the scene, despite the noisy measurements and large amounts of missing data coming from the blind spots of the camera arrangement.
arXiv Detail & Related papers (2022-06-16T10:45:17Z) - GIMO: Gaze-Informed Human Motion Prediction in Context [75.52839760700833]
We propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, and ego-centric views with eye gaze.
Our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects.
To realize the full potential of gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches.
arXiv Detail & Related papers (2022-04-20T13:17:39Z) - GLAMR: Global Occlusion-Aware Human Mesh Recovery with Dynamic Cameras [99.07219478953982]
We present an approach for 3D global human mesh recovery from monocular videos recorded with dynamic cameras.
We first propose a deep generative motion infiller, which autoregressively infills the body motions of occluded humans based on visible motions.
In contrast to prior work, our approach reconstructs human meshes in consistent global coordinates even with dynamic cameras.
arXiv Detail & Related papers (2021-12-02T18:59:54Z) - AC-VRNN: Attentive Conditional-VRNN for Multi-Future Trajectory
Prediction [30.61190086847564]
We propose a generative architecture for multi-future trajectory predictions based on Conditional Variational Recurrent Neural Networks (C-VRNNs).
Human interactions are modeled with a graph-based attention mechanism enabling an online attentive hidden state refinement of the recurrent estimation.
arXiv Detail & Related papers (2020-05-17T17:21:23Z) - Visual Navigation Among Humans with Optimal Control as a Supervisor [72.5188978268463]
We propose an approach that combines learning-based perception with model-based optimal control to navigate among humans.
Our approach is enabled by our novel data-generation tool, HumANav.
We demonstrate that the learned navigation policies can anticipate and react to humans without explicitly predicting future human motion.
arXiv Detail & Related papers (2020-03-20T16:13:47Z)