COPILOT: Human-Environment Collision Prediction and Localization from Egocentric Videos
- URL: http://arxiv.org/abs/2210.01781v2
- Date: Sun, 26 Mar 2023 05:27:31 GMT
- Authors: Boxiao Pan, Bokui Shen, Davis Rempe, Despoina Paschalidou, Kaichun Mo,
Yanchao Yang, Leonidas J. Guibas
- Abstract summary: The ability to forecast human-environment collisions from egocentric observations is vital to enable collision avoidance in applications such as VR, AR, and wearable assistive robotics.
We introduce the challenging problem of predicting collisions in diverse environments from multi-view egocentric videos captured from body-mounted cameras.
We propose a transformer-based model called COPILOT to perform collision prediction and localization simultaneously.
- Score: 62.34712951567793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to forecast human-environment collisions from egocentric
observations is vital to enable collision avoidance in applications such as VR,
AR, and wearable assistive robotics. In this work, we introduce the challenging
problem of predicting collisions in diverse environments from multi-view
egocentric videos captured from body-mounted cameras. Solving this problem
requires a generalizable perception system that can classify which human body
joints will collide and estimate a collision region heatmap to localize
collisions in the environment. To achieve this, we propose a transformer-based
model called COPILOT to perform collision prediction and localization
simultaneously, which accumulates information across multi-view inputs through
a novel 4D space-time-viewpoint attention mechanism. To train our model and
enable future research on this task, we develop a synthetic data generation
framework that produces egocentric videos of virtual humans moving and
colliding within diverse 3D environments. This framework is then used to
establish a large-scale dataset consisting of 8.6M egocentric RGBD frames.
Extensive experiments show that COPILOT generalizes to unseen synthetic as well
as real-world scenes. We further demonstrate COPILOT outputs are useful for
downstream collision avoidance through simple closed-loop control. Please visit
our project webpage at https://sites.google.com/stanford.edu/copilot.
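The abstract names a 4D space-time-viewpoint attention mechanism feeding joint collision-prediction and localization heads, but gives no implementation details. The PyTorch sketch below is only an illustration of that idea: the flattened single-block attention, token shapes, joint count, and heatmap resolution are all assumptions, not the authors' released code.

```python
# Minimal sketch of joint attention over viewpoint, time, and space.
# Shapes, names, and the single-block design are illustrative assumptions.
import torch
import torch.nn as nn

class SpaceTimeViewpointAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, views, time, patches, dim) -- per-frame patch tokens
        b, v, t, p, d = x.shape
        tokens = x.reshape(b, v * t * p, d)   # flatten the 4D token grid
        q = self.norm(tokens)
        attended, _ = self.attn(q, q, q)      # joint attention over all axes
        tokens = tokens + attended            # residual connection
        return tokens.reshape(b, v, t, p, d)

# Two task heads, mirroring the paper's joint formulation: per-joint
# collision classification and a collision-region heatmap.
class CollisionHeads(nn.Module):
    def __init__(self, dim=256, num_joints=5, heatmap_hw=(32, 32)):
        super().__init__()
        self.joint_head = nn.Linear(dim, num_joints)  # which joints collide
        self.heatmap_head = nn.Linear(dim, heatmap_hw[0] * heatmap_hw[1])
        self.hw = heatmap_hw

    def forward(self, tokens):
        pooled = tokens.mean(dim=(1, 2, 3))   # pool over views, time, patches
        joint_logits = self.joint_head(pooled)
        heatmap = self.heatmap_head(pooled).view(-1, *self.hw)
        return joint_logits, heatmap

x = torch.randn(2, 4, 8, 16, 256)   # 4 views, 8 frames, 16 patches each
feats = SpaceTimeViewpointAttention()(x)
logits, heatmap = CollisionHeads()(feats)
print(logits.shape, heatmap.shape)  # torch.Size([2, 5]) torch.Size([2, 32, 32])
```

In practice a factorized variant (attending over views, time, and space in separate stages) would scale better than the fully joint attention shown here; the sketch only conveys the information flow.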
Related papers
- Human-Aware 3D Scene Generation with Spatially-constrained Diffusion Models [16.259040755335885]
Previous autoregressive 3D scene generation methods struggle to accurately capture the joint distribution of multiple objects and input humans.
We introduce two spatial collision guidance mechanisms: human-object collision avoidance and object-room boundary constraints.
Our framework can generate more natural and plausible 3D scenes with precise human-scene interactions.
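As a loose illustration of how such spatial collision guidance could enter a diffusion sampler, the sketch below nudges object placements down the gradient of a collision penalty at each denoising step. The penalty terms, guidance scale, and the `denoise_fn` hook are hypothetical, not the paper's formulation.

```python
# Hypothetical spatially-constrained guidance: at each denoising step,
# push object positions down the gradient of a collision penalty.
import torch

def collision_penalty(obj_pos, human_pos, room_min, room_max, radius=0.5):
    # Human-object avoidance: penalize objects inside a sphere around the human.
    d = torch.linalg.norm(obj_pos - human_pos, dim=-1)
    overlap = torch.clamp(radius - d, min=0.0).pow(2).sum()
    # Room-boundary constraint: penalize positions outside the room box.
    outside = (torch.clamp(room_min - obj_pos, min=0.0).pow(2)
               + torch.clamp(obj_pos - room_max, min=0.0).pow(2)).sum()
    return overlap + outside

def guided_step(x_t, denoise_fn, human_pos, room_min, room_max, scale=1.0):
    x_t = x_t.detach().requires_grad_(True)
    loss = collision_penalty(x_t, human_pos, room_min, room_max)
    grad = torch.autograd.grad(loss, x_t)[0]
    # Denoise as usual, then apply the guidance gradient.
    return denoise_fn(x_t.detach()) - scale * grad
```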
arXiv Detail & Related papers (2024-06-26T08:18:39Z)
- EgoNav: Egocentric Scene-aware Human Trajectory Prediction [15.346096596482857]
Wearable collaborative robots could assist wearers who need fall-prevention support or use exoskeletons.
Such a robot must continually adapt to the surrounding scene from egocentric vision and predict the wearer's ego-motion.
In this work, we leverage body-mounted cameras and sensors to anticipate the trajectory of human wearers through complex surroundings.
arXiv Detail & Related papers (2024-03-27T21:43:12Z)
- NeuPAN: Direct Point Robot Navigation with End-to-End Model-based Learning [69.5499199656623]
This paper presents NeuPAN: a real-time, highly accurate, robot-agnostic, and environment-invariant robot navigation solution.
Leveraging a tightly coupled perception-locomotion framework, NeuPAN has two key innovations compared to existing approaches.
We evaluate NeuPAN on a car-like robot, a wheel-legged robot, and a passenger autonomous vehicle, in both simulated and real-world environments.
arXiv Detail & Related papers (2024-03-11T15:44:38Z)
- Robots That Can See: Leveraging Human Pose for Trajectory Prediction [30.919756497223343]
We present a Transformer-based architecture to predict human future trajectories in human-centric environments.
The resulting model captures the inherent uncertainty for future human trajectory prediction.
We identify new agents with limited historical data as a major contributor to error and demonstrate the complementary nature of 3D skeletal poses in reducing prediction error.
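A minimal sketch of this idea follows: pose features are fused with the trajectory history in a Transformer encoder that outputs a Gaussian (mean and log-variance) per future step to expose predictive uncertainty. The architecture, joint count, and shapes are guesses, not the paper's model.

```python
# Rough sketch of pose-augmented trajectory forecasting with an
# uncertainty output (a Gaussian per future step). Details are assumptions.
import torch
import torch.nn as nn

class PoseTrajectoryForecaster(nn.Module):
    def __init__(self, dim=128, horizon=12, num_joints=17):
        super().__init__()
        self.traj_embed = nn.Linear(2, dim)               # past (x, y) positions
        self.pose_embed = nn.Linear(num_joints * 3, dim)  # flattened 3D skeleton
        enc = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.head = nn.Linear(dim, horizon * 4)  # mean + log-variance per step
        self.horizon = horizon

    def forward(self, past_xy, poses):
        # past_xy: (batch, T, 2); poses: (batch, T, num_joints*3)
        tokens = self.traj_embed(past_xy) + self.pose_embed(poses)
        enc = self.encoder(tokens).mean(dim=1)
        out = self.head(enc).view(-1, self.horizon, 4)
        mean, log_var = out[..., :2], out[..., 2:]  # predictive uncertainty
        return mean, log_var
```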
arXiv Detail & Related papers (2023-09-29T13:02:56Z)
- CabiNet: Scaling Neural Collision Detection for Object Rearrangement with Procedural Scene Generation [54.68738348071891]
We first generate over 650K cluttered scenes, orders of magnitude more than prior work, in diverse everyday environments.
We render synthetic partial point clouds from this data and use them to train our CabiNet model architecture.
CabiNet is a collision model that accepts object and scene point clouds, captured from a single-view depth observation.
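A toy sketch of such a collision classifier is shown below: a tiny PointNet-style encoder for each cloud plus a query pose feeding a binary head. Layer sizes, the pose encoding, and class names are placeholders rather than the actual CabiNet architecture.

```python
# Illustrative point-cloud collision classifier: encode scene and object
# clouds, concatenate with a query pose, predict collision probability.
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Tiny PointNet-style encoder: per-point MLP followed by max pooling."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, pts):                 # pts: (batch, n_points, 3)
        return self.mlp(pts).max(dim=1).values

class CollisionNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.scene_enc = PointEncoder(dim)
        self.obj_enc = PointEncoder(dim)
        self.cls = nn.Sequential(nn.Linear(2 * dim + 7, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, scene_pts, obj_pts, pose):
        # pose: (batch, 7) = translation (3) + quaternion (4) of the query object
        feat = torch.cat([self.scene_enc(scene_pts),
                          self.obj_enc(obj_pts), pose], dim=-1)
        return torch.sigmoid(self.cls(feat))   # collision probability
```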
arXiv Detail & Related papers (2023-04-18T21:09:55Z)
- GLAMR: Global Occlusion-Aware Human Mesh Recovery with Dynamic Cameras [99.07219478953982]
We present an approach for 3D global human mesh recovery from monocular videos recorded with dynamic cameras.
We first propose a deep generative motion infiller, which autoregressively infills the body motions of occluded humans based on visible motions.
In contrast to prior work, our approach reconstructs human meshes in consistent global coordinates even with dynamic cameras.
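A bare-bones illustration of the autoregressive infilling step: occluded frames are replaced by predictions conditioned on the history filled so far. The one-step predictor here is a stand-in, not the paper's generative infiller.

```python
# Loose sketch of autoregressive motion infilling over a pose sequence.
import torch

def infill_motion(poses, visible, predict_next):
    # poses: (T, D) pose parameters; visible: (T,) bool mask
    # predict_next: callable mapping the history (t, D) -> next pose (D,)
    filled = poses.clone()
    for t in range(1, poses.shape[0]):
        if not visible[t]:
            filled[t] = predict_next(filled[:t])   # infill from past context
    return filled
```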
arXiv Detail & Related papers (2021-12-02T18:59:54Z)
- Egocentric Human Trajectory Forecasting with a Wearable Camera and Multi-Modal Fusion [24.149925005674145]
We address the problem of forecasting the trajectory of an egocentric camera wearer (ego-person) in crowded spaces.
The trajectory forecasting ability learned from the data of different camera wearers can be transferred to assist visually impaired people in navigation.
A Transformer-based encoder-decoder model, integrated with a novel cascaded cross-attention mechanism, is designed to predict the future trajectory of the camera wearer.
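One plausible reading of a cascaded cross-attention step is sketched below: the output of attending to one feature stream becomes the query for the next stream. The two-stage cascade and the modality names are assumptions, not the paper's code.

```python
# Speculative sketch of cascaded cross-attention across two feature streams.
import torch
import torch.nn as nn

class CascadedCrossAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_motion = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query, visual_feats, motion_feats):
        # query: (batch, T_future, dim); features: (batch, T_past, dim)
        q1, _ = self.attn_visual(query, visual_feats, visual_feats)
        q2, _ = self.attn_motion(q1, motion_feats, motion_feats)  # cascade
        return q2
```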
arXiv Detail & Related papers (2021-11-01T14:58:05Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible or invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
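The visibility indicator suggests a masked training objective; the sketch below pairs a per-joint visibility loss with a pose loss that only counts visible joints. The loss weighting is an assumption, not TRiPOD's actual objective.

```python
# Sketch of a visibility-masked pose loss with an auxiliary indicator loss.
import torch
import torch.nn.functional as F

def masked_pose_loss(pred_pose, gt_pose, vis_logit, gt_visible):
    # pred_pose, gt_pose: (batch, joints, 3)
    # vis_logit: (batch, joints) predicted logits
    # gt_visible: (batch, joints) float mask in {0, 1}
    vis_loss = F.binary_cross_entropy_with_logits(vis_logit, gt_visible)
    per_joint = (pred_pose - gt_pose).pow(2).sum(-1)     # squared joint error
    pose_loss = (per_joint * gt_visible).sum() / gt_visible.sum().clamp(min=1)
    return pose_loss + vis_loss
```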
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- Object Rearrangement Using Learned Implicit Collision Functions [61.90305371998561]
We propose a learned collision model that accepts scene and query object point clouds and predicts collisions for 6DOF object poses within the scene.
We leverage the learned collision model as part of a model predictive path integral (MPPI) policy in a tabletop rearrangement task.
The learned model outperforms both traditional pipelines and learned ablations by 9.8% in accuracy on a dataset of simulated collision queries.
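Below is a compact sketch of how a learned collision cost can slot into an MPPI-style controller; the integrator dynamics and the stand-in cost are toy placeholders, not the paper's learned model or policy.

```python
# Minimal MPPI-style rollout scoring with a (stand-in) learned collision cost.
import torch

def mppi_action(state, collision_cost, n_samples=256, horizon=10,
                sigma=0.1, temperature=1.0):
    # Sample action sequences, roll out toy integrator dynamics (x += u),
    # and weight them by exponentiated negative cost.
    actions = sigma * torch.randn(n_samples, horizon, state.shape[-1])
    states = state + actions.cumsum(dim=1)            # rollout trajectories
    costs = collision_cost(states).sum(dim=1)         # per-rollout total cost
    weights = torch.softmax(-costs / temperature, dim=0)
    return (weights[:, None, None] * actions).sum(dim=0)[0]  # first action

# Example with a stand-in cost: penalize proximity to an obstacle at origin.
cost = lambda s: 1.0 / (s.norm(dim=-1) + 0.1)
a = mppi_action(torch.tensor([1.0, 1.0]), cost)
print(a)
```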
arXiv Detail & Related papers (2020-11-21T05:36:06Z)