JointMotion: Joint Self-Supervision for Joint Motion Prediction
- URL: http://arxiv.org/abs/2403.05489v2
- Date: Wed, 23 Oct 2024 16:39:15 GMT
- Title: JointMotion: Joint Self-Supervision for Joint Motion Prediction
- Authors: Royden Wagner, Omer Sahin Tas, Marvin Klemp, Carlos Fernandez
- Abstract summary: JointMotion is a self-supervised pre-training method for joint motion prediction in self-driving vehicles.
Our method reduces the joint final displacement error of Wayformer, HPTR, and Scene Transformer models by 3%, 8%, and 12%, respectively.
- Score: 10.44846560021422
- Abstract: We present JointMotion, a self-supervised pre-training method for joint motion prediction in self-driving vehicles. Our method jointly optimizes a scene-level objective connecting motion and environments, and an instance-level objective to refine learned representations. Scene-level representations are learned via non-contrastive similarity learning of past motion sequences and environment context. At the instance level, we use masked autoencoding to refine multimodal polyline representations. We complement this with an adaptive pre-training decoder that enables JointMotion to generalize across different environment representations, fusion mechanisms, and dataset characteristics. Notably, our method reduces the joint final displacement error of Wayformer, HPTR, and Scene Transformer models by 3%, 8%, and 12%, respectively; and enables transfer learning between the Waymo Open Motion and the Argoverse 2 Motion Forecasting datasets. Code: https://github.com/kit-mrt/future-motion
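The two objectives read more concretely in code. Below is a minimal, hypothetical PyTorch sketch of such a joint pre-training loss, assuming toy encoders, a cosine-similarity form of the scene-level objective, and an arbitrary 0.5 loss weight; the paper's actual architectures, collapse-prevention details, and adaptive pre-training decoder are not reproduced here.

```python
# Hypothetical sketch of a JointMotion-style joint pre-training objective.
import torch
import torch.nn.functional as F
from torch import nn

class JointPretrainSketch(nn.Module):
    def __init__(self, d=128, polyline_dim=4):
        super().__init__()
        self.motion_enc = nn.GRU(2, d, batch_first=True)  # encodes past (x, y) sequences
        self.env_enc = nn.Sequential(nn.Linear(polyline_dim, d), nn.ReLU(), nn.Linear(d, d))
        self.mae_dec = nn.Linear(d, polyline_dim)         # reconstructs masked polyline points

    def forward(self, past_xy, polylines, mask):
        # Scene-level: non-contrastive similarity between pooled motion and
        # environment embeddings (collapse-prevention tricks omitted here).
        _, h = self.motion_enc(past_xy)                   # h: (layers, B, d)
        z_motion = h[-1]
        visible = polylines.masked_fill(mask.unsqueeze(-1), 0.0)  # hide masked tokens
        env_tokens = self.env_enc(visible)                # (B, N, d)
        z_env = env_tokens.mean(dim=1)
        scene_loss = 1 - F.cosine_similarity(z_motion, z_env, dim=-1).mean()
        # Instance-level: masked autoencoding of the polyline tokens.
        recon = self.mae_dec(env_tokens)
        mae_loss = F.mse_loss(recon[mask], polylines[mask])
        return scene_loss + 0.5 * mae_loss                # 0.5 is an assumed weighting

model = JointPretrainSketch()
loss = model(torch.randn(8, 10, 2), torch.randn(8, 64, 4), torch.rand(8, 64) > 0.5)
loss.backward()
```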
Related papers
- GITSR: Graph Interaction Transformer-based Scene Representation for Multi Vehicle Collaborative Decision-making [9.910230703889956]
This study focuses on efficient scene representation and on modeling the spatial interaction behaviors of traffic states.
In this study, we propose GITSR, an effective framework for Graph Interaction Transformer-based Scene Representation.
arXiv Detail & Related papers (2024-11-03T15:27:26Z)
- SceneMotion: From Agent-Centric Embeddings to Scene-Wide Forecasts [13.202036465220766]
Self-driving vehicles rely on multimodal motion forecasts to interact with their environment and plan safe maneuvers.
We introduce SceneMotion, an attention-based model for forecasting scene-wide motion modes of multiple traffic agents.
This module learns a scene-wide latent space from multiple agent-centric embeddings, enabling joint forecasting and interaction modeling.
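A minimal sketch of this idea, assuming learned latent queries that cross-attend to agent-centric embeddings; the dimensions, query count, and use of nn.MultiheadAttention are illustrative, not SceneMotion's exact design.

```python
# Learned latent queries attend over per-agent embeddings to form a
# scene-wide latent space that can be decoded into joint motion modes.
import torch
from torch import nn

class SceneWideLatent(nn.Module):
    def __init__(self, d=128, n_latents=16, horizon=80):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d))
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.head = nn.Linear(d, horizon * 2)  # decode each latent into one joint mode

    def forward(self, agent_emb):               # agent_emb: (B, A, d), one per agent
        q = self.latents.unsqueeze(0).expand(agent_emb.size(0), -1, -1)
        scene, _ = self.attn(q, agent_emb, agent_emb)  # (B, n_latents, d)
        return self.head(scene)                        # (B, n_latents, horizon*2)

out = SceneWideLatent()(torch.randn(4, 12, 128))
print(out.shape)  # torch.Size([4, 16, 160])
```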
arXiv Detail & Related papers (2024-08-02T18:49:14Z)
- Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding [121.08841110022607]
Existing agent-centric methods have demonstrated outstanding performance on public benchmarks.
We introduce the K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism allowing the pairwise-relative representation to be used by Transformers.
By sharing contexts among agents and reusing the unchanged contexts, our approach is as efficient as scene-centric methods, while performing on par with state-of-the-art agent-centric methods.
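A hedged sketch of K-nearest-neighbor attention with an additive relative-pose bias; K, the feature sizes, and the pose-encoding MLP are illustrative assumptions rather than KNARPE's exact formulation.

```python
# Each token attends only to its K nearest neighbors, with the
# pairwise-relative pose (dx, dy, dyaw) injected into the keys.
import torch
import torch.nn.functional as F
from torch import nn

def knn_rel_pose_attention(x, pos, yaw, proj_qkv, pose_mlp, K=8):
    # x: (N, d) token features; pos: (N, 2) positions; yaw: (N,) headings.
    N, d = x.shape
    dist = torch.cdist(pos, pos)                      # (N, N) pairwise distances
    idx = dist.topk(K, largest=False).indices         # (N, K), includes self at dist 0
    q, k, v = proj_qkv(x).chunk(3, dim=-1)            # each (N, d)
    k, v = k[idx], v[idx]                             # gather neighbors: (N, K, d)
    rel = torch.cat([pos[idx] - pos[:, None, :],      # relative position
                     (yaw[idx] - yaw[:, None]).unsqueeze(-1)], dim=-1)  # (N, K, 3)
    k = k + pose_mlp(rel)                             # relative pose encoding
    attn = F.softmax((q[:, None, :] * k).sum(-1) / d ** 0.5, dim=-1)    # (N, K)
    return (attn.unsqueeze(-1) * v).sum(1)            # (N, d)

d = 64
out = knn_rel_pose_attention(
    torch.randn(32, d), torch.randn(32, 2), torch.rand(32) * 6.28,
    nn.Linear(d, 3 * d), nn.Sequential(nn.Linear(3, d), nn.ReLU(), nn.Linear(d, d)))
print(out.shape)  # torch.Size([32, 64])
```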
arXiv Detail & Related papers (2023-10-19T17:59:01Z)
- Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-DepthFormer framework, which jointly predicts scene depth and a 3D motion field.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers to obtain enhanced depth feature representations.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic priors.
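A minimal sketch of the first contribution, assuming token-level self- and cross-attention between reference- and source-frame features; the layer counts, dimensions, and sigmoid depth head are illustrative.

```python
# Multi-view correlation via self-attention on the reference frame
# followed by cross-attention into the source frame's features.
import torch
from torch import nn

class MultiViewDepthFeatures(nn.Module):
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.depth_head = nn.Linear(d, 1)

    def forward(self, ref_tokens, src_tokens):
        # ref/src tokens: (B, H*W, d) features from reference and source frames.
        x, _ = self.self_attn(ref_tokens, ref_tokens, ref_tokens)
        x, _ = self.cross_attn(x, src_tokens, src_tokens)  # aggregate multi-view cues
        return self.depth_head(x).sigmoid()                # per-token normalized depth

depth = MultiViewDepthFeatures()(torch.randn(2, 256, 64), torch.randn(2, 256, 64))
print(depth.shape)  # torch.Size([2, 256, 1])
```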
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
- Motion Transformer with Global Intention Localization and Local Movement Refinement [103.75625476231401]
Motion TRansformer (MTR) models motion prediction as the joint optimization of global intention localization and local movement refinement.
MTR achieves state-of-the-art performance on both the marginal and joint motion prediction challenges.
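A hedged sketch of this two-part decoding, assuming learnable intention queries, a goal head for global intention localization, and an offset head for local movement refinement; the query count and head shapes are illustrative.

```python
# Intention queries attend to scene context; each query localizes a coarse
# goal and refines a full trajectory of local offsets around it.
import torch
from torch import nn

class IntentionDecoder(nn.Module):
    def __init__(self, d=128, n_queries=64, horizon=80):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d))  # one per intention anchor
        self.attn = nn.MultiheadAttention(d, 8, batch_first=True)
        self.goal_head = nn.Linear(d, 2)              # global intention: goal point
        self.refine_head = nn.Linear(d, horizon * 2)  # local movement refinement

    def forward(self, scene_ctx):                     # scene_ctx: (B, N, d)
        q = self.queries.unsqueeze(0).expand(scene_ctx.size(0), -1, -1)
        h, _ = self.attn(q, scene_ctx, scene_ctx)
        goals = self.goal_head(h)                     # (B, n_queries, 2)
        # Simplification: refinement offsets are added around the goal.
        traj = goals.unsqueeze(2) + self.refine_head(h).view(*h.shape[:2], -1, 2)
        return goals, traj                            # traj: (B, n_queries, horizon, 2)

goals, traj = IntentionDecoder()(torch.randn(2, 32, 128))
print(goals.shape, traj.shape)  # (2, 64, 2) (2, 64, 80, 2)
```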
arXiv Detail & Related papers (2022-09-27T16:23:14Z)
- Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy.
MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
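A rough sketch of a two-level motion hierarchy; note that the "capsules" below are plain feature vectors, so the routing and pose structure of the actual MCAE is omitted, and all sizes are assumptions.

```python
# Short trajectory snippets become low-level representations, which are
# composed into higher-level segment representations and decoded back.
import torch
from torch import nn

class TwoLevelCapsules(nn.Module):
    def __init__(self, snippet_len=8, d_low=16, d_high=32):
        super().__init__()
        self.snippet_enc = nn.Linear(snippet_len * 2, d_low)         # level 1: per snippet
        self.segment_enc = nn.GRU(d_low, d_high, batch_first=True)   # level 2: across snippets
        self.decoder = nn.Linear(d_high, snippet_len * 2)            # autoencoding target

    def forward(self, traj):          # traj: (B, T, 2), T divisible by snippet_len
        B, T, _ = traj.shape
        snippets = traj.reshape(B, -1, self.snippet_enc.in_features)  # (B, S, len*2)
        low = self.snippet_enc(snippets)                              # (B, S, d_low)
        high, _ = self.segment_enc(low)                               # (B, S, d_high)
        recon = self.decoder(high).reshape(B, T, 2)
        return recon, low, high

recon, low, high = TwoLevelCapsules()(torch.randn(4, 64, 2))
print(recon.shape)  # torch.Size([4, 64, 2])
```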
arXiv Detail & Related papers (2021-10-01T16:52:03Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels that adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects using a Transformer.
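A minimal sketch of the second step, assuming a hypothetical top-k saliency score to select the foreground tokens before a Transformer aggregates them; the selection rule and sizes are illustrative, not EAN's exact mechanism.

```python
# Score spatial tokens, keep only the top-k "foreground" ones, and run a
# small Transformer over them to build a global video representation.
import torch
from torch import nn

class ForegroundTransformer(nn.Module):
    def __init__(self, d=64, k=4):
        super().__init__()
        self.k = k
        self.score = nn.Linear(d, 1)             # saliency score per spatial token
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):                   # tokens: (B, N, d) local cues
        scores = self.score(tokens).squeeze(-1)  # (B, N)
        idx = scores.topk(self.k, dim=1).indices
        selected = tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        return self.encoder(selected).mean(dim=1)  # (B, d) global representation

rep = ForegroundTransformer()(torch.randn(2, 49, 64))
print(rep.shape)  # torch.Size([2, 64])
```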
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
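A sketch of the output parameterization, showing how per-object masks and 3D rigid transforms (R, t) induce a dense scene-flow field; this illustrates the representation only, not the modular network itself.

```python
# Compose per-object SE(3) motions, weighted by soft segmentation masks,
# into a dense per-point 3D flow field.
import torch

def rigid_scene_flow(points, masks, R, t):
    # points: (N, 3) 3D points; masks: (K, N) soft assignment to K rigid objects;
    # R: (K, 3, 3) rotations; t: (K, 3) translations.
    moved = torch.einsum('kij,nj->kni', R, points) + t[:, None, :]  # (K, N, 3)
    flow = (moved - points[None]) * masks[..., None]                # weight by assignment
    return flow.sum(dim=0)                                          # (N, 3) dense flow

N, K = 100, 3
masks = torch.softmax(torch.randn(K, N), dim=0)   # each point sums to 1 over objects
R = torch.eye(3).repeat(K, 1, 1)                  # identity rotations for the demo
flow = rigid_scene_flow(torch.randn(N, 3), masks, R, torch.randn(K, 3))
print(flow.shape)  # torch.Size([100, 3])
```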
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
- Implicit Latent Variable Model for Scene-Consistent Motion Forecasting [78.74510891099395]
In this paper, we aim to learn scene-consistent motion forecasts of complex urban traffic directly from sensor data.
We model the scene as an interaction graph and employ powerful graph neural networks to learn a distributed latent representation of the scene.
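A hedged sketch of message passing over a scene interaction graph followed by sampling a distributed latent; the single sum-aggregation round and the Gaussian reparameterization are illustrative simplifications of the paper's model.

```python
# One round of message passing over directed actor-to-actor edges, then a
# per-actor Gaussian latent is sampled, giving one scene-consistent draw.
import torch
from torch import nn

class SceneGNNLatent(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.upd = nn.GRUCell(d, d)
        self.to_latent = nn.Linear(d, 2 * d)               # mean and log-variance

    def forward(self, h, edges):                           # h: (A, d); edges: (E, 2)
        src, dst = edges[:, 0], edges[:, 1]
        m = self.msg(torch.cat([h[src], h[dst]], dim=-1))  # messages along edges
        agg = torch.zeros_like(h).index_add_(0, dst, m)    # sum at each receiver
        h = self.upd(agg, h)
        mu, logvar = self.to_latent(h).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # one sample

A = 5
edges = torch.tensor([[i, j] for i in range(A) for j in range(A) if i != j])
z = SceneGNNLatent()(torch.randn(A, 64), edges)
print(z.shape)  # torch.Size([5, 64])
```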
arXiv Detail & Related papers (2020-07-23T14:31:25Z)
- Cross Scene Prediction via Modeling Dynamic Correlation using Latent Space Shared Auto-Encoders [6.530318792830862]
Given a set of unsynchronized historical observations of two scenes, the goal is to learn a cross-scene predictor.
A method is proposed that solves this problem by modeling dynamic correlation with latent-space-shared auto-encoders.
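A minimal sketch of latent-space-shared auto-encoders, assuming paired samples for the latent-alignment term; the paper's observations are unsynchronized, which this simplification glosses over, and all sizes are assumptions.

```python
# Each scene gets its own encoder/decoder, but both map into one shared
# latent space, so scene B can be decoded from scene A's encoding.
import torch
from torch import nn

class SharedLatentAE(nn.Module):
    def __init__(self, dim_a=32, dim_b=48, d=16):
        super().__init__()
        self.enc_a, self.enc_b = nn.Linear(dim_a, d), nn.Linear(dim_b, d)
        self.dec_a, self.dec_b = nn.Linear(d, dim_a), nn.Linear(d, dim_b)

    def forward(self, x_a, x_b):
        z_a, z_b = self.enc_a(x_a), self.enc_b(x_b)
        recon = (self.dec_a(z_a), self.dec_b(z_b))   # within-scene reconstruction
        align = (z_a - z_b).pow(2).mean()            # tie the two latents together
        cross = self.dec_b(z_a)                      # predict scene B from scene A
        return recon, align, cross

model = SharedLatentAE()
(recon_a, recon_b), align, pred_b = model(torch.randn(8, 32), torch.randn(8, 48))
print(pred_b.shape)  # torch.Size([8, 48])
```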
arXiv Detail & Related papers (2020-03-31T03:08:23Z)
- Collaborative Motion Prediction via Neural Motion Message Passing [37.72454920355321]
We propose neural motion message passing (NMMP) to explicitly model directed interactions between actors and learn representations for them.
Based on the proposed NMMP, we design the motion prediction systems for two settings: the pedestrian setting and the joint pedestrian and vehicle setting.
Both systems outperform the previous state-of-the-art methods on several existing benchmarks.
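A hedged sketch of message passing with explicit directed-edge embeddings updated jointly with the actor embeddings; the single update round and MLP shapes are illustrative assumptions rather than NMMP's exact design.

```python
# Directed edges between actors carry their own embeddings; edges and
# actor nodes are updated jointly in each message-passing round.
import torch
from torch import nn

class MotionMessagePassing(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU())  # update edges
        self.node_mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())  # update actors

    def forward(self, h, e, edges):       # h: (A, d); e: (E, d); edges: (E, 2) src->dst
        src, dst = edges[:, 0], edges[:, 1]
        e = self.edge_mlp(torch.cat([h[src], h[dst], e], dim=-1))   # directed interaction
        agg = torch.zeros_like(h).index_add_(0, dst, e)             # incoming messages
        h = self.node_mlp(torch.cat([h, agg], dim=-1))
        return h, e

A = 4
edges = torch.tensor([[i, j] for i in range(A) for j in range(A) if i != j])
h, e = MotionMessagePassing()(torch.randn(A, 32), torch.randn(len(edges), 32), edges)
print(h.shape, e.shape)  # torch.Size([4, 32]) torch.Size([12, 32])
```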
arXiv Detail & Related papers (2020-03-14T10:12:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.