Wayformer: Motion Forecasting via Simple & Efficient Attention Networks
- URL: http://arxiv.org/abs/2207.05844v1
- Date: Tue, 12 Jul 2022 21:19:04 GMT
- Title: Wayformer: Motion Forecasting via Simple & Efficient Attention Networks
- Authors: Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, Khaled S. Refaat, Benjamin Sapp
- Abstract summary: We present Wayformer, a family of attention based architectures for motion forecasting that are simple and homogeneous.
For each fusion type we explore strategies to trade off efficiency and quality via factorized attention or latent query attention.
We show that early fusion, despite its simplicity of construction, is not only modality agnostic but also achieves state-of-the-art results on both the Waymo Open Motion Dataset (WOMD) and Argoverse leaderboards.
- Score: 16.031530911221534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motion forecasting for autonomous driving is a challenging task because
complex driving scenarios result in a heterogeneous mix of static and dynamic
inputs. It is an open problem how best to represent and fuse information about
road geometry, lane connectivity, time-varying traffic light state, and history
of a dynamic set of agents and their interactions into an effective encoding.
To model this diverse set of input features, many approaches proposed to design
an equally complex system with a diverse set of modality specific modules. This
results in systems that are difficult to scale, extend, or tune in rigorous
ways to trade off quality and efficiency. In this paper, we present Wayformer,
a family of attention based architectures for motion forecasting that are
simple and homogeneous. Wayformer offers a compact model description consisting
of an attention based scene encoder and a decoder. In the scene encoder we
study the choice of early, late and hierarchical fusion of the input
modalities. For each fusion type we explore strategies to trade off efficiency
and quality via factorized attention or latent query attention. We show that
early fusion, despite its simplicity of construction, is not only modality
agnostic but also achieves state-of-the-art results on both the Waymo Open
Motion Dataset (WOMD) and Argoverse leaderboards, demonstrating the
effectiveness of our design philosophy.
Related papers
- DeMo: Decoupling Motion Forecasting into Directional Intentions and Dynamic States [6.856351850183536]
We introduce DeMo, a framework that decouples multi-modal trajectory queries into two types: directional intentions and dynamic states.
By leveraging this format, we separately optimize the multi-modality and dynamic evolutionary properties of trajectories.
We additionally introduce combined Attention and Mamba techniques for global information aggregation and state sequence modeling.
arXiv Detail & Related papers (2024-10-08T12:27:49Z)
- EraW-Net: Enhance-Refine-Align W-Net for Scene-Associated Driver Attention Estimation [17.0226030258296]
Associating driver attention with the driving scene across two fields of view is a hard cross-domain perception problem.
Previous methods typically focus on a single view or map attention to the scene via estimated gaze.
We propose a novel method for end-to-end scene-associated driver attention estimation, called EraW-Net.
arXiv Detail & Related papers (2024-08-16T07:12:47Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - Drive Anywhere: Generalizable End-to-end Autonomous Driving with
Multi-modal Foundation Models [114.69732301904419]
We present an approach to end-to-end open-set (any environment/scene) autonomous driving that can produce driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z) - ProphNet: Efficient Agent-Centric Motion Forecasting with
Anchor-Informed Proposals [6.927103549481412]
Motion forecasting is a key module in an autonomous driving system.
Due to the heterogeneous nature of multi-sourced input, multimodality in agent behavior, and low latency required by onboard deployment, this task is notoriously challenging.
This paper proposes a novel agent-centric model with anchor-informed proposals for efficient multimodal motion prediction.
arXiv Detail & Related papers (2023-03-21T17:58:28Z) - MultiPath++: Efficient Information Fusion and Trajectory Aggregation for
Behavior Prediction [42.563865078323204]
We present MultiPath++, a future prediction model that achieves state-of-the-art performance on popular benchmarks, including the Argoverse Motion Forecasting Competition and the Open Motion Prediction Challenge.
arXiv Detail & Related papers (2021-11-29T21:36:53Z) - Decoder Fusion RNN: Context and Interaction Aware Decoders for
Trajectory Prediction [53.473846742702854]
We propose a recurrent, attention-based approach for motion forecasting.
Decoder Fusion RNN (DF-RNN) is composed of a recurrent behavior encoder, an inter-agent multi-headed attention module, and a context-aware decoder.
We demonstrate the efficacy of our method by testing it on the Argoverse motion forecasting dataset and show its state-of-the-art performance on the public benchmark.
arXiv Detail & Related papers (2021-08-12T15:53:37Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects using a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background.
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
arXiv Detail & Related papers (2021-04-15T17:59:32Z) - Multi-intersection Traffic Optimisation: A Benchmark Dataset and a
Strong Baseline [85.9210953301628]
Control of traffic signals is fundamental and critical to alleviate traffic congestion in urban areas.
Because of the high complexity of modelling the problem, experimental settings of current works are often inconsistent.
We propose a novel and strong baseline model based on deep reinforcement learning with the encoder-decoder structure.
arXiv Detail & Related papers (2021-01-24T03:55:39Z)