Fast and Efficient Transformer-based Method for Bird's Eye View Instance Prediction
- URL: http://arxiv.org/abs/2411.06851v1
- Date: Mon, 11 Nov 2024 10:35:23 GMT
- Title: Fast and Efficient Transformer-based Method for Bird's Eye View Instance Prediction
- Authors: Miguel Antunes-García, Luis M. Bergasa, Santiago Montiel-Marín, Rafael Barea, Fabio Sánchez-García, Ángel Llamazares
- Abstract summary: This paper introduces a novel BEV instance prediction architecture based on a simplified paradigm.
The proposed system prioritizes speed, aiming at reduced parameter counts and inference times.
The implementation of the proposed architecture is optimized for performance improvements in PyTorch version 2.1.
- Score: 0.8458547573621331
- License:
- Abstract: Accurate object detection and prediction are critical to ensure the safety and efficiency of self-driving architectures. Predicting object trajectories and occupancy enables autonomous vehicles to anticipate movements and make decisions with future information, increasing their adaptability and reducing the risk of accidents. Current State-Of-The-Art (SOTA) approaches often isolate the detection, tracking, and prediction stages, which can lead to significant prediction errors due to accumulated inaccuracies between stages. Recent advances have improved the feature representation of multi-camera perception systems through Bird's-Eye View (BEV) transformations, boosting the development of end-to-end systems capable of predicting environmental elements directly from vehicle sensor data. These systems, however, often suffer from high processing times and number of parameters, creating challenges for real-world deployment. To address these issues, this paper introduces a novel BEV instance prediction architecture based on a simplified paradigm that relies only on instance segmentation and flow prediction. The proposed system prioritizes speed, aiming at reduced parameter counts and inference times compared to existing SOTA architectures, thanks to the incorporation of an efficient transformer-based architecture. Furthermore, the implementation of the proposed architecture is optimized for performance improvements in PyTorch version 2.1. Code and trained models are available at https://github.com/miguelag99/Efficient-Instance-Prediction
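For context only: the simplified paradigm described in the abstract (per-cell instance segmentation plus flow prediction over BEV features, with an efficient transformer and PyTorch 2.x compilation for speed) could be sketched roughly as follows. The module names, tensor shapes, and the use of torch.compile are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn

class BEVInstancePredictionHead(nn.Module):
    """Hypothetical sketch: a transformer encoder over flattened BEV features,
    followed by a per-cell segmentation logit and a 2D flow vector for each
    future step. Not the paper's architecture."""

    def __init__(self, embed_dim=128, n_heads=4, n_layers=2, n_future=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.seg_head = nn.Linear(embed_dim, n_future)       # one logit per future step
        self.flow_head = nn.Linear(embed_dim, n_future * 2)  # (dx, dy) per future step
        self.n_future = n_future

    def forward(self, bev_feats):
        # bev_feats: (B, H*W, C) flattened BEV feature map from the camera backbone.
        x = self.encoder(bev_feats)
        seg = self.seg_head(x)                                        # (B, H*W, T)
        flow = self.flow_head(x).view(*x.shape[:2], self.n_future, 2) # (B, H*W, T, 2)
        return seg, flow

model = BEVInstancePredictionHead()
model = torch.compile(model)   # PyTorch >= 2.0 compilation, as the abstract hints at
seg, flow = model(torch.randn(1, 50 * 50, 128))  # small 50x50 BEV grid for illustration
```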
Related papers
- OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries.
OPUS incorporates a suite of non-trivial strategies to enhance model performance.
Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
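As a rough illustration of the query-based idea summarized above (a set of learnable queries decoded into occupied locations and classes), a hypothetical PyTorch sketch might look like the following; the module names, query count, and shapes are assumptions, not the OPUS design.
```python
import torch
import torch.nn as nn

class SparseOccupancyQueries(nn.Module):
    """Hypothetical query-based occupancy head: each learnable query regresses
    a 3D location and predicts a semantic class (loosely inspired by the summary)."""

    def __init__(self, num_queries=600, embed_dim=256, num_classes=17):
        super().__init__()
        self.queries = nn.Embedding(num_queries, embed_dim)
        layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.loc_head = nn.Linear(embed_dim, 3)            # (x, y, z) per query
        self.cls_head = nn.Linear(embed_dim, num_classes)  # class logits per query

    def forward(self, image_feats):
        # image_feats: (B, N, C) flattened multi-camera features.
        q = self.queries.weight.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        q = self.decoder(q, image_feats)
        return self.loc_head(q), self.cls_head(q)

head = SparseOccupancyQueries()
locs, logits = head(torch.randn(2, 1024, 256))  # locs: (2, 600, 3), logits: (2, 600, 17)
```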
arXiv Detail & Related papers (2024-09-14T07:44:22Z)
- Self-supervised Multi-future Occupancy Forecasting for Autonomous Driving [45.886941596233974]
LiDAR-generated occupancy grid maps (L-OGMs) offer a robust bird's-eye view scene representation.
Our proposed framework performs L-OGM prediction in the latent space of a generative architecture.
We decode predictions using either a single-step decoder, which provides high-quality predictions in real-time, or a diffusion-based batch decoder.
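The two sentences above only outline the idea; a minimal, hypothetical sketch of forecasting occupancy grids in a learned latent space with a single-step decoder (all layer choices are assumptions, not the authors' generative architecture) could be:
```python
import torch
import torch.nn as nn

class LatentOGMForecaster(nn.Module):
    """Hypothetical pipeline: encode past occupancy grids to latent vectors,
    roll them forward with a GRU, and decode the next grid in a single step."""

    def __init__(self, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, latent_dim, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.temporal = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 32), nn.Unflatten(1, (1, 32, 32)), nn.Sigmoid())

    def forward(self, past_grids):
        # past_grids: (B, T, 1, H, W) occupancy history.
        b, t = past_grids.shape[:2]
        z = self.encoder(past_grids.flatten(0, 1)).view(b, t, -1)
        _, h = self.temporal(z)            # h: (1, B, latent_dim) summary of the past
        return self.decoder(h.squeeze(0))  # predicted occupancy grid: (B, 1, 32, 32)

pred = LatentOGMForecaster()(torch.rand(2, 5, 1, 64, 64))
```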
arXiv Detail & Related papers (2024-07-30T18:37:59Z)
- Are Self-Attentions Effective for Time Series Forecasting? [4.990206466948269]
Time series forecasting is crucial for applications across multiple domains and various scenarios.
Recent findings have indicated that simpler linear models might outperform complex Transformer-based approaches.
We introduce a new architecture, Cross-Attention-only Time Series Transformer (CATS).
Our model achieves superior performance with the lowest mean squared error and uses fewer parameters compared to existing models.
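As a hedged sketch of the cross-attention-only idea (learned horizon queries attending to past time steps, with no self-attention over the input sequence), one might write something like the following; dimensions and names are illustrative assumptions, not the CATS implementation.
```python
import torch
import torch.nn as nn

class CrossAttentionForecaster(nn.Module):
    """Hypothetical cross-attention-only forecaster: learned horizon queries attend
    to past time-step embeddings; the input sequence never attends to itself."""

    def __init__(self, in_dim=1, embed_dim=64, horizon=24):
        super().__init__()
        self.in_proj = nn.Linear(in_dim, embed_dim)
        self.horizon_queries = nn.Parameter(torch.randn(horizon, embed_dim))
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.out_proj = nn.Linear(embed_dim, in_dim)

    def forward(self, history):
        # history: (B, L, in_dim) past observations.
        mem = self.in_proj(history)
        q = self.horizon_queries.unsqueeze(0).expand(history.size(0), -1, -1)
        out, _ = self.cross_attn(q, mem, mem)  # queries attend to the past only
        return self.out_proj(out)              # forecast: (B, horizon, in_dim)

forecast = CrossAttentionForecaster()(torch.randn(8, 96, 1))  # (8, 24, 1)
```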
arXiv Detail & Related papers (2024-05-27T06:49:39Z)
- FipTR: A Simple yet Effective Transformer Framework for Future Instance Prediction in Autonomous Driving [8.370230253558159]
The future instance prediction from a Bird's Eye View (BEV) perspective is a vital component in autonomous driving.
We propose a simple yet effective fully end-to-end framework named Future Instance Prediction Transformer (FipTR).
arXiv Detail & Related papers (2024-04-19T13:08:43Z)
- Knowledge-aware Graph Transformer for Pedestrian Trajectory Prediction [15.454206825258169]
Predicting pedestrian motion trajectories is crucial for path planning and motion control of autonomous vehicles.
Recent deep learning-based prediction approaches mainly utilize information like trajectory history and interactions between pedestrians.
This paper proposes a graph transformer structure to improve prediction performance.
arXiv Detail & Related papers (2024-01-10T01:50:29Z)
- Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments.
Our approach enhances LiDAR-based detection models using spatial quantized historical features.
Experiments on real-world datasets demonstrate significant improvements.
arXiv Detail & Related papers (2023-09-21T15:00:31Z)
- Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving [68.95178518732965]
A self-driving vehicle (SDV) must be able to perceive its surroundings and predict the future behavior of other traffic participants.
Existing works either perform object detection followed by trajectory prediction of the detected objects, or predict dense occupancy and flow grids for the whole scene.
This motivates our unified approach to perception and future prediction that implicitly represents occupancy and flow over time with a single neural network.
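To make the idea of an implicit occupancy-and-flow representation concrete, here is a minimal hypothetical decoder that is queried at continuous (x, y, t) points; the pooled scene feature and layer sizes are simplifying assumptions, not the paper's network.
```python
import torch
import torch.nn as nn

class ImplicitOccupancyFlowField(nn.Module):
    """Hypothetical implicit decoder: given a scene feature and continuous (x, y, t)
    query points, predict an occupancy probability and a 2D flow vector per point."""

    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 2))  # occupancy logit + (dx, dy)

    def forward(self, scene_feat, query_points):
        # scene_feat: (B, feat_dim) pooled scene context (a simplification);
        # query_points: (B, N, 3) continuous (x, y, t) locations to evaluate.
        ctx = scene_feat.unsqueeze(1).expand(-1, query_points.size(1), -1)
        out = self.mlp(torch.cat([ctx, query_points], dim=-1))
        occupancy = torch.sigmoid(out[..., :1])
        flow = out[..., 1:]
        return occupancy, flow

occ, flow = ImplicitOccupancyFlowField()(torch.randn(1, 128), torch.rand(1, 4096, 3))
```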
arXiv Detail & Related papers (2023-08-02T23:39:24Z)
- Conditioned Human Trajectory Prediction using Iterative Attention Blocks [70.36888514074022]
We present a simple yet effective pedestrian trajectory prediction model aimed at predicting pedestrian positions in urban-like environments.
Our model is a neural-based architecture that can run several layers of attention blocks and transformers in an iterative sequential fashion.
We show that without explicit introduction of social masks, dynamical models, social pooling layers, or complicated graph-like structures, it is possible to produce results on par with SoTA models.
arXiv Detail & Related papers (2022-06-29T07:49:48Z)
- Real-time Object Detection for Streaming Perception [84.2559631820007]
Streaming perception is proposed to jointly evaluate latency and accuracy with a single metric for online video perception.
We build a simple and effective framework for streaming perception.
Our method achieves competitive performance on Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline.
arXiv Detail & Related papers (2022-03-23T11:33:27Z)
- PreTR: Spatio-Temporal Non-Autoregressive Trajectory Prediction Transformer [0.9786690381850356]
We introduce a model called PRediction Transformer (PReTR) that extracts features from multi-agent scenes by employing a factorized spatio-temporal attention module.
It shows less computational needs than previously studied models with empirically better results.
We leverage encoder-decoder Transformer networks for parallel decoding of a set of learned object queries.
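A rough, hypothetical sketch of factorized spatio-temporal attention (agents attend to each other within each frame, then each agent attends across time) is shown below; shapes and names are assumptions, not PReTR's module.
```python
import torch
import torch.nn as nn

class FactorizedSTEncoder(nn.Module):
    """Hypothetical factorized spatio-temporal attention: spatial attention over
    agents within each time step, then temporal attention over steps per agent."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, A, D) embeddings for T time steps and A agents.
        b, t, a, d = x.shape
        s = x.reshape(b * t, a, d)
        s, _ = self.spatial(s, s, s)     # agents attend within a single frame
        x = s.reshape(b, t, a, d).permute(0, 2, 1, 3).reshape(b * a, t, d)
        x, _ = self.temporal(x, x, x)    # each agent attends over its own history
        return x.reshape(b, a, t, d).permute(0, 2, 1, 3)

out = FactorizedSTEncoder()(torch.randn(2, 8, 5, 64))  # (2, 8, 5, 64)
# A non-autoregressive decoder could then cross-attend a set of learned future
# queries to these features and emit all predicted way-points in parallel.
```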
arXiv Detail & Related papers (2022-03-17T12:52:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.