Related papers: Efficient Multi-Camera Tokenization with Triplanes for End-to-End Driving

URL: http://arxiv.org/abs/2506.12251v2
Date: Mon, 21 Jul 2025 17:22:35 GMT
Title: Efficient Multi-Camera Tokenization with Triplanes for End-to-End Driving
Authors: Boris Ivanovic, Cristiano Saltori, Yurong You, Yan Wang, Wenjie Luo, Marco Pavone
Abstract summary: Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures. We present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering. Experiments on a large-scale AV dataset and state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies.
Score: 33.2092963387255
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on a large-scale AV dataset and state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to 72% fewer tokens, resulting in up to 50% faster policy inference while achieving the same open-loop motion planning accuracy and improved offroad rates in closed-loop driving simulations.
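To make the token-count argument in the abstract concrete, here is a minimal sketch of the arithmetic. It is not the authors' implementation: the function names, the camera counts, image sizes, plane sizes, and patch sizes are all hypothetical, and the assumption that a triplane is tokenized by patchifying its three planes is ours. The point is only that a per-image patch tokenizer grows with the number of cameras and their resolution, while a fixed-size triplane does not.

```python
def patch_token_count(num_cameras, height, width, patch_size):
    """Tokens produced by splitting every camera image into
    non-overlapping square patches (ViT-style patch tokenization)."""
    return num_cameras * (height // patch_size) * (width // patch_size)


def triplane_token_count(plane_shapes, patch_size):
    """Tokens produced by patchifying three fixed-size feature planes.

    plane_shapes is a list of (rows, cols) grid sizes, one per plane.
    The sizes are hypothetical and independent of the camera setup.
    """
    return sum((rows // patch_size) * (cols // patch_size)
               for rows, cols in plane_shapes)


if __name__ == "__main__":
    # Hypothetical triplane: one ground plane and two vertical planes.
    planes = [(96, 96), (96, 48), (96, 48)]
    triplane_tokens = triplane_token_count(planes, patch_size=8)

    # Hypothetical camera rigs: the patch-based count changes with the rig,
    # the triplane count does not.
    for cams, (h, w) in [(3, (320, 512)), (7, (320, 512)), (7, (640, 1024))]:
        patch_tokens = patch_token_count(cams, h, w, patch_size=16)
        print(f"{cams} cameras at {h}x{w}: "
              f"{patch_tokens} patch tokens vs {triplane_tokens} triplane tokens")
```

The actual savings reported in the abstract (up to 72% fewer tokens) depend on the authors' configuration, which this sketch does not reproduce.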

Related papers

Tracking Meets Large Multimodal Models for Driving Scenario Understanding [76.71815464110153]
Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details. We introduce a novel approach for embedding this tracking information into LMMs to enhance their understanding of driving scenarios.
arXiv Detail & Related papers (2025-03-18T17:59:12Z)
DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving [62.62464518137153]
DriveTransformer is a simplified E2E-AD framework designed for ease of scaling up. It is composed of three unified operations: task self-attention, sensor cross-attention, and temporal cross-attention. It achieves state-of-the-art performance on both the simulated closed-loop benchmark Bench2Drive and the real-world open-loop benchmark nuScenes with high FPS.
arXiv Detail & Related papers (2025-03-07T11:41:18Z)
VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving [44.91443640710085]
VisionPAD is a novel self-supervised pre-training paradigm for vision-centric algorithms in autonomous driving. It reconstructs multi-view representations using only images as supervision. It significantly improves performance in 3D object detection, occupancy prediction and map segmentation.
arXiv Detail & Related papers (2024-11-22T03:59:41Z)
ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks. We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation. Our purely convolutional architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z)
Fast and Efficient Transformer-based Method for Bird's Eye View Instance Prediction [0.8458547573621331]
This paper introduces a novel BEV instance prediction architecture based on a simplified paradigm. The proposed system prioritizes speed, aiming at reduced parameter counts and inference times. The implementation of the proposed architecture is optimized for performance improvements in PyTorch version 2.1.
arXiv Detail & Related papers (2024-11-11T10:35:23Z)
DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation [10.296670127024045]
DriveScape is an end-to-end framework for multi-view, 3D condition-guided video generation. Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information. DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39.
arXiv Detail & Related papers (2024-09-09T09:43:17Z)
Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments. Our approach enhances LiDAR-based detection models using spatial quantized historical features. Experiments on real-world datasets demonstrate significant improvements.
arXiv Detail & Related papers (2023-09-21T15:00:31Z)
Efficient Transformer-based 3D Object Detection with Dynamic Token Halting [19.88560740238657]
We propose an effective approach for accelerating transformer-based 3D object detectors by dynamically halting tokens at different layers. Although halting a token is a non-differentiable operation, our method allows for differentiable end-to-end learning. Our framework allows halted tokens to be reused to inform the model's predictions through a straightforward token recycling mechanism.
arXiv Detail & Related papers (2023-03-09T07:26:49Z)
Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml [6.223322030008291]
We show how field programmable gate arrays can serve as hardware accelerators for real-time semantic segmentation tasks relevant for autonomous driving. Considering compressed versions of the ENet convolutional neural network architecture, we demonstrate a fully-on-chip deployment with a latency of 4.9 ms per image. We show that, through aggressive filter reduction, heterogeneous quantization-aware training, and an optimized implementation of convolutional layers, the power consumption and resource utilization can be significantly reduced.
arXiv Detail & Related papers (2022-05-16T13:55:16Z)
Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified and learning based approach to the 3D MOT problem. We employ a Neural Message Passing network for data association that is fully trainable. We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
arXiv Detail & Related papers (2021-04-23T17:59:28Z)
YOLOpeds: Efficient Real-Time Single-Shot Pedestrian Detection for Smart Camera Applications [2.588973722689844]
This work addresses the challenge of achieving a good trade-off between accuracy and speed for efficient deployment of deep-learning-based pedestrian detection in smart camera applications. A computationally efficient architecture based on separable convolutions is introduced, integrating dense connections across layers and multi-scale feature fusion. Overall, YOLOpeds provides real-time sustained operation of over 30 frames per second with detection rates in the range of 86%, outperforming existing deep learning models.
arXiv Detail & Related papers (2020-07-27T09:50:11Z)
One-step regression and classification with crosspoint resistive memory arrays [62.997667081978825]
High speed, low energy computing machines are in demand to enable real-time artificial intelligence at the edge. One-step learning is supported by simulations of the prediction of the cost of a house in Boston and the training of a 2-layer neural network for MNIST digit recognition. Results are all obtained in one computational step, thanks to the physical, parallel, and analog computing within the crosspoint array.
arXiv Detail & Related papers (2020-05-05T08:00:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.