ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
- URL: http://arxiv.org/abs/2506.07725v1
- Date: Mon, 09 Jun 2025 13:11:02 GMT
- Title: ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
- Authors: Shadi Hamdan, Chonghao Sima, Zetong Yang, Hongyang Li, Fatma Güney
- Abstract summary: A prevalent solution is a dual-system architecture, employing a small model for rapid, reactive decisions and a larger model for slower but more informative analyses. Existing dual-system designs often implement parallel architectures where inference is either directly conducted using the large model at each current frame or retrieved from previously stored inference results. Our key insight is to shift intensive computations of the current frame to previous time steps and perform a batch inference of multiple time steps to make large models respond promptly to each time step. ETA advances state-of-the-art performance by 8% with a driving score of 69.53 while maintaining a near-real-time inference speed.
- Score: 21.645510959114326
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How can we benefit from large models without sacrificing inference speed, a common dilemma in self-driving systems? A prevalent solution is a dual-system architecture, employing a small model for rapid, reactive decisions and a larger model for slower but more informative analyses. Existing dual-system designs often implement parallel architectures where inference is either directly conducted using the large model at each current frame or retrieved from previously stored inference results. However, these works still struggle to enable large models for a timely response to every online frame. Our key insight is to shift intensive computations of the current frame to previous time steps and perform a batch inference of multiple time steps to make large models respond promptly to each time step. To achieve the shifting, we introduce Efficiency through Thinking Ahead (ETA), an asynchronous system designed to: (1) propagate informative features from the past to the current frame using future predictions from the large model, (2) extract current frame features using a small model for real-time responsiveness, and (3) integrate these dual features via an action mask mechanism that emphasizes action-critical image regions. Evaluated on the Bench2Drive CARLA Leaderboard-v2 benchmark, ETA advances state-of-the-art performance by 8% with a driving score of 69.53 while maintaining a near-real-time inference speed at 50 ms.
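The abstract's three-part design (propagating features from past frames via the large model's future predictions, encoding the current frame with a small real-time model, and fusing the two via an action mask) can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the module names, feature shapes, and gating-style mask are assumptions made only to show the asynchronous dual-branch idea.

```python
# Minimal sketch (assumptions, not the ETA codebase) of the dual-branch idea:
# a large branch runs on *past* frames and predicts features for the current
# step ahead of time; a small branch encodes the current frame in real time;
# an action mask weights the two feature sets before the policy head.
import torch
import torch.nn as nn


class ETASketch(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # "Large" branch: expensive, run on past frames off the critical path.
        self.large_branch = nn.Sequential(
            nn.Linear(feat_dim, 4 * feat_dim), nn.ReLU(),
            nn.Linear(4 * feat_dim, feat_dim))
        # "Small" branch: cheap per-frame encoder for real-time responsiveness.
        self.small_branch = nn.Linear(feat_dim, feat_dim)
        # Action mask: emphasizes action-critical feature dimensions before fusion.
        self.mask_head = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())
        self.policy = nn.Linear(feat_dim, 2)  # e.g. steering + acceleration

    def precompute(self, past_frames: torch.Tensor) -> torch.Tensor:
        # Batch inference over several past time steps, done asynchronously
        # so its result is already available when the current frame arrives.
        return self.large_branch(past_frames).mean(dim=0)

    def forward(self, current_frame: torch.Tensor,
                propagated: torch.Tensor) -> torch.Tensor:
        fast = self.small_branch(current_frame)            # real-time features
        mask = self.mask_head(torch.cat([fast, propagated], dim=-1))
        fused = mask * propagated + (1.0 - mask) * fast     # action-mask fusion
        return self.policy(fused)


# Toy usage: features for 4 past frames, then one current frame.
model = ETASketch()
propagated = model.precompute(torch.randn(4, 256))  # computed "ahead of time"
action = model(torch.randn(256), propagated)
print(action.shape)  # torch.Size([2])
```

In a real system the `precompute` step would run asynchronously, batched over multiple past time steps, so the expensive branch never blocks the per-frame control loop; that shift of heavy computation away from the current frame is the efficiency argument the abstract makes.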
Related papers
- TDFormer: A Top-Down Attention-Controlled Spiking Transformer [33.07648914591285]
We introduce TDFormer, a novel model with a top-down feedback structure that functions hierarchically. We find that these mechanisms together significantly and consistently improve the model performance on multiple datasets. In particular, our model achieves state-of-the-art performance on ImageNet with an accuracy of 86.83%.
arXiv Detail & Related papers (2025-05-17T15:55:32Z) - A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computational redundancy part of the network.
We then prune the redundancy blocks of the model and maintain the network performance.
Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z) - Trajeglish: Traffic Modeling as Next-Token Prediction [67.28197954427638]
A longstanding challenge for self-driving development is simulating dynamic driving scenarios seeded from recorded driving logs.
We apply tools from discrete sequence modeling to model how vehicles, pedestrians and cyclists interact in driving scenarios.
Our model tops the Sim Agents Benchmark, surpassing prior work along the realism meta metric by 3.3% and along the interaction metric by 9.9%.
arXiv Detail & Related papers (2023-12-07T18:53:27Z) - A Fast and Map-Free Model for Trajectory Prediction in Traffics [2.435517936694533]
This paper proposes an efficient trajectory prediction model that is not dependent on traffic maps.
By comprehensively utilizing attention mechanism, LSTM, graph convolution network and temporal transformer, our model is able to learn rich dynamic and interaction information of all agents.
Our model achieves the highest performance when comparing with existing map-free methods and also exceeds most map-based state-of-the-art methods on the Argoverse dataset.
arXiv Detail & Related papers (2023-07-19T08:36:31Z) - Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z) - An end-to-end multi-scale network for action prediction in videos [31.967024536359908]
We develop an efficient multi-scale network to predict action classes in partial videos in an end-to-end manner.
Our E2EMSNet is evaluated on three challenging datasets: BIT, HMDB51, and UCF101.
arXiv Detail & Related papers (2022-12-31T06:58:41Z) - Multi-Modal Temporal Convolutional Network for Anticipating Actions in Egocentric Videos [22.90184887794109]
Methods that are accurate but not sufficiently fast would introduce a high latency into the decision process.
This poses a problem for domains such as autonomous driving, where the reaction time is crucial.
We propose a simple and effective multi-modal architecture based on temporal convolutions.
arXiv Detail & Related papers (2021-07-18T16:21:35Z) - When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
In order to achieve a better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the queries of decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z) - Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two-layer subnets in CNNs can be converted to temporal bilinear modules by adding an auxiliary-branch sampling.
Our models can outperform most state-of-the-art methods on the Something-Something v1 and v2 datasets without pretraining.
arXiv Detail & Related papers (2020-07-25T09:07:35Z) - All at Once: Temporally Adaptive Multi-Frame Interpolation with Advanced Motion Modeling [52.425236515695914]
State-of-the-art methods are iterative solutions interpolating one frame at a time.
This work introduces a true multi-frame interpolator.
It utilizes a pyramidal style network in the temporal domain to complete the multi-frame task in one-shot.
arXiv Detail & Related papers (2020-07-23T02:34:39Z) - A Real-Time Deep Network for Crowd Counting [12.615660025855604]
We propose a compact convolutional neural network for crowd counting.
With three parallel filters executing the convolutional operation on the input image simultaneously at the front of the network, our model could achieve nearly real-time speed.
arXiv Detail & Related papers (2020-02-16T06:09:22Z)