FastBEV++: Fast by Algorithm, Deployable by Design
- URL: http://arxiv.org/abs/2512.08237v1
- Date: Tue, 09 Dec 2025 04:37:46 GMT
- Title: FastBEV++: Fast by Algorithm, Deployable by Design
- Authors: Yuanpeng Chen, Hui Song, Wei Tao, ShanHui Mo, Shuang Zhang, Xiao Hua, TianKun Zhao
- Abstract summary: This paper introduces FastBEV++, a framework engineered to reconcile state-of-the-art performance and on-vehicle deployment tractability. We realize the "Deployable by Design" principle through a novel view transformation paradigm that decomposes the monolithic projection into a standard Index-Gather-Reshape pipeline.
- Score: 5.339716421285263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advancement of camera-only Bird's-Eye-View (BEV) perception is currently impeded by a fundamental tension between state-of-the-art performance and on-vehicle deployment tractability. This bottleneck stems from a deep-rooted dependency on computationally prohibitive view transformations and bespoke, platform-specific kernels. This paper introduces FastBEV++, a framework engineered to reconcile this tension, demonstrating that high performance and deployment efficiency can be achieved in unison via two guiding principles: Fast by Algorithm and Deployable by Design. We realize the "Deployable by Design" principle through a novel view transformation paradigm that decomposes the monolithic projection into a standard Index-Gather-Reshape pipeline. Enabled by a deterministic pre-sorting strategy, this transformation is executed entirely with elementary, operator-native primitives (e.g., Gather, Matrix Multiplication), which eliminates the need for specialized CUDA kernels and ensures fully TensorRT-native portability. Concurrently, our framework is "Fast by Algorithm", leveraging this decomposed structure to seamlessly integrate an end-to-end, depth-aware fusion mechanism. This jointly learned depth modulation, further bolstered by temporal aggregation and robust data augmentation, significantly enhances the geometric fidelity of the BEV representation. Empirical validation on the nuScenes benchmark corroborates the efficacy of our approach. FastBEV++ establishes a new state-of-the-art of 0.359 NDS while maintaining exceptional real-time performance, exceeding 134 FPS on automotive-grade hardware (e.g., the Tesla T4). By offering a solution that is free of custom plugins yet highly accurate, FastBEV++ presents a mature and scalable design philosophy for production autonomous systems. The code is released at: https://github.com/ymlab/advanced-fastbev
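To make the Index-Gather-Reshape idea concrete, below is a minimal PyTorch sketch of such a transform. It assumes a lookup table (here called `lut`) that maps each BEV voxel to one flattened camera-feature index, precomputed offline from camera geometry and deterministically pre-sorted; all names, shapes, and the one-pixel-per-voxel simplification are illustrative assumptions, not the released FastBEV++ implementation.

```python
# Hedged sketch of an Index-Gather-Reshape view transform. The index table
# `lut` and validity mask `valid` are assumed precomputed offline (Index step)
# and deterministically pre-sorted, so the runtime graph below contains only
# Reshape and Gather ops of the kind TensorRT executes natively.
import torch

def view_transform(cam_feats: torch.Tensor,    # (N_cam, C, H, W) image features
                   lut: torch.Tensor,          # (X*Y*Z,) flat camera indices
                   valid: torch.Tensor,        # (X*Y*Z,) bool: voxel is visible
                   bev_shape=(200, 200, 4)) -> torch.Tensor:
    N, C, H, W = cam_feats.shape
    flat = cam_feats.permute(1, 0, 2, 3).reshape(C, N * H * W)  # Reshape
    bev = flat[:, lut] * valid.to(flat.dtype)                   # Gather + mask
    X, Y, Z = bev_shape
    return bev.reshape(C, X, Y, Z)                              # Reshape

# Toy usage with random geometry, just to show the shapes involved.
feats = torch.randn(6, 64, 32, 88)
lut = torch.randint(0, 6 * 32 * 88, (200 * 200 * 4,))
valid = torch.rand(200 * 200 * 4) > 0.5
print(view_transform(feats, lut, valid).shape)  # torch.Size([64, 200, 200, 4])
```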
Related papers
- AutoNeural: Co-Designing Vision-Language Models for NPU Inference [24.05617280495125]
AutoNeural is an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions. Our approach delivers substantial efficiency gains, reducing the quantization error of the vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines.
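As background for the backbone choice mentioned above, a depthwise separable convolution factors a standard convolution into a per-channel spatial filter followed by a 1x1 channel mixer, which cuts multiply-accumulates sharply and quantizes well. The block below is a generic MobileNet-style sketch with illustrative sizes; it is not AutoNeural's actual block definition.

```python
# Generic depthwise separable block (MobileNet family); sizes and the
# quantization-friendly ReLU6 activation are illustrative assumptions.
import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                                   groups=in_ch, bias=False)   # per-channel spatial filter
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # 1x1 channel mixing
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

print(DepthwiseSeparable(32, 64)(torch.randn(1, 32, 56, 56)).shape)  # (1, 64, 56, 56)
```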
arXiv Detail & Related papers (2025-12-02T16:45:25Z)
- Rethinking Vision Transformer Depth via Structural Reparameterization [16.12815682992294]
We propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models. When applied to ViT-Tiny, the framework reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K.
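The consolidation step rests on a simple linearity identity: parallel linear branches sharing the same input sum to a single linear layer. Below is a minimal sketch of that merge; the paper's actual branch topology inside transformer blocks is more involved.

```python
# Minimal reparameterization identity: two parallel linear branches are
# folded into one linear layer at deployment time. Dimensions are arbitrary.
import torch
import torch.nn as nn

w1, b1 = torch.randn(64, 64), torch.randn(64)
w2, b2 = torch.randn(64, 64), torch.randn(64)
x = torch.randn(8, 64)

two_branch = (x @ w1.T + b1) + (x @ w2.T + b2)   # training-time parallel paths

merged = nn.Linear(64, 64)
with torch.no_grad():
    merged.weight.copy_(w1 + w2)                  # weights add elementwise
    merged.bias.copy_(b1 + b2)

assert torch.allclose(two_branch, merged(x), atol=1e-4)  # identical output
```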
arXiv Detail & Related papers (2025-11-14T06:27:58Z)
- Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation [75.58269386927076]
Autoregressive (AR) models are often dismissed as impractical due to prohibitive computational cost. This work re-thinks this paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation. Experiments on diverse datasets (natural, satellite, medical) validate that our method achieves new state-of-the-art compression.
arXiv Detail & Related papers (2025-10-20T16:15:03Z)
- Elastic ViTs from Pretrained Models without Retraining [74.5386166956142]
Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm.
arXiv Detail & Related papers (2025-10-16T18:00:00Z)
- MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception. However, high computational cost limits adoption on resource-constrained platforms. We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-01-31T16:11:04Z)
- ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
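A common way to realize such a hybrid is to keep convolutions for local detail and run self-attention only on the low-resolution bottleneck map, where its quadratic cost is affordable. The module below is a hedged, generic sketch of that pattern, not ContextFormer's published architecture; all layer choices are assumptions.

```python
# Generic CNN + attention bottleneck: depthwise/pointwise convs for local
# dependencies, one multi-head self-attention pass for global context.
import torch
import torch.nn as nn

class HybridBottleneck(nn.Module):
    def __init__(self, ch: int = 256, heads: int = 4):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),    # local spatial mixing
            nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.norm = nn.LayerNorm(ch)
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        x = self.local(x)
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)                   # (B, H*W, C) tokens
        n = self.norm(t)
        t = t + self.attn(n, n, n, need_weights=False)[0]  # global context
        return t.transpose(1, 2).reshape(B, C, H, W)

print(HybridBottleneck()(torch.randn(1, 256, 16, 16)).shape)  # (1, 256, 16, 16)
```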
arXiv Detail & Related papers (2024-07-02T08:58:19Z)
- LPViT: Low-Power Semi-structured Pruning for Vision Transformers [43.126752035656196]
Vision transformers have emerged as a promising alternative to convolutional neural networks for image analysis tasks. One significant drawback of ViTs is their resource-intensive nature, leading to increased memory footprint, complexity, and power consumption. We introduce a new block-structured pruning approach to address the resource-intensive issue for ViTs, offering a balanced trade-off between accuracy and hardware acceleration.
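For intuition, block-structured pruning zeroes whole fixed-size tiles of a weight matrix rather than individual scalars, so the surviving pattern maps onto dense hardware kernels. Below is a hedged sketch of one plausible block-pruning step, with the block size and L2 saliency criterion as assumptions rather than LPViT's exact method.

```python
# Toy block-structured pruning: tile the weight matrix, score tiles by L2
# norm, zero the weakest fraction. Block size and criterion are assumptions.
import torch

def block_prune(w: torch.Tensor, block: int = 16, sparsity: float = 0.5):
    out_dim, in_dim = w.shape                        # both divisible by `block`
    tiles = w.reshape(out_dim // block, block, in_dim // block, block)
    norms = tiles.pow(2).sum(dim=(1, 3)).sqrt()      # one score per tile
    k = max(1, int(norms.numel() * sparsity))
    thresh = norms.flatten().kthvalue(k).values      # k-th smallest tile score
    mask = (norms > thresh).to(w.dtype)[:, None, :, None]
    return (tiles * mask).reshape(out_dim, in_dim)   # weak tiles zeroed

w = torch.randn(128, 256)
print((block_prune(w) == 0).float().mean())          # ~0.5 of weights removed
```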
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation [46.02120172459727]
We propose to relax the requirement of reconstructing an intermediate frame as close to the ground-truth (GT) as possible.
We develop a texture consistency loss (TCL) upon the assumption that the interpolated content should maintain similar structures with their counterparts in the given frames.
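One way to encode that assumption is a patch-matching objective: each predicted patch is scored against its nearest patch in the two input frames instead of a pixel-exact GT target. The sketch below is a hedged illustration of that idea; the patch size and L1 matching criterion are assumptions, not the paper's exact TCL formulation.

```python
# Illustrative texture-consistency-style loss: every predicted patch should
# resemble *some* patch of the given frames; assumptions noted in the lead-in.
import torch
import torch.nn.functional as F

def texture_consistency_loss(pred, frame0, frame1, patch: int = 8):
    def to_patches(x):                               # (B,C,H,W) -> (B,N,C*p*p)
        return F.unfold(x, kernel_size=patch, stride=patch).transpose(1, 2)
    q = to_patches(pred)
    k = torch.cat([to_patches(frame0), to_patches(frame1)], dim=1)
    d = torch.cdist(q, k, p=1)                       # (B, Nq, Nk) L1 distances
    return d.min(dim=-1).values.mean()               # best-match distance only

a, b, c = (torch.rand(1, 3, 64, 64) for _ in range(3))
print(texture_consistency_loss(a, b, c))
```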
arXiv Detail & Related papers (2022-03-19T10:37:06Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) approach, Soft Actor-Critic for discrete settings (SAC-d), which generates the exit point, partition point, and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a scheme can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- SPViT: Enabling Faster Vision Transformers via Soft Token Pruning [38.10083471492964]
Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures.
We propose a computation-aware soft pruning framework, which can be set up on vanilla Transformers of both flat and CNN-type structures.
Our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification.
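"Soft" pruning here means uninformative tokens are not discarded outright; one common realization is to fuse them into a single package token so their information is retained cheaply. Below is a hedged sketch of that mechanism with an assumed importance score; SPViT's learned selector and exact packaging are not reproduced.

```python
# Toy soft token pruning: keep the top-k tokens, fuse the rest into one
# score-weighted "package" token. Scoring and k are illustrative assumptions.
import torch

def soft_prune(tokens: torch.Tensor, scores: torch.Tensor, keep: int):
    """tokens: (B, N, D); scores: (B, N) per-token importance."""
    idx = scores.argsort(dim=1, descending=True)
    d = tokens.size(-1)
    take = lambda i: tokens.gather(1, i.unsqueeze(-1).expand(-1, -1, d))
    kept = take(idx[:, :keep])
    dropped = take(idx[:, keep:])
    w = scores.gather(1, idx[:, keep:]).softmax(dim=1).unsqueeze(-1)
    package = (w * dropped).sum(dim=1, keepdim=True)   # (B, 1, D) fused token
    return torch.cat([kept, package], dim=1)           # (B, keep + 1, D)

t, s = torch.randn(2, 197, 384), torch.randn(2, 197)
print(soft_prune(t, s, keep=98).shape)                 # torch.Size([2, 99, 384])
```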
arXiv Detail & Related papers (2021-12-27T20:15:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.