Q-PETR: Quant-aware Position Embedding Transformation for Multi-View 3D Object Detection
- URL: http://arxiv.org/abs/2502.15488v2
- Date: Tue, 11 Mar 2025 15:05:41 GMT
- Title: Q-PETR: Quant-aware Position Embedding Transformation for Multi-View 3D Object Detection
- Authors: Jiangyong Yu, Changyong Shu, Dawei Yang, Sifan Zhou, Zichen Yu, Xing Hu, Yan Chen,
- Abstract summary: We propose Q-PETR, a quantization-aware position embedding transformation that re-engineers key components of the PETR framework. Q-PETR maintains floating-point performance with a performance degradation of less than 1% under standard 8-bit per-tensor post-training quantization. Compared to its FP32 counterpart, Q-PETR achieves a two-fold speedup and reduces memory usage by three times.
- Score: 9.961425621432474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Camera-based multi-view 3D detection has emerged as an attractive solution for autonomous driving due to its low cost and broad applicability. However, despite the strong performance of PETR-based methods in 3D perception benchmarks, their direct INT8 quantization for onboard deployment leads to drastic accuracy drops of up to 58.2% in mAP and 36.9% in NDS on the NuScenes dataset. In this work, we propose Q-PETR, a quantization-aware position embedding transformation that re-engineers key components of the PETR framework to reconcile the discrepancy between the dynamic ranges of positional encodings and image features, and to adapt the cross-attention mechanism for low-bit inference. By redesigning the positional encoding module and introducing an adaptive quantization strategy, Q-PETR maintains floating-point performance with a performance degradation of less than 1% under standard 8-bit per-tensor post-training quantization. Moreover, compared to its FP32 counterpart, Q-PETR achieves a two-fold speedup and reduces memory usage by three times, thereby offering a deployment-friendly solution for resource-constrained onboard devices. Extensive experiments across various PETR-series models validate the strong generalization and practical benefits of our approach.
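The failure mode the abstract describes (per-tensor INT8 scales dominated by the wide dynamic range of the positional encodings relative to the image features) can be illustrated in a few lines of NumPy. This is a minimal sketch with synthetic, assumed value ranges, not the paper's implementation or its actual tensors:

```python
import numpy as np

def quantize_per_tensor_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: a single scale for the whole tensor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)

# Illustrative stand-ins (not the paper's real tensors): image features with a
# narrow dynamic range, and positional encodings with a wide, heavy-tailed range.
img_feat = rng.normal(0.0, 1.0, size=(1000,)).astype(np.float32)
pos_emb = rng.normal(0.0, 1.0, size=(1000,)).astype(np.float32)
pos_emb[::50] *= 40.0  # a few large outliers blow up the per-tensor scale

for name, x in [("image features", img_feat), ("positional encodings", pos_emb)]:
    q, s = quantize_per_tensor_int8(x)
    err = np.abs(dequantize(q, s) - x).mean()
    print(f"{name:22s} scale={s:.4f}  mean |error|={err:.4f}")
```

The tensor with a few large outliers receives a coarse scale, so most of its small values collapse into a handful of INT8 bins; redesigning the positional encoding module and adapting the quantization strategy, as the abstract states, targets exactly this mismatch.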
Related papers
- Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques [0.0]
We show that float8 quantization achieves a 4x storage reduction with minimal performance degradation.
PCA emerges as the most effective dimensionality reduction technique.
We propose a methodology based on visualizing the performance-storage trade-off space to identify the optimal configuration (a toy sketch of this trade-off appears after the related-papers list below).
arXiv Detail & Related papers (2025-04-30T18:20:16Z) - RoPETR: Improving Temporal Camera-Only 3D Detection by Integrating Enhanced Rotary Position Embedding [7.142677515668237]
This report introduces a targeted improvement to the StreamPETR framework, specifically aimed at enhancing velocity estimation.
Our improved approach achieves a state-of-the-art NDS of 70.86% using the ViT-L backbone, setting a new benchmark for camera-only 3D object detection.
arXiv Detail & Related papers (2025-04-17T05:05:31Z) - "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z) - LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection [35.35457515189062]
Post-Training Quantization (PTQ) has been widely adopted in 2D vision tasks.
LiDAR-PTQ can achieve state-of-the-art quantization performance when applied to CenterPoint.
LiDAR-PTQ is cost-effective, being 30× faster than the quantization-aware training method.
arXiv Detail & Related papers (2024-01-29T03:35:55Z) - Point Transformer V3: Simpler, Faster, Stronger [88.80496333515325]
This paper focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing.
We present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms.
PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios.
arXiv Detail & Related papers (2023-12-15T18:59:59Z) - Towards Clip-Free Quantized Super-Resolution Networks: How to Tame Representative Images [16.18371675853725]
This study focuses on a very important but mostly overlooked post-training quantization step: the representative dataset (RD).
We propose a novel pipeline (clip-free quantization pipeline, CFQP) backed up with extensive experimental justifications to cleverly augment RD images by only using outputs of the FP32 model.
arXiv Detail & Related papers (2023-08-22T11:41:08Z) - QD-BEV : Quantization-aware View-guided Distillation for Multi-view 3D Object Detection [57.019527599167255]
Multi-view 3D detection based on BEV (bird's-eye view) has recently achieved significant improvements.
We show in our paper that directly applying quantization in BEV tasks will 1) make the training unstable, and 2) lead to intolerable performance degradation.
Our method QD-BEV enables a novel view-guided distillation (VGD) objective, which can stabilize the quantization-aware training (QAT) while enhancing the model performance.
arXiv Detail & Related papers (2023-08-21T07:06:49Z) - Improving Post-Training Quantization on Object Detection with Task Loss-Guided Lp Metric [43.81334288840746]
Post-Training Quantization (PTQ) converts a full-precision model directly to a low bit-width representation.
PTQ suffers severe accuracy drop when applied to complex tasks such as object detection.
DetPTQ employs the ODOL-based adaptive Lp metric to select the optimal quantization parameters.
arXiv Detail & Related papers (2023-04-19T16:11:21Z) - Focal-PETR: Embracing Foreground for Efficient Multi-Camera 3D Object Detection [11.13693561702228]
The dominant multi-camera 3D detection paradigm is based on explicit 3D feature construction.
Other methods implicitly introduce geometric positional encoding to build the relationship between image tokens and 3D objects.
We propose Focal-PETR with instance-guided supervision and spatial alignment module.
arXiv Detail & Related papers (2022-12-11T13:38:54Z) - Distortion-Aware Loop Filtering of Intra 360° Video Coding with Equirectangular Projection [81.63407194858854]
We propose a distortion-aware loop filtering model to improve the performance of intra coding for 360° videos projected via the equirectangular projection (ERP) format.
Our proposed module analyzes content characteristics based on a coding unit (CU) partition mask and processes them through partial convolution to activate the specified area.
arXiv Detail & Related papers (2022-02-20T12:00:18Z) - Improving 3D Object Detection with Channel-wise Transformer [58.668922561622466]
We propose a two-stage 3D object detection framework (CT3D) with minimal hand-crafted design.
CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation.
It achieves the AP of 81.77% in the moderate car category on the KITTI test 3D detection benchmark.
arXiv Detail & Related papers (2021-08-23T02:03:40Z) - Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning [5.236787242129767]
We present a novel 3D Transformer, called Point-Voxel Transformer (PVT) that leverages self-attention computation in points to gather global context features.
Our method fully exploits the potentials of Transformer architecture, paving the road to efficient and accurate recognition results.
arXiv Detail & Related papers (2021-08-13T06:07:57Z) - Uncertainty-Aware Camera Pose Estimation from Points and Lines [101.03675842534415]
Perspective-n-Point-and-Line (PnPL) aims at fast, accurate and robust camera localizations with respect to a 3D model from 2D-3D feature coordinates.
arXiv Detail & Related papers (2021-07-08T15:19:36Z) - Simple Training Strategies and Model Scaling for Object Detection [38.27709720726833]
We benchmark improvements on the vanilla ResNet-FPN backbone with RetinaNet and RCNN detectors.
The vanilla detectors are improved by 7.7% in accuracy while being 30% faster in speed.
Our largest Cascade RCNN-RS models achieve 52.9% AP with a ResNet152-FPN backbone and 53.6% with a SpineNet143L backbone.
arXiv Detail & Related papers (2021-06-30T18:41:47Z) - Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers.
We can obtain an 81.29% top-1 accuracy using DeiT-B model on ImageNet dataset with about 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z) - Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them, however, the probability of effective samples is relatively small in the 3D space.
We propose to start with an initial prediction and refine it gradually towards the ground truth, with only one 3D parameter changed in each step.
This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it.
arXiv Detail & Related papers (2020-08-31T17:10:48Z) - 3DSSD: Point-based 3D Single Stage Object Detector [61.67928229961813]
We present a point-based 3D single stage object detector, named 3DSSD, achieving a good balance between accuracy and efficiency.
Our method outperforms all state-of-the-art voxel-based single stage methods by a large margin, and has comparable performance to two stage point-based methods as well.
arXiv Detail & Related papers (2020-02-24T12:01:58Z)
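Returning to the embeddings-storage entry above (float8 quantization plus PCA), the storage trade-off it describes can be sketched briefly. NumPy has no native float8 type, so this toy example uses symmetric 8-bit integer quantization as a stand-in and a plain SVD-based PCA; the matrix sizes and data are illustrative assumptions, not that paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)
emb = rng.normal(size=(2048, 768)).astype(np.float32)   # toy embedding matrix

# PCA via SVD: keep the top-k principal components.
k = 256
mean = emb.mean(axis=0, keepdims=True)
U, S, Vt = np.linalg.svd(emb - mean, full_matrices=False)
reduced = (emb - mean) @ Vt[:k].T                        # (2048, k) float32

# 8-bit quantization of the reduced embeddings (one scale per vector).
scale = np.abs(reduced).max(axis=1, keepdims=True) / 127.0
q = np.clip(np.round(reduced / scale), -127, 127).astype(np.int8)

# Compare stored bytes: original float32 matrix vs. int8 codes plus the
# scales, the kept components, and the mean needed for reconstruction.
orig_bytes = emb.nbytes
new_bytes = q.nbytes + scale.nbytes + Vt[:k].nbytes + mean.nbytes
print(f"storage reduction: {orig_bytes / new_bytes:.1f}x")

# Rough quality check: dequantize, project back, and measure the error.
recon = (q.astype(np.float32) * scale) @ Vt[:k] + mean
print("mean |reconstruction error|:", float(np.abs(recon - emb).mean()))
```

On real embeddings (which are far more compressible than this random matrix), the reconstruction error would be much smaller for the same storage reduction, which is the trade-off space that entry proposes to visualize.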
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.