SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving
- URL: http://arxiv.org/abs/2509.16588v1
- Date: Sat, 20 Sep 2025 09:25:19 GMT
- Title: SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving
- Authors: Haiming Zhang, Yiyao Zhu, Wending Zhou, Xu Yan, Yingjie Cai, Bingbing Liu, Shuguang Cui, Zhen Li,
- Abstract summary: We introduce SQS, a novel query-based splatting pre-training for Sparse Perception Models (SPMs).
SQS predicts 3D Gaussian representations from sparse queries during pre-training, leveraging self-supervised splatting to learn fine-grained contextual features.
Experiments on autonomous driving benchmarks demonstrate that SQS delivers considerable performance gains across multiple query-based 3D perception tasks.
- Score: 56.198745862311824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse Perception Models (SPMs) adopt a query-driven paradigm that forgoes explicit dense BEV or volumetric construction, enabling highly efficient computation and accelerated inference. In this paper, we introduce SQS, a novel query-based splatting pre-training specifically designed to advance SPMs in autonomous driving. SQS introduces a plug-in module that predicts 3D Gaussian representations from sparse queries during pre-training, leveraging self-supervised splatting to learn fine-grained contextual features through the reconstruction of multi-view images and depth maps. During fine-tuning, the pre-trained Gaussian queries are seamlessly integrated into downstream networks via query interaction mechanisms that explicitly connect pre-trained queries with task-specific queries, effectively accommodating the diverse requirements of occupancy prediction and 3D object detection. Extensive experiments on autonomous driving benchmarks demonstrate that SQS delivers considerable performance gains across multiple query-based 3D perception tasks, notably in occupancy prediction and 3D object detection, outperforming prior state-of-the-art pre-training approaches by a significant margin (i.e., +1.3 mIoU on occupancy prediction and +1.0 NDS on 3D detection).
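The core idea of the abstract, mapping each sparse query to an explicit 3D Gaussian whose parameters can then be splatted for self-supervised image and depth reconstruction, could be sketched roughly as follows. This is an illustrative sketch only: the dimensions, the linear prediction head, and the parameter layout are assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 900 sparse queries, 256-d features.
NUM_QUERIES, QUERY_DIM = 900, 256
# Per-Gaussian parameters: 3 (mean) + 3 (scale) + 4 (rotation quaternion)
# + 1 (opacity) + 3 (color) = 14 values.
GAUSS_DIM = 3 + 3 + 4 + 1 + 3

# A single linear head standing in for the plug-in prediction module.
W = rng.normal(0.0, 0.02, size=(QUERY_DIM, GAUSS_DIM))
b = np.zeros(GAUSS_DIM)

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def predict_gaussians(queries: np.ndarray) -> dict:
    """Map sparse query features to constrained 3D Gaussian parameters."""
    raw = queries @ W + b
    means = raw[:, 0:3]                       # unconstrained 3D centers
    scales = np.exp(raw[:, 3:6])              # exp keeps scales positive
    quats = raw[:, 6:10]
    quats = quats / np.linalg.norm(quats, axis=1, keepdims=True)  # unit rotations
    opacity = sigmoid(raw[:, 10:11])          # opacity in (0, 1)
    colors = sigmoid(raw[:, 11:14])           # RGB in (0, 1)
    return {"mean": means, "scale": scales, "rot": quats,
            "opacity": opacity, "color": colors}

queries = rng.normal(size=(NUM_QUERIES, QUERY_DIM))
gaussians = predict_gaussians(queries)
```

In the pre-training loop described by the abstract, these per-query Gaussians would be rendered via differentiable splatting into multi-view images and depth maps, with reconstruction error as the self-supervised loss; during fine-tuning the same queries would instead interact with task-specific queries for occupancy prediction or 3D detection.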
Related papers
- ALIGN: Advanced Query Initialization with LiDAR-Image Guidance for Occlusion-Robust 3D Object Detection [16.336860116706088]
We propose ALIGN, a novel approach for object-aware query initialization.
Our model consists of three key components, including (i) Occlusion-aware Center Estimation (OCE), which integrates LiDAR geometry and image semantics.
Our experiments on the nuScenes benchmark demonstrate that ALIGN consistently improves performance across multiple state-of-the-art detectors.
arXiv Detail & Related papers (2025-12-20T02:51:00Z) - DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos [53.52664872583893]
Camera-based 3D object detection in Bird's Eye View (BEV) is one of the most important perception tasks in autonomous driving.
We propose DySS, a novel method that employs state-space learning and dynamic queries.
Our proposed DySS achieves both superior detection performance and efficient inference.
arXiv Detail & Related papers (2025-06-11T23:49:56Z) - OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries.
OPUS incorporates a suite of non-trivial strategies to enhance model performance.
Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at nearly 2x the FPS, while our heaviest model surpasses the previous best results by 6.1 RayIoU.
arXiv Detail & Related papers (2024-09-14T07:44:22Z) - Divide and Conquer: Improving Multi-Camera 3D Perception with 2D Semantic-Depth Priors and Input-Dependent Queries [30.17281824826716]
Existing techniques often neglect the synergistic effects of semantic and depth cues, leading to classification and position estimation errors.
We propose an input-aware Transformer framework that leverages Semantics and Depth as priors.
Our approach uses an S-D component that explicitly models semantic and depth priors, thereby disentangling the learning of object categorization and position estimation.
arXiv Detail & Related papers (2024-08-13T13:51:34Z) - S2-Track: A Simple yet Strong Approach for End-to-End 3D Multi-Object Tracking [38.63155724204429]
3D multiple object tracking (MOT) plays a crucial role in autonomous driving perception.
Recent end-to-end query-based trackers simultaneously detect and track objects, which have shown promising potential for the 3D MOT task.
Existing methods are still in the early stages of development and lack systematic improvements.
arXiv Detail & Related papers (2024-06-04T09:34:46Z) - OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [77.0399450848749]
We propose OccNeRF, a method for training occupancy networks without 3D supervision.
We parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range.
For semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model.
arXiv Detail & Related papers (2023-12-14T18:58:52Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences [49.91741677556553]
We propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks.
We embed each frame by an unordered set of proposal feature vectors, a representation that is natural for object detection or tracking systems.
Extensive evaluations on the BDD100K, nuImages, and MOT17 datasets show that our TempO pre-training approach outperforms single-frame self-supervised learning methods.
arXiv Detail & Related papers (2023-02-17T18:18:27Z) - 3D-QueryIS: A Query-based Framework for 3D Instance Segmentation [74.6998931386331]
Previous methods for 3D instance segmentation often suffer from inter-task dependencies and a resulting lack of robustness.
We propose a novel query-based method, termed as 3D-QueryIS, which is detector-free, semantic segmentation-free, and cluster-free.
Our 3D-QueryIS is free from the accumulated errors caused by the inter-task dependencies.
arXiv Detail & Related papers (2022-11-17T07:04:53Z) - Superquadric Object Representation for Optimization-based Semantic SLAM [31.13636619458275]
We propose a pipeline to leverage semantic mask measurements to fit SQ parameters to multi-view camera observations.
We demonstrate the system's ability to retrieve randomly generated SQ parameters from multi-view mask observations.
arXiv Detail & Related papers (2021-09-20T15:27:56Z)