SPOT-Occ: Sparse Prototype-guided Transformer for Camera-based 3D Occupancy Prediction
- URL: http://arxiv.org/abs/2602.04240v1
- Date: Wed, 04 Feb 2026 05:59:24 GMT
- Title: SPOT-Occ: Sparse Prototype-guided Transformer for Camera-based 3D Occupancy Prediction
- Authors: Suzeyu Chen, Leheng Li, Ying-Cong Chen,
- Abstract summary: We propose a Prototype-based Sparse Transformer Decoder.<n>Our core idea is to make the decoder's attention prototype-guided.<n>Our model, dubbed SPOT-Occ, outperforms previous methods with a significant margin in speed.
- Score: 33.82331348959048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving highly accurate and real-time 3D occupancy prediction from cameras is a critical requirement for the safe and practical deployment of autonomous vehicles. While this shift to sparse 3D representations solves the encoding bottleneck, it creates a new challenge for the decoder: how to efficiently aggregate information from a sparse, non-uniformly distributed set of voxel features without resorting to computationally prohibitive dense attention. In this paper, we propose a novel Prototype-based Sparse Transformer Decoder that replaces this costly interaction with an efficient, two-stage process of guided feature selection and focused aggregation. Our core idea is to make the decoder's attention prototype-guided. We achieve this through a sparse prototype selection mechanism, where each query adaptively identifies a compact set of the most salient voxel features, termed prototypes, for focused feature aggregation. To ensure this dynamic selection is stable and effective, we introduce a complementary denoising paradigm. This approach leverages ground-truth masks to provide explicit guidance, guaranteeing a consistent query-prototype association across decoder layers. Our model, dubbed SPOT-Occ, outperforms previous methods with a significant margin in speed while also improving accuracy. Source code is released at https://github.com/chensuzeyu/SpotOcc.
Related papers
- Leveraging Transformer Decoder for Automotive Radar Object Detection [9.764772760421792]
We present a Transformer-based architecture for 3D radar object detection that uses a novel Transformer Decoder.<n> Pyramid Token Fusion (PTF) converts a feature pyramid into a unified, scale-aware token sequence.<n>We evaluate the proposed framework on the RADDet, where it achieves significant improvements over state-of-the-art radar-only baselines.
arXiv Detail & Related papers (2026-01-19T20:44:24Z) - Enhancing Floor Plan Recognition: A Hybrid Mix-Transformer and U-Net Approach for Precise Wall Segmentation [0.0]
This study introduces MitUNet, a hybrid neural network combining a Mix-Transformer encoder and a U-Net decoder.<n>Our approach achieves a balance between precision and recall, ensuring accurate boundary recovery.<n> Experiments on the CubiCasa5k dataset and a proprietary regional dataset demonstrate MitUNet's superiority in generating structurally correct masks.
arXiv Detail & Related papers (2025-12-02T04:47:53Z) - ProtoOcc: Accurate, Efficient 3D Occupancy Prediction Using Dual Branch Encoder-Prototype Query Decoder [8.494947115633682]
ProtoOcc is a novel 3D occupancy prediction model designed to predict the occupancy states and semantic classes of 3D voxels.<n>ProtoOcc achieves state-of-the-art performance with 45.02% mIoU on the Occ3D-nuScenes benchmark.
arXiv Detail & Related papers (2024-12-11T20:55:21Z) - OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries.
OPUS incorporates a suite of non-trivial strategies to enhance model performance.
Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
arXiv Detail & Related papers (2024-09-14T07:44:22Z) - Complexity Matters: Rethinking the Latent Space for Generative Modeling [65.64763873078114]
In generative modeling, numerous successful approaches leverage a low-dimensional latent space, e.g., Stable Diffusion.
In this study, we aim to shed light on this under-explored topic by rethinking the latent space from the perspective of model complexity.
arXiv Detail & Related papers (2023-07-17T07:12:29Z) - Unifying Voxel-based Representation with Transformer for 3D Object
Detection [143.91910747605107]
We present a unified framework for multi-modality 3D object detection, named UVTR.
The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection.
UVTR achieves leading performance in the nuScenes test set with 69.7%, 55.1%, and 71.1% NDS for LiDAR, camera, and multi-modality inputs, respectively.
arXiv Detail & Related papers (2022-06-01T17:02:40Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it.
Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
arXiv Detail & Related papers (2021-09-24T20:51:21Z) - Progressive Coordinate Transforms for Monocular 3D Object Detection [52.00071336733109]
We propose a novel and lightweight approach, dubbed em Progressive Coordinate Transforms (PCT) to facilitate learning coordinate representations.
In this paper, we propose a novel and lightweight approach, dubbed em Progressive Coordinate Transforms (PCT) to facilitate learning coordinate representations.
arXiv Detail & Related papers (2021-08-12T15:22:33Z) - Dynamic Convolution for 3D Point Cloud Instance Segmentation [146.7971476424351]
We propose an approach to instance segmentation from 3D point clouds based on dynamic convolution.
We gather homogeneous points that have identical semantic categories and close votes for the geometric centroids.
The proposed approach is proposal-free, and instead exploits a convolution process that adapts to the spatial and semantic characteristics of each instance.
arXiv Detail & Related papers (2021-07-18T09:05:16Z) - SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection [9.924083358178239]
We propose two variants of self-attention for contextual modeling in 3D object detection.
We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors.
Next, we propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations.
arXiv Detail & Related papers (2021-01-07T18:30:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.