Related papers: 3DPillars: Pillar-based two-stage 3D object detection

URL: http://arxiv.org/abs/2509.05780v1
Date: Sat, 06 Sep 2025 17:23:01 GMT
Title: 3DPillars: Pillar-based two-stage 3D object detection
Authors: Jongyoun Noh, Junghyup Lee, Hyekang Park, Bumsub Ham
Abstract summary: PointPillars is the fastest 3D object detector that exploits pseudo image representations to encode features for 3D objects in a scene. We introduce in this paper the first two-stage 3D detection framework exploiting pseudo image representations.
Score: 29.757231369014068
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: PointPillars is the fastest 3D object detector that exploits pseudo image representations to encode features for 3D objects in a scene. Albeit efficient, PointPillars is typically outperformed by state-of-the-art 3D detection methods due to the following limitations: 1) The pseudo image representations fail to preserve precise 3D structures, and 2) they make it difficult to adopt a two-stage detection pipeline using 3D object proposals that typically shows better performance than a single-stage approach. We introduce in this paper the first two-stage 3D detection framework exploiting pseudo image representations, narrowing the performance gaps between PointPillars and state-of-the-art methods, while retaining its efficiency. Our framework consists of two novel components that overcome the aforementioned limitations of PointPillars: First, we introduce a new CNN architecture, dubbed 3DPillars, that enables learning 3D voxel-based features from the pseudo image representation efficiently using 2D convolutions. The basic idea behind 3DPillars is that 3D features from voxels can be viewed as a stack of pseudo images. To implement this idea, we propose a separable voxel feature module that extracts voxel-based features without using 3D convolutions. Second, we introduce an RoI head with a sparse scene context feature module that aggregates multi-scale features from 3DPillars to obtain a sparse scene feature. This enables adopting a two-stage pipeline effectively, and fully leveraging contextual information of a scene to refine 3D object proposals. Experimental results on the KITTI and Waymo Open datasets demonstrate the effectiveness and efficiency of our approach, achieving a good compromise in terms of speed and accuracy.
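
The abstract's idea that 3D voxel features can be viewed as a stack of pseudo images can be illustrated with a small sketch. This is not the authors' separable voxel feature module, whose details the abstract does not give. It assumes a single-channel volume stored as nested lists indexed [depth][height][width], and every function name here is hypothetical. It only shows the general principle that a 2D filter applied to each slice, followed by a 1D filter along depth, can stand in for a 3D filter.

```python
# Illustrative sketch only: names and data layout are assumptions, not the paper's method.

def conv2d_same(image, kernel):
    """Zero-padded 2D cross-correlation of one slice with an odd-sized square kernel."""
    h, w = len(image), len(image[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy + k][dx + k]
            out[y][x] = acc
    return out


def conv1d_depth(volume, weights):
    """Zero-padded 1D filter along the depth axis, mixing neighbouring slices."""
    d, h, w = len(volume), len(volume[0]), len(volume[0][0])
    k = len(weights) // 2
    out = [[[0.0] * w for _ in range(h)] for _ in range(d)]
    for z in range(d):
        for dz in range(-k, k + 1):
            zz = z + dz
            if 0 <= zz < d:
                wt = weights[dz + k]
                for y in range(h):
                    for x in range(w):
                        out[z][y][x] += wt * volume[zz][y][x]
    return out


def separable_voxel_filter(volume, kernel2d, kernel1d):
    """Treat the volume as a stack of pseudo images: 2D filter per slice, then 1D over depth."""
    per_slice = [conv2d_same(slice_, kernel2d) for slice_ in volume]
    return conv1d_depth(per_slice, kernel1d)


# A 3x3x3 volume with a single active voxel at the centre.
volume = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
volume[1][1][1] = 1.0

box2d = [[1.0] * 3 for _ in range(3)]  # 3x3 kernel of ones
box1d = [1.0, 1.0, 1.0]                # depth kernel of ones

result = separable_voxel_filter(volume, box2d, box1d)
print(result[0][0][0], result[1][1][1], result[2][2][2])
```

With these all-ones kernels the 2D and 1D passes together act like a 3x3x3 box filter, so the single active voxel spreads to every cell of the small volume. A learned module would use multi-channel weights rather than fixed kernels.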

Related papers

TUN3D: Towards Real-World Scene Understanding from Unposed Images [11.23080017635425]
TUN3D is a new method that tackles joint layout estimation and 3D object detection in real scans. It does not require ground-truth camera poses or depth supervision. It achieves state-of-the-art performance across three challenging scene understanding benchmarks.
arXiv Detail & Related papers (2025-09-23T20:24:07Z)
TSP3D: Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding [74.033589504806]
We propose an efficient multi-level convolution architecture for 3D visual grounding. Our method achieves top inference speed and surpasses previous fastest method by 100% FPS.
arXiv Detail & Related papers (2025-02-14T18:59:59Z)
3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images. We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image. We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z)
MS23D: A 3D Object Detection Method Using Multi-Scale Semantic Feature Points to Construct 3D Feature Layer [4.644319899528183]
LiDAR point clouds can effectively depict the motion and posture of objects in three-dimensional space. In autonomous driving scenarios, the sparsity and hollowness of point clouds create some difficulties for voxel-based methods. We propose a two-stage 3D object detection framework, called MS23D.
arXiv Detail & Related papers (2023-08-31T08:03:25Z)
Unleash the Potential of Image Branch for Cross-modal 3D Object Detection [67.94357336206136]
We present a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch from two aspects. First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation. Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch.
arXiv Detail & Related papers (2023-01-22T08:26:58Z)
Consistency of Implicit and Explicit Features Matters for Monocular 3D Object Detection [4.189643331553922]
Monocular 3D object detection is a common solution for low-cost autonomous agents to perceive their surroundings. We present CIEF, with the first orientation-aware image backbone to eliminate the disparity of implicit and explicit features in subsequent 3D representation. CIEF ranked 1st among all reported methods on both the 3D and BEV detection benchmarks of KITTI at submission time.
arXiv Detail & Related papers (2022-07-16T13:00:32Z)
VPIT: Real-time Embedded Single Object 3D Tracking Using Voxel Pseudo Images [90.60881721134656]
We propose a novel voxel-based 3D single object tracking (3D SOT) method called Voxel Pseudo Image Tracking (VPIT). Experiments on KITTI Tracking dataset show that VPIT is the fastest 3D SOT method and maintains competitive Success and Precision values.
arXiv Detail & Related papers (2022-06-06T14:02:06Z)
Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data [80.14669385741202]
We propose a self-supervised pre-training method for 3D perception models tailored to autonomous driving data. We leverage the availability of synchronized and calibrated image and Lidar sensors in autonomous driving setups. Our method does not require any point cloud nor image annotations.
arXiv Detail & Related papers (2022-03-30T12:40:30Z)
Improving 3D Object Detection with Channel-wise Transformer [58.668922561622466]
We propose a two-stage 3D object detection framework (CT3D) with minimal hand-crafted design. CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation. It achieves the AP of 81.77% in the moderate car category on the KITTI test 3D detection benchmark.
arXiv Detail & Related papers (2021-08-23T02:03:40Z)
HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection [39.64891219500416]
3D object detection methods exploit either voxel-based or point-based features to represent 3D objects in a scene. We introduce in this paper a novel single-stage 3D detection method having the merit of both voxel-based and point-based features.
arXiv Detail & Related papers (2021-04-02T06:34:49Z)
Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras. We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points. Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation [3.1542695050861544]
Estimating 3D orientation and translation of objects is essential for infrastructure-less autonomous navigation and driving. We propose a novel 3D object detection method, named SMOKE, that combines a single keypoint estimate with regressed 3D variables. Despite its structural simplicity, our proposed SMOKE network outperforms all existing monocular 3D detection methods on the KITTI dataset.
arXiv Detail & Related papers (2020-02-24T08:15:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.