OpenBox: Annotate Any Bounding Boxes in 3D
- URL: http://arxiv.org/abs/2512.01352v1
- Date: Mon, 01 Dec 2025 07:04:48 GMT
- Title: OpenBox: Annotate Any Bounding Boxes in 3D
- Authors: In-Jae Lee, Mungyeom Kim, Kwonyoung Ryu, Pierre Musacchio, Jaesik Park
- Abstract summary: We propose OpenBox, a two-stage automatic annotation pipeline for 3D object detection. OpenBox associates instance-level cues from 2D images processed by a vision foundation model with the corresponding 3D point clouds. It categorizes instances by rigidity and motion state, then generates adaptive bounding boxes with class-specific size statistics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unsupervised and open-vocabulary 3D object detection has recently gained attention, particularly in autonomous driving, where reducing annotation costs and recognizing unseen objects are critical for both safety and scalability. However, most existing approaches uniformly annotate 3D bounding boxes, ignore objects' physical states, and require multiple self-training iterations for annotation refinement, resulting in suboptimal quality and substantial computational overhead. To address these challenges, we propose OpenBox, a two-stage automatic annotation pipeline that leverages a 2D vision foundation model. In the first stage, OpenBox associates instance-level cues from 2D images processed by a vision foundation model with the corresponding 3D point clouds via cross-modal instance alignment. In the second stage, it categorizes instances by rigidity and motion state, then generates adaptive bounding boxes with class-specific size statistics. As a result, OpenBox produces high-quality 3D bounding box annotations without requiring self-training. Experiments on the Waymo Open Dataset, the Lyft Level 5 Perception dataset, and the nuScenes dataset demonstrate improved accuracy and efficiency over baselines.
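The two stages described in the abstract can be illustrated with a minimal sketch: stage 1 projects LiDAR points into a 2D instance mask to gather one instance's points (cross-modal alignment), and stage 2 fits a box whose size is adjusted with a class-specific size prior. All function names, the pinhole projection, and the size prior here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def project_points(points_xyz, K):
    """Project camera-frame 3D points (N,3) to pixel coordinates (N,2)
    with a pinhole intrinsic matrix K (3,3)."""
    uvw = points_xyz @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def points_in_mask(points_xyz, mask, K):
    """Stage-1 sketch: keep points whose projection lands inside the
    2D instance mask (H,W boolean array)."""
    uv = np.round(project_points(points_xyz, K)).astype(int)
    h, w = mask.shape
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    hit = np.zeros(len(points_xyz), dtype=bool)
    hit[valid] = mask[uv[valid, 1], uv[valid, 0]]
    return points_xyz[hit]

def adaptive_box(instance_points, class_size_prior=None):
    """Stage-2 sketch: axis-aligned box from the instance points,
    optionally widened by a class-specific size prior, since partial
    views (occlusion, sparsity) underestimate true extent."""
    lo, hi = instance_points.min(axis=0), instance_points.max(axis=0)
    center, size = (lo + hi) / 2.0, hi - lo
    if class_size_prior is not None:
        size = np.maximum(size, class_size_prior)
    return center, size
```

A real pipeline would additionally orient the box from motion or geometry cues and handle rigidity/motion categorization, which this sketch omits.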
Related papers
- VSRD++: Autolabeling for 3D Object Detection via Instance-Aware Volumetric Silhouette Rendering [18.77072205559739]
VSRD++ is a novel weakly supervised framework for monocular 3D object detection. It eliminates the reliance on 3D annotations and leverages neural-field-based volumetric rendering. In the monocular 3D object detection phase, the optimized 3D bounding boxes serve as pseudo labels.
arXiv Detail & Related papers (2025-12-01T01:28:35Z)
- OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion [89.98812408058336]
We introduce OpenInsGaussian, an Open-vocabulary Instance Gaussian segmentation framework with Context-aware Cross-view Fusion. OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin.
arXiv Detail & Related papers (2025-10-21T03:24:12Z)
- Sparse Multiview Open-Vocabulary 3D Detection [27.57172918603858]
3D object detection has traditionally been solved by training to detect a fixed set of categories. In this work, we investigate open-vocabulary 3D object detection in the challenging yet practical sparse-view setting. Our approach is training-free, relying on pre-trained, off-the-shelf 2D foundation models instead of employing computationally expensive 3D feature fusion.
arXiv Detail & Related papers (2025-09-19T12:22:24Z)
- Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data [57.53523870705433]
We propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det.
OVM3D-Det does not require high-precision LiDAR or 3D sensor data for either input or generating 3D bounding boxes.
It employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors.
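The pseudo-LiDAR idea mentioned above rests on back-projecting a predicted depth map into a camera-frame point cloud, which can then be treated like LiDAR for box fitting. A minimal sketch of that back-projection step, assuming a standard pinhole intrinsic matrix (not OVM3D-Det's actual code):

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """Back-project a depth map (H,W) into camera-frame 3D points
    (H*W, 3) using pinhole intrinsics K: x=(u-cx)*z/fx, y=(v-cy)*z/fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    return np.column_stack([x, y, z])
```

The quality of the resulting pseudo-LiDAR, and hence of the auto-labeled boxes, is bounded by the accuracy of the monocular depth estimate.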
arXiv Detail & Related papers (2024-11-23T21:37:21Z)
- Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts [50.181870446016376]
This paper proposes an algorithm for automatically labeling 3D objects from 2D point or box prompts.
Unlike previous arts, our auto-labeler predicts 3D shapes instead of bounding boxes and does not require training on a specific dataset.
arXiv Detail & Related papers (2024-07-16T04:53:28Z)
- Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance [49.14140194332482]
We introduce Open3DIS, a novel solution designed to tackle the problem of Open-Vocabulary Instance Segmentation within 3D scenes.
Objects within 3D environments exhibit diverse shapes, scales, and colors, making precise instance-level identification a challenging task.
arXiv Detail & Related papers (2023-12-17T10:07:03Z)
- Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance [72.6809373191638]
We propose a framework to study how to leverage constraints between 2D and 3D domains without requiring any 3D labels.
First, we design a feature-level constraint to align LiDAR and image features based on object-aware regions.
Second, the output-level constraint is developed to enforce the overlap between 2D and projected 3D box estimations.
Third, the training-level constraint is utilized by producing accurate and consistent 3D pseudo-labels that align with the visual data.
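One plausible form of the output-level constraint above is an IoU between the detected 2D box and the tightest image-plane rectangle around the projected corners of the 3D box estimate. The sketch below is an illustrative assumption of that computation, not the paper's actual loss:

```python
import numpy as np

def project_box_to_2d(corners_3d, K):
    """Tightest image-plane rectangle (x1, y1, x2, y2) around the
    projected corners of a 3D box, given pinhole intrinsics K."""
    uvw = corners_3d @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]
    return np.array([uv[:, 0].min(), uv[:, 1].min(),
                     uv[:, 0].max(), uv[:, 1].max()])

def iou_2d(a, b):
    """IoU of two axis-aligned rectangles given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0
```

Maximizing this overlap during training ties the 3D estimate to the 2D evidence without any 3D labels.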
arXiv Detail & Related papers (2023-12-12T18:57:25Z)
- When 3D Bounding-Box Meets SAM: Point Cloud Instance Segmentation with Weak-and-Noisy Supervision [20.625754683390536]
We propose a complementary image prompt-induced weakly-supervised point cloud instance segmentation (CIP-WPIS) method.
We leverage pretrained knowledge embedded in the 2D foundation model SAM and 3D geometric prior to achieve accurate point-wise instance labels.
Our method is robust against noisy 3D bounding-box annotations and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-09-02T05:17:03Z)
- Box2Mask: Weakly Supervised 3D Semantic Instance Segmentation Using Bounding Boxes [38.60444957213202]
We look at weakly-supervised 3D semantic instance segmentation.
The key idea is to leverage 3D bounding box labels, which are easier and faster to annotate.
We show that it is possible to train dense segmentation models using only bounding box labels.
arXiv Detail & Related papers (2022-06-02T17:59:57Z)
- Lifting 2D Object Locations to 3D by Discounting LiDAR Outliers across Objects and Views [70.1586005070678]
We present a system for automatically converting 2D mask object predictions and raw LiDAR point clouds into full 3D bounding boxes of objects.
Our method significantly outperforms previous work despite the fact that those methods use significantly more complex pipelines, 3D models and additional human-annotated external sources of prior information.
arXiv Detail & Related papers (2021-09-16T13:01:13Z)
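The "discounting LiDAR outliers" idea in the last entry, fitting a box only to the trustworthy bulk of the points, can be sketched with a simple percentile trim per axis. This is an illustrative stand-in, not the paper's actual multi-view formulation:

```python
import numpy as np

def trimmed_box(points, q=5.0):
    """Fit an axis-aligned box to the central [q, 100-q] percentile span
    of the points along each axis, so stray background/ghost returns do
    not inflate the box. Returns (center, size)."""
    lo = np.percentile(points, q, axis=0)
    hi = np.percentile(points, 100.0 - q, axis=0)
    return (lo + hi) / 2.0, hi - lo
```

Trimming is a blunt instrument compared with reasoning about outliers jointly across objects and views, but it shows why naive min/max box fitting fails on raw LiDAR.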
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.