Related papers: OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation

URL: http://arxiv.org/abs/2403.19580v2
Date: Tue, 23 Jul 2024 02:20:00 GMT
Title: OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation
Authors: Zhenyu Wang, Yali Li, Taichi Liu, Hengshuang Zhao, Shengjin Wang
Abstract summary: OV-Uni3DETR achieves the state-of-the-art performance on various scenarios, surpassing existing methods by more than 6% on average. Code and pre-trained models will be released later.
Score: 67.56268991234371
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the current state of 3D object detection research, the severe scarcity of annotated 3D data, substantial disparities across different data modalities, and the absence of a unified architecture, have impeded the progress towards the goal of universality. In this paper, we propose OV-Uni3DETR, a unified open-vocabulary 3D detector via cycle-modality propagation. Compared with existing 3D detectors, OV-Uni3DETR offers distinct advantages: 1) Open-vocabulary 3D detection: During training, it leverages various accessible data, especially extensive 2D detection images, to boost training diversity. During inference, it can detect both seen and unseen classes. 2) Modality unifying: It seamlessly accommodates input data from any given modality, effectively addressing scenarios involving disparate modalities or missing sensor information, thereby supporting test-time modality switching. 3) Scene unifying: It provides a unified multi-modal model architecture for diverse scenes collected by distinct sensors. Specifically, we propose the cycle-modality propagation, aimed at propagating knowledge bridging 2D and 3D modalities, to support the aforementioned functionalities. 2D semantic knowledge from large-vocabulary learning guides novel class discovery in the 3D domain, and 3D geometric knowledge provides localization supervision for 2D detection images. OV-Uni3DETR achieves the state-of-the-art performance on various scenarios, surpassing existing methods by more than 6% on average. Its performance using only RGB images is on par with or even surpasses that of previous point cloud based methods. Code and pre-trained models will be released later.

Related papers

OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations [21.24895455233531]
We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. At inference, OpenM3D, a highly efficient detector, requires only multi-view images for input and demonstrates superior accuracy and speed.
arXiv Detail & Related papers (2025-08-27T17:17:00Z)
TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment [14.535056813802527]
Testing-time Distribution Alignment (TeDA) is a novel framework that adapts a pretrained 2D vision-language model CLIP for unknown 3D object retrieval at test time. TeDA projects 3D objects into multi-view images, extracts features using CLIP, and refines 3D query embeddings. Experiments on four open-set 3D object retrieval benchmarks demonstrate TeDA greatly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-05-05T02:47:07Z)
xMOD: Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion [4.878192303432336]
DIOD-3D is the first baseline for multi-object discovery in 3D data using 2D motion. xMOD is a cross-modal training framework that integrates 2D and 3D data while always using 2D motion cues. Our approach yields a substantial performance improvement compared with the 2D object discovery state-of-the-art on all datasets.
arXiv Detail & Related papers (2025-03-19T09:20:35Z)
DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation [49.32104127246474]
DriveGEN is a training-free, controllable Text-to-Image Diffusion Generation method. It consistently preserves objects with precise 3D geometry across diverse Out-of-Distribution generations.
arXiv Detail & Related papers (2025-03-14T06:35:38Z)
Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data [57.53523870705433]
We propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det. OVM3D-Det does not require high-precision LiDAR or 3D sensor data for either input or generating 3D bounding boxes. It employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors.
arXiv Detail & Related papers (2024-11-23T21:37:21Z)
ImOV3D: Learning Open-Vocabulary Point Clouds 3D Object Detection from Only 2D Images [19.02348585677397]
Open-vocabulary 3D object detection (OV-3Det) aims to generalize beyond the limited number of base categories labeled during the training phase. The biggest bottleneck is the scarcity of annotated 3D data, whereas 2D image datasets are abundant and richly annotated. We propose a novel framework ImOV3D to leverage pseudo multimodal representation containing both images and point clouds (PC) to close the modality gap.
arXiv Detail & Related papers (2024-10-31T15:02:05Z)
Dual-Perspective Knowledge Enrichment for Semi-Supervised 3D Object Detection [55.210991151015534]
We present a novel Dual-Perspective Knowledge Enrichment approach named DPKE for semi-supervised 3D object detection. Our DPKE enriches the knowledge of limited training data, particularly unlabeled data, from two perspectives: data-perspective and feature-perspective.
arXiv Detail & Related papers (2024-01-10T08:56:07Z)
3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images. We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image. We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z)
Homography Loss for Monocular 3D Object Detection [54.04870007473932]
A differentiable loss function, termed Homography Loss, is proposed to achieve the goal, which exploits both 2D and 3D information. Our method outperforms other state-of-the-art methods by a large margin on the KITTI 3D datasets.
arXiv Detail & Related papers (2022-04-02T03:48:03Z)
DetMatch: Two Teachers are Better Than One for Joint 2D and 3D Semi-Supervised Object Detection [29.722784254501768]
DetMatch is a flexible framework for joint semi-supervised learning on 2D and 3D modalities. By identifying objects detected in both sensors, our pipeline generates a cleaner, more robust set of pseudo-labels. We leverage the richer semantics of RGB images to rectify incorrect 3D class predictions and improve localization of 3D boxes.
arXiv Detail & Related papers (2022-03-17T17:58:00Z)
Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds [96.9027094562957]
We introduce a spatio-temporal representation learning (STRL) framework, capable of learning from unlabeled 3D point clouds. Inspired by how infants learn from visual data in the wild, we explore rich cues derived from the 3D data. STRL takes two temporally-related frames from a 3D point cloud sequence as the input, transforms them with spatial data augmentation, and learns an invariant representation in a self-supervised manner.
arXiv Detail & Related papers (2021-09-01T04:17:11Z)
Self-supervised Feature Learning by Cross-modality and Cross-view Correspondences [32.01548991331616]
This paper presents a novel self-supervised learning approach to learn both 2D image features and 3D point cloud features. It exploits cross-modality and cross-view correspondences without using any human-annotated labels. The effectiveness of the learned 2D and 3D features is evaluated by transferring them to five different tasks.
arXiv Detail & Related papers (2020-04-13T02:57:25Z)
SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation [3.1542695050861544]
Estimating 3D orientation and translation of objects is essential for infrastructure-less autonomous navigation and driving. We propose a novel 3D object detection method, named SMOKE, that combines a single keypoint estimate with regressed 3D variables. Despite its structural simplicity, our proposed SMOKE network outperforms all existing monocular 3D detection methods on the KITTI dataset.
arXiv Detail & Related papers (2020-02-24T08:15:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.