Related papers: UniDet3D: Multi-dataset Indoor 3D Object Detection

URL: http://arxiv.org/abs/2409.04234v1
Date: Fri, 6 Sep 2024 12:40:19 GMT
Title: UniDet3D: Multi-dataset Indoor 3D Object Detection
Authors: Maksim Kolodiazhnyi, Anna Vorontsova, Matvey Skripkin, Danila Rukhovich, Anton Konushin
Abstract summary: UniDet3D is a simple yet effective 3D object detection model. It is trained on a mixture of indoor datasets and is capable of working in various indoor environments.
Score: 4.718582862677851
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Growing customer demand for smart solutions in robotics and augmented reality has attracted considerable attention to 3D object detection from point clouds. Yet, existing indoor datasets taken individually are too small and insufficiently diverse to train a powerful and general 3D object detection model. In the meantime, more general approaches utilizing foundation models are still inferior in quality to those based on supervised training for a specific task. In this work, we propose UniDet3D, a simple yet effective 3D object detection model, which is trained on a mixture of indoor datasets and is capable of working in various indoor environments. By unifying different label spaces, UniDet3D enables learning a strong representation across multiple datasets through a supervised joint training scheme. The proposed network architecture is built upon a vanilla transformer encoder, making it easy to run, customize and extend the prediction pipeline for practical use. Extensive experiments demonstrate that UniDet3D obtains significant gains over existing 3D object detection methods in 6 indoor benchmarks: ScanNet (+1.1 mAP50), ARKitScenes (+19.4 mAP25), S3DIS (+9.1 mAP50), MultiScan (+9.3 mAP50), 3RScan (+3.2 mAP50), and ScanNet++ (+2.7 mAP50). Code is available at https://github.com/filapro/unidet3d .

Related papers

3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection [58.78881632019072]
We introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We lift the open-set 2D detection into 3D space through our designed 3D bounding box head. We condition the object queries with geometry prior and overcome the generalization for 3D estimation across diverse scenes.
arXiv Detail & Related papers (2025-07-31T13:56:41Z)
Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D [68.23391872643268]
LOCATE 3D is a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp". It operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices.
arXiv Detail & Related papers (2025-04-19T02:51:24Z)
Sparse Points to Dense Clouds: Enhancing 3D Detection with Limited LiDAR Data [68.18735997052265]
We propose a balanced approach that combines the advantages of monocular and point cloud-based 3D detection. Our method requires only a small number of 3D points that can be obtained from a low-cost, low-resolution sensor. The accuracy of 3D detection improves by 20% compared to the state-of-the-art monocular detection methods.
arXiv Detail & Related papers (2024-04-10T03:54:53Z)
M&M3D: Multi-Dataset Training and Efficient Network for Multi-view 3D Object Detection [2.5158048364984564]
I propose a network structure for multi-view 3D object detection using camera-only data and a Bird's-Eye-View map. My work addresses a current key challenge: domain adaptation and visual data transfer. My study uses 3D information as available semantic information and blends 2D multi-view image features into the visual-language transfer design.
arXiv Detail & Related papers (2023-11-02T04:28:51Z)
Towards Robust Robot 3D Perception in Urban Environments: The UT Campus Object Dataset [7.665779592030094]
CODa is a mobile robot egocentric perception dataset collected on the University of Texas Austin Campus. Our dataset contains 8.5 hours of multimodal sensor data: synchronized 3D point clouds and stereo RGB video from a 128-channel 3D LiDAR and two 1.25MP RGB cameras at 10 fps. We provide 58 minutes of ground-truth annotations containing 1.3 million 3D bounding boxes with instance IDs for 53 semantic classes, 5000 frames of 3D semantic annotations for urban terrain.
arXiv Detail & Related papers (2023-09-24T04:43:39Z)
FocalFormer3D: Focusing on Hard Instance for 3D Object Detection [97.56185033488168]
False negatives (FN) in 3D object detection can lead to potentially dangerous situations in autonomous driving. In this work, we propose Hard Instance Probing (HIP), a general pipeline that identifies FN in a multi-stage manner. We instantiate this method as FocalFormer3D, a simple yet effective detector that excels at excavating difficult objects.
arXiv Detail & Related papers (2023-08-08T20:06:12Z)
Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding [40.68012530554327]
We introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity. A series of extensive ablation studies further validate the scalability, generality, and superior performance enabled by our approach.
arXiv Detail & Related papers (2023-04-14T02:49:08Z)
ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small number of annotated data and a pre-defined set of categories. Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language. We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z)
FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection [3.330229314824913]
We present FCAF3D - a first-in-class fully convolutional anchor-free indoor 3D object detection method. It is a simple yet effective method that uses a voxel representation of a point cloud and processes voxels with sparse convolutions. It can handle large-scale scenes with minimal runtime through a single fully convolutional feed-forward pass.
arXiv Detail & Related papers (2021-12-01T07:28:52Z)
Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified and learning-based approach to the 3D MOT problem. We employ a Neural Message Passing network for data association that is fully trainable. We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
arXiv Detail & Related papers (2021-04-23T17:59:28Z)
Self-Supervised Pretraining of 3D Features on any Point-Cloud [40.26575888582241]
We present a simple self-supervised pretraining method that can work with any 3D data without 3D registration. We evaluate our models on 9 benchmarks for object detection, semantic segmentation, and object classification, where they achieve state-of-the-art results and can outperform supervised pretraining.
arXiv Detail & Related papers (2021-01-07T18:55:21Z)
Weakly Supervised 3D Object Detection from Lidar Point Cloud [182.67704224113862]
It is laborious to manually label point cloud data for training high-quality 3D object detectors. This work proposes a weakly supervised approach for 3D object detection, only requiring a small set of weakly annotated scenes. Using only 500 weakly annotated scenes and 534 precisely labeled vehicle instances, our method achieves 85-95% of the performance of current top-leading, fully supervised detectors.
arXiv Detail & Related papers (2020-07-23T10:12:46Z)
D3Feat: Joint Learning of Dense Detection and Description of 3D Local Features [51.04841465193678]
We leverage a 3D fully convolutional network for 3D point clouds. We propose a novel and practical learning mechanism that densely predicts both a detection score and a description feature for each 3D point. Our method achieves state-of-the-art results in both indoor and outdoor scenarios.
arXiv Detail & Related papers (2020-03-06T12:51:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.