Related papers: TUN3D: Towards Real-World Scene Understanding from Unposed Images

TUN3D: Towards Real-World Scene Understanding from Unposed Images

URL: http://arxiv.org/abs/2509.21388v1
Date: Tue, 23 Sep 2025 20:24:07 GMT
Title: TUN3D: Towards Real-World Scene Understanding from Unposed Images
Authors: Anton Konushin, Nikita Drozdov, Bulat Gabdullin, Alexey Zakharov, Anna Vorontsova, Danila Rukhovich, Maksim Kolodiazhnyi,
Abstract summary: TUN3D is a new method that tackles joint layout estimation and 3D object detection in real scans.<n>It does not require ground-truth camera poses or depth supervision.<n>It achieves state-of-the-art performance across three challenging scene understanding benchmarks.
Score: 11.23080017635425
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding. Code is available at https://github.com/col14m/tun3d .

Related papers

Z3D: Zero-Shot 3D Visual Grounding from Images [7.756226313216256]
3D visual grounding (3DVG) aims to localize objects in a 3D scene based on natural language queries.<n>We introduce Z3D, a universal grounding pipeline that flexibly operates on multi-view images.
arXiv Detail & Related papers (2026-02-03T10:35:18Z)
Sparse Multiview Open-Vocabulary 3D Detection [27.57172918603858]
3D object detection has traditionally been solved by training to detect a fixed set of categories.<n>In this work, we investigate open-vocabulary 3D object detection in the challenging yet practical sparse-view setting.<n>Our approach is training-free, relying on pre-trained, off-the-shelf 2D foundation models instead of employing computationally expensive 3D feature fusion.
arXiv Detail & Related papers (2025-09-19T12:22:24Z)
3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection [62.57179069154312]
We introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD)<n>We lift the open-set 2D detection into 3D space through our designed 3D bounding box head.<n>We condition the object queries with geometry prior and overcome the generalization for 3D estimation across diverse scenes.
arXiv Detail & Related papers (2025-07-31T13:56:41Z)
SeSame: Simple, Easy 3D Object Detection with Point-Wise Semantics [0.7373617024876725]
In autonomous driving, 3D object detection provides more precise information for downstream tasks, including path planning and motion estimation. We propose SeSame: a method aimed at enhancing semantic information in existing LiDAR-only based 3D object detection. Experiments demonstrate the effectiveness of our method with performance improvements on the KITTI object detection benchmark.
arXiv Detail & Related papers (2024-03-11T08:17:56Z)
3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images. We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image. We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z)
SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection [19.75965521357068]
We propose a novel approach called SOGDet (Semantic-Occupancy Guided Multi-view 3D Object Detection) to improve the accuracy of 3D object detection. Our results show that SOGDet consistently enhance the performance of three baseline methods in terms of nuScenes Detection Score (NDS) and mean Average Precision (mAP) This indicates that the combination of 3D object detection and 3D semantic occupancy leads to a more comprehensive perception of the 3D environment, thereby aiding build more robust autonomous driving systems.
arXiv Detail & Related papers (2023-08-26T07:38:21Z)
Perspective-aware Convolution for Monocular 3D Object Detection [2.33877878310217]
We propose a novel perspective-aware convolutional layer that captures long-range dependencies in images. By enforcing convolutional kernels to extract features along the depth axis of every image pixel, we incorporates perspective information into network architecture. We demonstrate improved performance on the KITTI3D dataset, achieving a 23.9% average precision in the easy benchmark.
arXiv Detail & Related papers (2023-08-24T17:25:36Z)
3D Small Object Detection with Dynamic Spatial Pruning [62.72638845817799]
We propose an efficient feature pruning strategy for 3D small object detection. We present a multi-level 3D detector named DSPDet3D which benefits from high spatial resolution. It takes less than 2s to directly process a whole building consisting of more than 4500k points while detecting out almost all objects.
arXiv Detail & Related papers (2023-05-05T17:57:04Z)
SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving. We propose a SurroundOcc method to predict the 3D occupancy with multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)
Monocular 3D Object Detection with Depth from Motion [74.29588921594853]
We take advantage of camera ego-motion for accurate object depth estimation and detection. Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features to the 3D space and detects 3D objects thereon. Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark.
arXiv Detail & Related papers (2022-07-26T15:48:46Z)
Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection [17.526914782562528]
We propose Graph-DETR3D to automatically aggregate multi-view imagery information through graph structure learning (GSL) Our best model achieves 49.5 NDS on the nuScenes test leaderboard, achieving new state-of-the-art in comparison with various published image-view 3D object detectors.
arXiv Detail & Related papers (2022-04-25T12:10:34Z)
PLUME: Efficient 3D Object Detection from Stereo Images [95.31278688164646]
Existing methods tackle the problem in two steps: first depth estimation is performed, a pseudo LiDAR point cloud representation is computed from the depth estimates, and then object detection is performed in 3D space. We propose a model that unifies these two tasks in the same metric space. Our approach achieves state-of-the-art performance on the challenging KITTI benchmark, with significantly reduced inference time compared with existing methods.
arXiv Detail & Related papers (2021-01-17T05:11:38Z)
Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras. We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points. Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.