Related papers: MOSE: Boosting Vision-based Roadside 3D Object Detection with Scene Cues

MOSE: Boosting Vision-based Roadside 3D Object Detection with Scene Cues

URL: http://arxiv.org/abs/2404.05280v1
Date: Mon, 8 Apr 2024 08:11:56 GMT
Title: MOSE: Boosting Vision-based Roadside 3D Object Detection with Scene Cues
Authors: Xiahan Chen, Mingjian Chen, Sanli Tang, Yi Niu, Jiang Zhu,
Abstract summary: We propose a novel framework, namely MOSE, for MOnocular 3D object detection with Scene cuEs. A scene cue bank is designed to aggregate scene cues from multiple frames of the same scene. A transformer-based decoder lifts the aggregated scene cues as well as the 3D position embeddings for 3D object location.
Score: 12.508548561872553
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D object detection based on roadside cameras is an additional way for autonomous driving to alleviate the challenges of occlusion and short perception range from vehicle cameras. Previous methods for roadside 3D object detection mainly focus on modeling the depth or height of objects, neglecting the stationary of cameras and the characteristic of inter-frame consistency. In this work, we propose a novel framework, namely MOSE, for MOnocular 3D object detection with Scene cuEs. The scene cues are the frame-invariant scene-specific features, which are crucial for object localization and can be intuitively regarded as the height between the surface of the real road and the virtual ground plane. In the proposed framework, a scene cue bank is designed to aggregate scene cues from multiple frames of the same scene with a carefully designed extrinsic augmentation strategy. Then, a transformer-based decoder lifts the aggregated scene cues as well as the 3D position embeddings for 3D object location, which boosts generalization ability in heterologous scenes. The extensive experiment results on two public benchmarks demonstrate the state-of-the-art performance of the proposed method, which surpasses the existing methods by a large margin.

Related papers

TUN3D: Towards Real-World Scene Understanding from Unposed Images [11.23080017635425]
TUN3D is a new method that tackles joint layout estimation and 3D object detection in real scans.<n>It does not require ground-truth camera poses or depth supervision.<n>It achieves state-of-the-art performance across three challenging scene understanding benchmarks.
arXiv Detail & Related papers (2025-09-23T20:24:07Z)
2.5D Object Detection for Intelligent Roadside Infrastructure [37.07785188366053]
We introduce a 2.5D object detection framework for infrastructure roadside-mounted cameras.<n>We employ a prediction approach to detect ground planes of vehicles as parallelograms in the image frame.<n>Our results show high detection accuracy, strong cross-viewpoint generalization, and robustness to diverse lighting and weather conditions.
arXiv Detail & Related papers (2025-07-04T13:16:59Z)
Rooms from Motion: Un-posed Indoor 3D Object Detection as Localization and Mapping [8.332670136772558]
We revisit scene-level 3D object detection as the output of an object-centric framework capable of both localization and mapping.<n>By replacing the standard 2D keypoint-based matcher of structure-from-motion with an object-centric matcher based on image-derived 3D boxes, we estimate metric camera poses, object tracks, and finally produce a global, semantic 3D object map.
arXiv Detail & Related papers (2025-05-29T17:59:45Z)
Towards Flexible 3D Perception: Object-Centric Occupancy Completion Augments 3D Object Detection [54.78470057491049]
Occupancy has emerged as a promising alternative for 3D scene perception. We introduce object-centric occupancy as a supplement to object bboxes. We show that our occupancy features significantly enhance the detection results of state-of-the-art 3D object detectors.
arXiv Detail & Related papers (2024-12-06T16:12:38Z)
CoBEV: Elevating Roadside 3D Object Detection with Depth and Height Complementarity [34.025530326420146]
We develop Complementary-BEV, a novel end-to-end monocular 3D object detection framework. We conduct extensive experiments on the public 3D detection benchmarks of roadside camera-based DAIR-V2X-I and Rope3D. For the first time, the vehicle AP score of a camera model reaches 80% on DAIR-V2X-I in terms of easy mode.
arXiv Detail & Related papers (2023-10-04T13:38:53Z)
MagicDrive: Street View Generation with Diverse 3D Geometry Control [82.69871576797166]
We introduce MagicDrive, a novel street view generation framework, offering diverse 3D geometry controls. Our design incorporates a cross-view attention module, ensuring consistency across multiple camera views.
arXiv Detail & Related papers (2023-10-04T06:14:06Z)
Perspective-aware Convolution for Monocular 3D Object Detection [2.33877878310217]
We propose a novel perspective-aware convolutional layer that captures long-range dependencies in images. By enforcing convolutional kernels to extract features along the depth axis of every image pixel, we incorporates perspective information into network architecture. We demonstrate improved performance on the KITTI3D dataset, achieving a 23.9% average precision in the easy benchmark.
arXiv Detail & Related papers (2023-08-24T17:25:36Z)
NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization [80.3424839706698]
We present NeurOCS, a framework that uses instance masks 3D boxes as input to learn 3D object shapes by means of differentiable rendering. Our approach rests on insights in learning a category-level shape prior directly from real driving scenes. We make critical design choices to learn object coordinates more effectively from an object-centric view.
arXiv Detail & Related papers (2023-05-28T16:18:41Z)
OA-DET3D: Embedding Object Awareness as a General Plug-in for Multi-Camera 3D Object Detection [77.43427778037203]
We introduce OA-DET3D, a plug-in module that improves 3D object detection.<n> OA-DET3D boosts the representation of objects by leveraging object-centric depth information and foreground pseudo points.<n>We conduct extensive experiments on the nuScenes dataset and Argoverse 2 dataset to validate the merits of OA-DET3D.
arXiv Detail & Related papers (2023-01-13T06:02:31Z)
Rope3D: TheRoadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task [48.555440807415664]
We present the first high-diversity challenging Roadside Perception 3D dataset- Rope3D from a novel view. The dataset consists of 50k images and over 1.5M 3D objects in various scenes. We propose to leverage the geometry constraint to solve the inherent ambiguities caused by various sensors, viewpoints.
arXiv Detail & Related papers (2022-03-25T12:13:23Z)
M3DSSD: Monocular 3D Single Stage Object Detector [82.25793227026443]
We propose a Monocular 3D Single Stage object Detector (M3DSSD) with feature alignment and asymmetric non-local attention. The proposed M3DSSD achieves significantly better performance than the monocular 3D object detection methods on the KITTI dataset.
arXiv Detail & Related papers (2021-03-24T13:09:11Z)
Integration of the 3D Environment for UAV Onboard Visual Object Tracking [7.652259812856325]
Single visual object tracking from an unmanned aerial vehicle poses fundamental challenges. We introduce a pipeline that combines a model-free visual object tracker, a sparse 3D reconstruction, and a state estimator. By representing the position of the target in 3D space rather than in image space, we stabilize the tracking during ego-motion.
arXiv Detail & Related papers (2020-08-06T18:37:29Z)
Kinematic 3D Object Detection in Monocular Video [123.7119180923524]
We propose a novel method for monocular video-based 3D object detection which carefully leverages kinematic motion to improve precision of 3D localization. We achieve state-of-the-art performance on monocular 3D object detection and the Bird's Eye View tasks within the KITTI self-driving dataset.
arXiv Detail & Related papers (2020-07-19T01:15:12Z)
Single View Metrology in the Wild [94.7005246862618]
We present a novel approach to single view metrology that can recover the absolute scale of a scene represented by 3D heights of objects or camera height above the ground. Our method relies on data-driven priors learned by a deep network specifically designed to imbibe weakly supervised constraints from the interplay of the unknown camera with 3D entities such as object heights. We demonstrate state-of-the-art qualitative and quantitative results on several datasets as well as applications including virtual object insertion.
arXiv Detail & Related papers (2020-07-18T22:31:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.