A Unified 3D Object Perception Framework for Real-Time Outside-In Multi-Camera Systems
- URL: http://arxiv.org/abs/2601.10819v1
- Date: Thu, 15 Jan 2026 19:31:37 GMT
- Title: A Unified 3D Object Perception Framework for Real-Time Outside-In Multi-Camera Systems
- Authors: Yizhou Wang, Sameer Pusegaonkar, Yuxing Wang, Anqi Li, Vishal Kumar, Chetan Sethi, Ganapathy Aiyer, Yun He, Kartikay Thakkar, Swapnil Rathi, Bhushan Rupde, Zheng Tang, Sujit Biswas,
- Abstract summary: We present an adapted Sparse4D framework specifically optimized for large-scale infrastructure environments. We employ a generative data augmentation strategy using the NVIDIA COSMOS framework to bridge the Sim2Real domain gap. Evaluated on the AI City Challenge 2025 benchmark, our camera-only framework achieves a state-of-the-art HOTA of $45.22$.
- Score: 16.644881371951175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate 3D object perception and multi-target multi-camera (MTMC) tracking are fundamental for the digital transformation of industrial infrastructure. However, transitioning "inside-out" autonomous driving models to "outside-in" static camera networks presents significant challenges due to heterogeneous camera placements and extreme occlusion. In this paper, we present an adapted Sparse4D framework specifically optimized for large-scale infrastructure environments. Our system leverages absolute world-coordinate geometric priors and introduces an occlusion-aware ReID embedding module to maintain identity stability across distributed sensor networks. To bridge the Sim2Real domain gap without manual labeling, we employ a generative data augmentation strategy using the NVIDIA COSMOS framework, creating diverse environmental styles that enhance the model's appearance-invariance. Evaluated on the AI City Challenge 2025 benchmark, our camera-only framework achieves a state-of-the-art HOTA of $45.22$. Furthermore, we address real-time deployment constraints by developing an optimized TensorRT plugin for Multi-Scale Deformable Aggregation (MSDA). Our hardware-accelerated implementation achieves a $2.15\times$ speedup on modern GPU architectures, enabling a single Blackwell-class GPU to support over 64 concurrent camera streams.
Related papers
- UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models [54.564740558030245]
We present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. We also introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting.
arXiv Detail & Related papers (2026-02-26T12:54:46Z)
- FlexMap: Generalized HD Map Construction from Flexible Camera Configurations [29.3161377210518]
High-definition (HD) maps provide essential semantic information of road structures for autonomous driving systems. Current HD map construction methods require calibrated multi-camera setups and implicit or explicit 2D-to-BEV transformations. We introduce FlexMap: unlike prior methods that are fixed to a specific N-camera rig, our approach adapts to variable camera configurations without any architectural changes.
arXiv Detail & Related papers (2026-01-29T22:41:11Z)
- MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception. High computational cost limits adoption on resource-constrained platforms. We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-10-16T18:00:00Z)
- WorldMirror: Universal 3D World Reconstruction with Any-Prior Prompting [51.69408870574092]
We present WorldMirror, an all-in-one, feed-forward model for versatile 3D geometric prediction tasks. Our framework flexibly integrates diverse geometric priors, including camera poses, intrinsics, and depth maps. WorldMirror achieves state-of-the-art performance across diverse benchmarks from camera, point map, depth, and surface normal estimation to novel view synthesis.
arXiv Detail & Related papers (2025-10-12T17:59:09Z)
- MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion [13.24058110580706]
We propose a novel global motion averaging framework for multi-camera systems. Our system matches or exceeds incremental SfM accuracy while significantly improving efficiency.
arXiv Detail & Related papers (2025-07-04T05:25:00Z)
- S3MOT: Monocular 3D Object Tracking with Selective State Space Model [3.5047603107971397]
Multi-object tracking in 3D space is essential for advancing robotics and computer applications. It remains a significant challenge in monocular setups due to the difficulty of mining 3D associations from 2D video streams. We present three innovative techniques to enhance the fusion of heterogeneous cues for monocular 3D MOT.
arXiv Detail & Related papers (2025-04-25T04:45:35Z)
- FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video [52.33896173943054]
Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications. Existing methods rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings. We propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction.
arXiv Detail & Related papers (2025-03-29T14:26:06Z)
- Multi-modal Multi-platform Person Re-Identification: Benchmark and Method [58.59888754340054]
MP-ReID is a novel dataset designed specifically for multi-modality and multi-platform ReID. This benchmark compiles data from 1,930 identities across diverse modalities, including RGB, infrared, and thermal imaging. We introduce Uni-Prompt ReID, a framework with specific-designed prompts, tailored for cross-modality and cross-platform scenarios.
arXiv Detail & Related papers (2025-03-21T12:27:49Z)
- MetaOcc: Spatio-Temporal Fusion of Surround-View 4D Radar and Camera for 3D Occupancy Prediction with Dual Training Strategies [12.485905108032146]
This paper introduces MetaOcc, a novel multi-modal framework for omni-oriented 3D occupancy prediction. To address the limitations of directly applying encoders to sparse radar data, we propose a Radar Height Self-Attention module. To reduce reliance on expensive point cloud, we propose a pseudo-label generation pipeline based on an open-set segmentor.
arXiv Detail & Related papers (2025-01-26T03:51:56Z)
- VICAN: Very Efficient Calibration Algorithm for Large Camera Networks [49.17165360280794]
We introduce a novel methodology that extends Pose Graph Optimization techniques.
We consider the bipartite graph encompassing cameras, object poses evolving dynamically, and camera-object relative transformations at each time step.
Our framework retains compatibility with traditional PGO solvers, but its efficacy benefits from a custom-tailored optimization scheme.
arXiv Detail & Related papers (2024-03-25T17:47:03Z)
- BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [105.96557764248846]
We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower cost.
arXiv Detail & Related papers (2022-05-26T17:59:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.