Model Optimization for Multi-Camera 3D Detection and Tracking
- URL: http://arxiv.org/abs/2602.00450v2
- Date: Tue, 03 Feb 2026 17:47:07 GMT
- Title: Model Optimization for Multi-Camera 3D Detection and Tracking
- Authors: Ethan Anderson, Justin Silva, Kyle Zheng, Sameer Pusegaonkar, Yizhou Wang, Zheng Tang, Sujit Biswas,
- Abstract summary: Outside-in multi-camera perception is increasingly important in indoor environments. We evaluate Sparse4D, a query-based 3D detection and tracking framework. We study reduced input frame rates, post-training quantization, transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning.
- Score: 13.756560739163362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Outside-in multi-camera perception is increasingly important in indoor environments, where networks of static cameras must support multi-target tracking under occlusion and heterogeneous viewpoints. We evaluate Sparse4D, a query-based spatiotemporal 3D detection and tracking framework that fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. We study reduced input frame rates, post-training quantization (INT8 and FP8), transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning. To better capture identity stability, we report Average Track Duration (AvgTrackDur), which measures identity persistence in seconds. Sparse4D remains stable under moderate FPS reductions, but below 2 FPS, identity association collapses even when detections are stable. Selective quantization of the backbone and neck offers the best speed-accuracy trade-off, while attention-related modules are consistently sensitive to low precision. On WILDTRACK, low-FPS pretraining yields large zero-shot gains over the base checkpoint, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency and improves camera scalability, but can destabilize identity propagation, motivating stability-aware validation.
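The abstract reports Average Track Duration (AvgTrackDur) as identity persistence in seconds. A minimal sketch of how such a metric could be computed from per-frame track IDs and a known frame rate (the function name and interface are our assumptions, not the paper's):

```python
def avg_track_duration(frames, fps):
    """Average identity persistence in seconds.

    frames: sequence of sets of track IDs visible in each frame.
    fps: frames per second of the input stream.
    Each maximal run of consecutive frames containing an ID counts as
    one track segment; we average segment lengths over all segments.
    """
    durations = []  # lengths (in frames) of all closed track segments
    active = {}     # track_id -> length of its currently open segment
    for frame_ids in frames:
        frame_ids = set(frame_ids)
        # extend or start segments for IDs present in this frame
        for tid in frame_ids:
            active[tid] = active.get(tid, 0) + 1
        # close segments for IDs that disappeared
        for tid in list(active):
            if tid not in frame_ids:
                durations.append(active.pop(tid))
    durations.extend(active.values())  # close segments still open at the end
    if not durations:
        return 0.0
    return sum(durations) / len(durations) / fps
```

Under this definition, an identity switch splits one long segment into several short ones, so AvgTrackDur drops even when per-frame detections stay stable, which matches the paper's motivation for the metric.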
Related papers
- SOTFormer: A Minimal Transformer for Unified Object Tracking and Trajectory Prediction [3.08657139423562]
We introduce SOTFormer, a minimal constant-memory temporal transformer. It unifies object detection, tracking, and short-horizon trajectory prediction within a single end-to-end framework. On the Mini-LaSOT (20%) benchmark, SOTFormer attains 76.3 AUC and 53.7 FPS (AMP, 4.3 GB VRAM).
arXiv Detail & Related papers (2025-11-14T19:25:05Z)
- Color-Pair Guided Robust Zero-Shot 6D Pose Estimation and Tracking of Cluttered Objects on Edge Devices [4.261261166281339]
We present a unified framework explicitly designed for efficient execution on edge devices. Key to our approach is a shared, lighting-invariant color-pair feature representation. For initial estimation, this feature facilitates robust registration between the live RGB-D view and the object's 3D mesh. For tracking, the same feature logic validates temporal correspondences, enabling a lightweight model to reliably regress the object's motion.
arXiv Detail & Related papers (2025-09-28T05:07:49Z)
- Sparse BEV Fusion with Self-View Consistency for Multi-View Detection and Tracking [15.680801582969393]
We propose SCFusion, a framework that combines three techniques to improve multi-view feature integration. SCFusion achieves state-of-the-art performance, reaching an IDF1 score of 95.9% on WildTrack and a MODP of 89.2% on MultiviewX.
arXiv Detail & Related papers (2025-09-10T09:06:41Z)
- An End-to-End Framework for Video Multi-Person Pose Estimation [3.090225730976977]
We propose VEPE (Video End-to-End Pose Estimation), a simple and flexible framework for end-to-end pose estimation in video. We show that our approach outperforms two-stage models and improves inference speed by 300%.
arXiv Detail & Related papers (2025-09-01T03:34:57Z)
- Reliability-Driven LiDAR-Camera Fusion for Robust 3D Object Detection [0.0]
We propose ReliFusion, a LiDAR-camera fusion framework operating in the bird's-eye view (BEV) space. ReliFusion integrates three key components: the Spatio-Temporal Feature Aggregation (STFA) module, the Reliability module, and the Confidence-Weighted Mutual Cross-Attention (CW-MCA) module. Experiments on the nuScenes dataset show that ReliFusion significantly outperforms state-of-the-art methods, achieving superior robustness and accuracy in scenarios with limited LiDAR fields of view and severe sensor malfunctions.
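The reliability-weighted fusion idea in this entry can be illustrated with a toy sketch (the function and its interface are illustrative assumptions, not ReliFusion's actual CW-MCA module): two BEV feature vectors are blended by per-sensor reliability scores so that a failing sensor contributes little.

```python
def reliability_weighted_fusion(lidar_feat, cam_feat, lidar_conf, cam_conf):
    """Fuse two BEV feature vectors by predicted per-sensor reliability.

    lidar_conf / cam_conf are scalar reliability scores in [0, 1]
    (e.g. from a small reliability head). Weights are renormalized so
    a sensor with confidence near 0 contributes almost nothing.
    """
    total = lidar_conf + cam_conf
    if total == 0.0:                 # both unreliable: fall back to the mean
        w_l = w_c = 0.5
    else:
        w_l, w_c = lidar_conf / total, cam_conf / total
    return [w_l * l + w_c * c for l, c in zip(lidar_feat, cam_feat)]
```

With `cam_conf = 0` (e.g. a blinded camera), the fused features reduce to the LiDAR features, which is the robustness behavior the abstract describes for sensor malfunctions.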
arXiv Detail & Related papers (2025-02-03T22:07:14Z)
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- Minimum Latency Deep Online Video Stabilization [77.68990069996939]
We present a novel camera path optimization framework for the task of online video stabilization.
In this work, we adopt recent off-the-shelf high-quality deep motion models for motion estimation to recover the camera trajectory.
Our approach significantly outperforms state-of-the-art online methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-12-05T07:37:32Z) - Uncertainty-Aware Camera Pose Estimation from Points and Lines [101.03675842534415]
Perspective-n-Point-and-Line (PnPL) aims at fast, accurate and robust camera localization with respect to a 3D model from 2D-3D feature coordinates.
arXiv Detail & Related papers (2021-07-08T15:19:36Z) - Self-Supervised Multi-Frame Monocular Scene Flow [61.588808225321735]
We introduce a multi-frame monocular scene flow network based on self-supervised learning.
We observe state-of-the-art accuracy among monocular scene flow methods based on self-supervised learning.
arXiv Detail & Related papers (2021-05-05T17:49:55Z) - Towards Fast, Accurate and Stable 3D Dense Face Alignment [73.01620081047336]
We propose a novel regression framework named 3DDFA-V2 which makes a balance among speed, accuracy and stability.
We present a virtual synthesis method to transform one still image to a short-video which incorporates in-plane and out-of-plane face moving.
arXiv Detail & Related papers (2020-09-21T15:37:37Z) - Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them; however, the probability of drawing effective samples in 3D space is relatively small.
We propose to start with an initial prediction and refine it gradually towards the ground truth, changing only one 3D parameter in each step.
This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it.
arXiv Detail & Related papers (2020-08-31T17:10:48Z)
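The one-parameter-per-step refinement described in the last entry can be illustrated with a toy greedy loop (a stand-in for the learned RL policy; the scoring function, step size, and function name are illustrative assumptions, not the paper's method):

```python
def refine_box(box, score_fn, step=0.1, max_iters=50):
    """Greedily refine a 3D box, changing one parameter per step.

    box: list of box parameters, e.g. [x, y, z, w, h, l, yaw].
    score_fn: higher is better (stands in for the learned reward, e.g. 3D IoU).
    At each iteration, try +/- step on every single parameter and keep
    the one-parameter change that improves the score the most.
    """
    box = list(box)
    best = score_fn(box)
    for _ in range(max_iters):
        candidate, cand_score = None, best
        for i in range(len(box)):
            for delta in (step, -step):
                trial = list(box)
                trial[i] += delta
                s = score_fn(trial)
                if s > cand_score:
                    candidate, cand_score = trial, s
        if candidate is None:       # no single-parameter move helps: stop
            break
        box, best = candidate, cand_score
    return box
```

The paper replaces this greedy search with a learned policy trained by reinforcement learning, since the true reward (agreement with the ground truth) is only observable after several steps.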
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.