Li-ViP3D++: Query-Gated Deformable Camera-LiDAR Fusion for End-to-End Perception and Trajectory Prediction
- URL: http://arxiv.org/abs/2601.20720v1
- Date: Wed, 28 Jan 2026 15:53:32 GMT
- Title: Li-ViP3D++: Query-Gated Deformable Camera-LiDAR Fusion for End-to-End Perception and Trajectory Prediction
- Authors: Matej Halinkovic, Nina Masarykova, Alexey Vinel, Marek Galinski,
- Abstract summary: Li-ViP3D++ is a query-based attention framework for end-to-end perception and trajectory prediction from raw sensor data.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: End-to-end perception and trajectory prediction from raw sensor data is one of the key capabilities for autonomous driving. Modular pipelines restrict information flow and can amplify upstream errors. Recent query-based, fully differentiable perception-and-prediction (PnP) models mitigate these issues, yet the complementarity of cameras and LiDAR in the query-space has not been sufficiently explored. Models often rely on fusion schemes that introduce heuristic alignment and discrete selection steps which prevent full utilization of available information and can introduce unwanted bias. We propose Li-ViP3D++, a query-based multimodal PnP framework that introduces Query-Gated Deformable Fusion (QGDF) to integrate multi-view RGB and LiDAR in query space. QGDF (i) aggregates image evidence via masked attention across cameras and feature levels, (ii) extracts LiDAR context through fully differentiable BEV sampling with learned per-query offsets, and (iii) applies query-conditioned gating to adaptively weight visual and geometric cues per agent. The resulting architecture jointly optimizes detection, tracking, and multi-hypothesis trajectory forecasting in a single end-to-end model. On nuScenes, Li-ViP3D++ improves end-to-end behavior and detection quality, achieving higher EPA (0.335) and mAP (0.502) while substantially reducing false positives (FP ratio 0.147), and it is faster than the prior Li-ViP3D variant (139.82 ms vs. 145.91 ms). These results indicate that query-space, fully differentiable camera-LiDAR fusion can increase robustness of end-to-end PnP without sacrificing deployability.
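To make the three QGDF ingredients concrete, here is a minimal, hypothetical PyTorch sketch of the fusion step: masked cross-camera attention, differentiable BEV sampling at learned per-query offsets, and query-conditioned gating. All module names, tensor shapes, the single feature level, and the fixed offset scale are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QGDFSketch(nn.Module):
    """Toy query-gated deformable fusion over per-query camera features and a
    single-level LiDAR BEV map (shapes and scales are illustrative)."""

    def __init__(self, dim=256, num_offsets=4):
        super().__init__()
        self.num_offsets = num_offsets
        # (ii) learned per-query 2D offsets for differentiable BEV sampling
        self.offset_head = nn.Linear(dim, num_offsets * 2)
        # (iii) query-conditioned gate over visual vs. geometric context
        self.gate_head = nn.Linear(dim, 2)

    def forward(self, queries, ref_xy, cam_feats, cam_mask, bev_feats):
        # queries:   (B, Q, C)     agent queries
        # ref_xy:    (B, Q, 2)     reference points in [-1, 1] BEV coordinates
        # cam_feats: (B, Q, V, C)  image evidence gathered per query from V views
        # cam_mask:  (B, Q, V)     True where a query projects into a view
        # bev_feats: (B, C, H, W)  rasterized LiDAR BEV features
        B, Q, C = queries.shape

        # (i) masked attention across camera views (one feature level for brevity)
        scores = (cam_feats * queries.unsqueeze(2)).sum(-1) / C ** 0.5  # (B, Q, V)
        attn = scores.masked_fill(~cam_mask, -1e4).softmax(dim=-1)
        img_ctx = (attn.unsqueeze(-1) * cam_feats).sum(2)               # (B, Q, C)

        # (ii) fully differentiable BEV sampling at reference + learned offsets
        offsets = 0.1 * torch.tanh(self.offset_head(queries))           # bounded
        grid = ref_xy.unsqueeze(2) + offsets.view(B, Q, self.num_offsets, 2)
        sampled = F.grid_sample(bev_feats, grid.clamp(-1, 1),
                                align_corners=False)                    # (B, C, Q, K)
        lidar_ctx = sampled.mean(-1).permute(0, 2, 1)                   # (B, Q, C)

        # (iii) query-conditioned gating: per-agent weighting of the two cues
        gate = self.gate_head(queries).softmax(dim=-1)                  # (B, Q, 2)
        return queries + gate[..., :1] * img_ctx + gate[..., 1:] * lidar_ctx
```

Since the attention, `grid_sample`, and softmax gate are all differentiable, gradients from the detection, tracking, and forecasting losses reach both modalities end to end, which is the property the abstract contrasts with heuristic alignment and discrete selection.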
Related papers
- Depth Completion as Parameter-Efficient Test-Time Adaptation [66.72360181325877]
CAPA is a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) for depth completion. For videos, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency.
arXiv Detail & Related papers (2026-02-16T13:53:23Z)
- LAMP: Data-Efficient Linear Affine Weight-Space Models for Parameter-Controlled 3D Shape Generation and Extrapolation [4.182541493191528]
We introduce LAMP, a framework for controllable and interpretable 3D generation. We evaluate LAMP on two 3D parametric geometry benchmarks: DrivAerNet++ and BlendedNet. Our results demonstrate that LAMP advances controllable, data-efficient, and safe 3D generation.
arXiv Detail & Related papers (2025-10-26T02:12:20Z)
- Unsupervised Conformal Inference: Bootstrapping and Alignment to Control LLM Uncertainty [49.19257648205146]
We propose an unsupervised conformal inference framework for generation. Our gates achieve close-to-nominal coverage and provide tighter, more stable thresholds than split UCP. The result is a label-free, API-compatible gate for test-time filtering.
arXiv Detail & Related papers (2025-09-26T23:40:47Z)
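For intuition, here is a minimal NumPy sketch of the standard split-conformal gate that this entry compares against; the paper's unsupervised, bootstrapped calibration is not reproduced, and the scores and acceptance rule below are illustrative assumptions.

```python
import numpy as np

def split_conformal_threshold(cal_scores, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of calibration
    nonconformity scores; accept a new sample iff its score <= tau."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, level, method="higher")

# Hypothetical usage: gate generations by an uncertainty score.
cal_scores = np.random.rand(500)              # stand-in calibration scores
tau = split_conformal_threshold(cal_scores)
new_score = 0.42                              # stand-in score for one generation
accept = new_score <= tau
```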
- SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving [56.198745862311824]
We introduce SQS, a novel query-based splatting pre-training method for sparse perception models (SPMs). SQS predicts 3D Gaussian representations from sparse queries during pre-training, leveraging self-supervised splatting to learn fine-grained contextual features. Experiments on autonomous driving benchmarks demonstrate that SQS delivers considerable performance gains across multiple query-based 3D perception tasks.
arXiv Detail & Related papers (2025-09-20T09:25:19Z)
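The query-to-Gaussian mapping can be pictured with a small head like the following hypothetical PyTorch sketch; the parameterization (center, scale, quaternion, opacity) and all dimensions are assumptions, not the SQS architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianQueryHead(nn.Module):
    """Map each sparse query embedding to one 3D Gaussian's parameters:
    center, positive scales, unit rotation quaternion, and opacity."""

    def __init__(self, dim=256):
        super().__init__()
        self.head = nn.Linear(dim, 3 + 3 + 4 + 1)

    def forward(self, queries):                      # queries: (B, Q, C)
        out = self.head(queries)
        center = out[..., :3]                        # free 3D position
        scale = out[..., 3:6].exp()                  # exp keeps scales positive
        quat = F.normalize(out[..., 6:10], dim=-1)   # unit quaternion rotation
        opacity = out[..., 10:].sigmoid()            # opacity in (0, 1)
        return center, scale, quat, opacity

gaussians = GaussianQueryHead()(torch.randn(2, 100, 256))
```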
arXiv Detail & Related papers (2025-09-20T09:25:19Z) - DIMM: Decoupled Multi-hierarchy Kalman Filter for 3D Object Tracking [50.038098341549095]
State estimation is challenging for 3D object tracking when targets are highly maneuverable. We propose a novel framework, DIMM, to effectively combine estimates from different motion models in each direction. DIMM improves the tracking accuracy of existing state estimation methods by 31.61% to 99.23%.
arXiv Detail & Related papers (2025-05-18T10:12:41Z)
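The core step behind such multi-model trackers is the classical moment-matched combination of per-model Kalman estimates; the NumPy sketch below shows that step, while DIMM's learned, per-direction hierarchical weighting is not reproduced.

```python
import numpy as np

def mix_motion_models(means, covs, log_liks):
    """Moment-matched mixture of per-model Kalman estimates, weighted by
    normalized model likelihoods (the classical IMM combination step)."""
    w = np.exp(log_liks - log_liks.max())
    w = w / w.sum()
    mean = (w[:, None] * means).sum(axis=0)
    cov = np.zeros_like(covs[0])
    for m, wm in enumerate(w):
        d = (means[m] - mean)[:, None]
        cov += wm * (covs[m] + d @ d.T)   # spread-of-means term keeps cov consistent
    return mean, cov

# Hypothetical constant-velocity vs. constant-turn estimates of a 2D position.
means = np.array([[1.0, 0.5], [1.2, 0.4]])
covs = np.stack([0.1 * np.eye(2), 0.2 * np.eye(2)])
mean, cov = mix_motion_models(means, covs, np.array([-1.0, -2.0]))
```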
- VaLID: Verification as Late Integration of Detections for LiDAR-Camera Fusion [2.503388496100123]
Vehicle object detection benefits from both LiDAR and camera data. We propose VaLID, a model-adaptive late-fusion method that validates whether each predicted bounding box is acceptable. The approach demonstrates competitive, state-of-the-art performance even when using generic camera detectors.
arXiv Detail & Related papers (2024-09-23T20:27:10Z)
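A toy version of late-fusion verification in this spirit: accept a camera-predicted box only if the LiDAR evidence supports it. The axis-aligned box and point-count rule are crude stand-ins for the paper's learned validator.

```python
import numpy as np

def lidar_supports_box(points, center, size, min_points=5):
    """Keep a camera-predicted, axis-aligned box only if enough LiDAR points
    fall inside it (a stand-in for a learned validation model)."""
    lo = np.asarray(center) - np.asarray(size) / 2.0
    hi = np.asarray(center) + np.asarray(size) / 2.0
    inside = np.all((points >= lo) & (points <= hi), axis=1)
    return int(inside.sum()) >= min_points

pts = 5.0 * np.random.randn(1000, 3)          # stand-in LiDAR sweep
keep = lidar_supports_box(pts, center=[0.0, 0.0, 0.0], size=[4.0, 2.0, 1.5])
```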
- Let's Roll: Synthetic Dataset Analysis for Pedestrian Detection Across Different Shutter Types [7.0441427250832644]
This paper studies the impact of different shutter mechanisms on machine learning (ML) object detection models using a synthetic dataset.
In particular, we train and evaluate mainstream detection models on our synthetically generated, paired global-shutter (GS) and rolling-shutter (RS) datasets.
arXiv Detail & Related papers (2023-09-15T04:07:42Z)
- DELO: Deep Evidential LiDAR Odometry using Partial Optimal Transport [23.189529003370303]
Real-time LiDAR-based odometry is imperative for many applications like robot navigation, globally consistent 3D scene map reconstruction, or safe motion-planning.
We introduce a novel deep learning-based, real-time (approx. 35-40 ms per frame) LO method that jointly learns accurate frame-to-frame correspondences and the model's predictive uncertainty (PU) as evidence to safeguard LO predictions.
We evaluate our method on the KITTI dataset and show competitive performance, with even superior generalization ability over recent state-of-the-art approaches.
arXiv Detail & Related papers (2023-08-14T14:06:21Z)
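As background for the "evidence" wording, deep evidential regression summarizes a Normal-Inverse-Gamma head into aleatoric and epistemic variances; the sketch below follows that standard parameterization (the predicted mean gamma is omitted), which may differ from DELO's exact design.

```python
import numpy as np

def evidential_uncertainties(nu, alpha, beta):
    """Aleatoric and epistemic variances of a Normal-Inverse-Gamma evidence
    head (deep evidential regression); requires nu > 0 and alpha > 1."""
    aleatoric = beta / (alpha - 1.0)           # E[sigma^2]: expected data noise
    epistemic = beta / (nu * (alpha - 1.0))    # Var[mu]: uncertainty of the mean
    return aleatoric, epistemic

# Hypothetical evidence values for one frame-to-frame correspondence.
alea, epis = evidential_uncertainties(nu=2.0, alpha=3.0, beta=0.5)
```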
- PTA-Det: Point Transformer Associating Point cloud and Image for 3D Object Detection [3.691671505269693]
Most multi-modal detection methods perform even worse than LiDAR-only methods.
A Pseudo Point Cloud Generation Network is proposed to convert image information into pseudo points. The features of LiDAR points and image-derived pseudo points can then be deeply fused under a unified point-based representation.
arXiv Detail & Related papers (2023-01-18T04:35:49Z)
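The usual mechanism behind pseudo points is pinhole back-projection of pixels with (predicted) depth; a minimal NumPy sketch follows, with the intrinsics and constant depth map as stand-ins for the paper's learned generation network.

```python
import numpy as np

def pixels_to_pseudo_points(depth, K):
    """Back-project every pixel through the pinhole model:
    x_cam = depth(u, v) * K^{-1} [u, v, 1]^T."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T       # camera-frame rays at unit depth
    return rays * depth.reshape(-1, 1)    # (H*W, 3) pseudo points

K = np.array([[500.0, 0.0, 320.0],       # stand-in pinhole intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pts = pixels_to_pseudo_points(np.full((480, 640), 10.0), K)
```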
- Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection [58.81316192862618]
Two critical sensors for 3D perception in autonomous driving are the camera and the LiDAR.
Fusing these two modalities can significantly boost the performance of 3D perception models.
We benchmark the robustness of state-of-the-art fusion methods for the first time.
arXiv Detail & Related papers (2022-05-30T09:35:37Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers by directly translating the image feature map into object detection results.
Applied to the recent transformer-based image recognition model ViT, the approach shows consistent efficiency gains.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)