From Features to Reference Points: Lightweight and Adaptive Fusion for Cooperative Autonomous Driving
- URL: http://arxiv.org/abs/2511.18757v1
- Date: Mon, 24 Nov 2025 04:38:57 GMT
- Title: From Features to Reference Points: Lightweight and Adaptive Fusion for Cooperative Autonomous Driving
- Authors: Yongqi Zhu, Morui Zhu, Qi Chen, Deyuan Qu, Song Fu, Qing Yang
- Abstract summary: RefPtsFusion is a lightweight and interpretable framework for cooperative autonomous driving. Instead of sharing large feature maps or query embeddings, vehicles exchange compact reference points. This approach shifts the focus from "what is seen" to "where to see", creating a sensor- and model-independent interface.
- Score: 11.678736288487888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present RefPtsFusion, a lightweight and interpretable framework for cooperative autonomous driving. Instead of sharing large feature maps or query embeddings, vehicles exchange compact reference points, e.g., objects' positions, velocities, and size information. This approach shifts the focus from "what is seen" to "where to see", creating a sensor- and model-independent interface that works well across vehicles with heterogeneous perception models while greatly reducing communication bandwidth. To enhance the richness of shared information, we further develop a selective Top-K query fusion that adds high-confidence queries from the sender, achieving a strong balance between accuracy and communication cost. Experiments on the M3CAD dataset show that RefPtsFusion maintains stable perception performance while reducing communication overhead by five orders of magnitude, dropping from hundreds of MB/s to only a few KB/s at 5 FPS (frames per second), compared to traditional feature-level fusion methods. Extensive experiments also demonstrate RefPtsFusion's strong robustness and consistent transmission behavior, highlighting its potential for scalable, real-time cooperative driving systems.
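The abstract's two ideas, a compact per-object message and a confidence-gated Top-K selection rule, lend themselves to a short illustration. The Python sketch below is a minimal rendering under assumed field names, counts, and a float32 encoding; note that in the paper the Top-K rule selects the sender's query embeddings, whereas here it is applied to the reference points themselves for simplicity.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReferencePoint:
    # Fields suggested by the abstract; the exact wire format is an assumption.
    x: float; y: float; z: float                 # object position (m)
    vx: float; vy: float                         # planar velocity (m/s)
    length: float; width: float; height: float   # bounding-box size (m)
    confidence: float                            # detection score

def build_message(dets: List[ReferencePoint], k: int = 8,
                  conf_thresh: float = 0.5) -> List[ReferencePoint]:
    """Top-K selection: share only the K most confident detections."""
    kept = sorted((d for d in dets if d.confidence >= conf_thresh),
                  key=lambda d: d.confidence, reverse=True)
    return kept[:k]

# Rough bandwidth check: 9 float32 fields = 36 B per object;
# 8 objects x 36 B x 5 Hz ~ 1.4 KB/s -- the "few KB/s" scale reported,
# versus hundreds of MB/s for feature-map sharing.
```

The back-of-the-envelope arithmetic in the final comment is consistent with the five-orders-of-magnitude reduction claimed in the abstract.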
Related papers
- DM$^3$T: Harmonizing Modalities via Diffusion for Multi-Object Tracking [10.270441242480482]
This paper proposes DM$^3$T, a novel framework that reformulates multimodal fusion as an iterative feature alignment process. Our approach performs iterative cross-modal harmonization through a proposed Cross-Modal Diffusion Fusion (C-MDF) module. To further improve tracking robustness, we design a Hierarchical Tracker that adaptively handles confidence estimation.
arXiv Detail & Related papers (2025-11-28T06:02:58Z)
- CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning scheme and fully exploiting the complementarity across features, our method achieves both high efficiency and accuracy.
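For context, the sketch below shows a generic kernelized linear attention of the kind such mobile-friendly transformers build on; the positive feature map and dimensions are assumptions, and CARE's specific decoupled dual-interactive design is not reproduced here.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    # Positive feature map (a common choice; CARE's own map is unknown here).
    phi = lambda x: np.maximum(x, 0.0) + 1.0
    Qp, Kp = phi(Q), phi(K)                 # (N, d)
    kv = Kp.T @ V                           # (d, d_v): keys/values aggregated once
    z = Qp @ Kp.sum(axis=0)                 # (N,): per-query normalizer
    return (Qp @ kv) / (z[:, None] + eps)   # O(N) cost instead of O(N^2)

N, d = 196, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
out = linear_attention(Q, K, V)             # (196, 64), no N x N attention matrix
```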
arXiv Detail & Related papers (2024-11-25T07:56:13Z)
- Semantic Communication for Cooperative Perception using HARQ [51.148203799109304]
We introduce a cooperative perception semantic communication framework that leverages an importance map to distill critical semantic information.
To counter the challenges posed by time-varying multipath fading, our approach incorporates orthogonal frequency-division multiplexing (OFDM) along with channel estimation and equalization strategies.
We introduce a novel semantic error detection method that is integrated with our semantic communication framework in the spirit of hybrid automatic repeat request (HARQ).
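The HARQ idea itself is a simple control loop: retransmit only when the receiver's semantic error detector rejects the decoded message. A minimal sketch, with every callable a hypothetical placeholder for the paper's encoder, fading channel, decoder, and detector:

```python
def transmit_with_harq(encode, channel, decode, semantic_error, payload,
                       max_retries=3):
    """Retransmit until the receiver's semantic error check passes,
    or the retry budget is exhausted."""
    for attempt in range(1 + max_retries):
        received = channel(encode(payload))   # noisy OFDM link
        message = decode(received)
        if not semantic_error(message):       # receiver-side semantic check
            return message, attempt           # ACK: deliver
        # NACK: loop and retransmit
    return message, 1 + max_retries           # give up, return best effort
```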
arXiv Detail & Related papers (2024-08-29T08:53:26Z)
- HEAD: A Bandwidth-Efficient Cooperative Perception Approach for Heterogeneous Connected and Autonomous Vehicles [9.10239345027499]
HEAD is a method that fuses features from the classification and regression heads in 3D object detection networks.
Our experiments demonstrate that HEAD effectively balances communication bandwidth and perception performance.
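One plausible, purely hypothetical reading of head-level fusion is a per-cell merge of the agents' classification scores and box regressions; the sketch below illustrates that reading, not HEAD's exact scheme.

```python
import numpy as np

def fuse_head_outputs(cls_maps, reg_maps):
    """cls_maps: (A, H, W) per-agent classification scores;
    reg_maps: (A, H, W, 7) per-agent box regressions.
    Per cell, keep the score and box of the most confident agent."""
    best = np.argmax(cls_maps, axis=0)        # (H, W): winning agent index
    fused_cls = np.max(cls_maps, axis=0)      # (H, W)
    h, w = np.indices(best.shape)
    fused_reg = reg_maps[best, h, w]          # (H, W, 7)
    return fused_cls, fused_reg

cls = np.random.rand(3, 64, 64)               # 3 cooperating agents
reg = np.random.randn(3, 64, 64, 7)
fused_cls, fused_reg = fuse_head_outputs(cls, reg)
```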
arXiv Detail & Related papers (2024-08-27T22:05:44Z)
- MR3D-Net: Dynamic Multi-Resolution 3D Sparse Voxel Grid Fusion for LiDAR-Based Collective Perception [0.5714074111744111]
We propose MR3D-Net, a dynamic multi-resolution 3D sparse voxel grid fusion backbone architecture for LiDAR-based collective perception.
We show that sparse voxel grids at varying resolutions provide a meaningful and compact environment representation that can adapt to the communication bandwidth.
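Bandwidth adaptation via resolution choice can be sketched as picking the finest sparse voxel grid whose message fits a byte budget; the voxel sizes and per-voxel encoding below are assumptions for illustration, not MR3D-Net's actual protocol.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Quantize LiDAR points (N, 3) into unique occupied voxel indices."""
    return np.unique(np.floor(points / voxel_size).astype(np.int64), axis=0)

def pick_resolution(points, budget_bytes,
                    voxel_sizes=(0.2, 0.4, 0.8, 1.6),   # finest to coarsest (m)
                    bytes_per_voxel=6):                  # e.g., 3 x int16 indices
    """Send the finest grid that fits the byte budget; otherwise fall
    back to the coarsest grid."""
    for vs in voxel_sizes:
        grid = voxelize(points, vs)
        if grid.shape[0] * bytes_per_voxel <= budget_bytes:
            return vs, grid
    return voxel_sizes[-1], grid      # coarsest grid from the last iteration

points = np.random.uniform(-50, 50, size=(20000, 3))
vs, grid = pick_resolution(points, budget_bytes=8_000)
```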
arXiv Detail & Related papers (2024-08-12T13:27:11Z)
- Fusion-Mamba for Cross-modality Object Detection [63.56296480951342]
Fusing information from different modalities effectively improves object detection performance.
We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction.
Our proposed approach outperforms state-of-the-art methods in mAP by 5.9% on the M3FD and 4.9% on the FLIR-Aligned datasets.
arXiv Detail & Related papers (2024-04-14T05:28:46Z)
- Cooperative Perception with Learning-Based V2V communications [11.772899644895281]
This work analyzes the performance of cooperative perception accounting for communications channel impairments.
A new late fusion scheme is proposed to leverage the robustness of intermediate features.
In order to compress the data size incurred by cooperation, a convolutional neural network-based autoencoder is adopted.
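A toy convolutional autoencoder in this spirit: the sender transmits only the small bottleneck tensor and the receiver reconstructs the feature map. Channel and spatial sizes below are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FeatureAutoencoder(nn.Module):
    def __init__(self, c_in=64, c_code=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(c_in, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, c_code, 3, stride=2, padding=1))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(c_code, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, c_in, 4, stride=2, padding=1))

    def forward(self, x):
        code = self.encoder(x)        # 4x smaller spatially, 8x fewer channels
        return self.decoder(code), code

x = torch.randn(1, 64, 128, 128)      # a BEV-style feature map
recon, code = FeatureAutoencoder()(x)
print(code.numel() / x.numel())       # ~0.8% of the original volume on the wire
```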
arXiv Detail & Related papers (2023-11-17T05:41:23Z)
- Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z)
- TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving [46.409930329699336]
We propose TransFuser, a mechanism to integrate image and LiDAR representations using self-attention.
Our approach uses transformer modules at multiple resolutions to fuse perspective view and bird's eye view feature maps.
We experimentally validate its efficacy on a challenging new benchmark with long routes and dense traffic, as well as on the official leaderboard of the CARLA urban driving simulator.
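The fusion mechanism the summary names, self-attention over image and BEV tokens, can be sketched with a standard transformer layer; the dimensions and single-resolution setup are simplifications of TransFuser's multi-resolution design.

```python
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    """Concatenate perspective-view and BEV tokens and let a single
    self-attention layer mix them; split back per modality afterwards."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat, bev_feat):
        tokens = lambda f: f.flatten(2).transpose(1, 2)  # (B,C,H,W) -> (B,HW,C)
        x = torch.cat([tokens(img_feat), tokens(bev_feat)], dim=1)
        mixed, _ = self.attn(x, x, x)          # every token attends to both views
        x = self.norm(x + mixed)               # residual + norm
        n_img = img_feat.shape[2] * img_feat.shape[3]
        return x[:, :n_img], x[:, n_img:]      # image tokens, BEV tokens

img = torch.randn(1, 128, 16, 16)
bev = torch.randn(1, 128, 16, 16)
img_out, bev_out = SelfAttentionFusion()(img, bev)   # each (1, 256, 128)
```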
arXiv Detail & Related papers (2022-05-31T17:57:19Z)
- COOPERNAUT: End-to-End Driving with Cooperative Perception for Networked Vehicles [54.61668577827041]
We introduce COOPERNAUT, an end-to-end learning model that uses cross-vehicle perception for vision-based cooperative driving.
Our experiments on AutoCastSim suggest that our cooperative perception driving models lead to a 40% improvement in average success rate.
arXiv Detail & Related papers (2022-05-04T17:55:12Z)
- Keypoints-Based Deep Feature Fusion for Cooperative Vehicle Detection of Autonomous Driving [2.6543018470131283]
We propose an efficient keypoints-based deep feature fusion framework, called FPV-RCNN, for collective perception.
Compared to bird's-eye view (BEV) keypoint feature fusion, FPV-RCNN improves detection accuracy by about 14%.
Our method also significantly decreases the CPM size to less than 0.3 KB, about 50 times smaller than the BEV feature-map sharing used in previous works.
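The reported sizes are easy to sanity-check under assumed encodings; the keypoint count and per-keypoint payload below are illustrative guesses, not FPV-RCNN's actual CPM layout.

```python
n_keypoints = 18              # shared keypoints per frame (assumption)
values_per_kp = 4             # xyz plus one feature value (assumption)
cpm_bytes = n_keypoints * values_per_kp * 4   # float32 encoding
print(cpm_bytes)              # 288 B: the sub-0.3 KB scale reported
print(50 * cpm_bytes / 1024)  # ~14 KB: implied size of the BEV feature
                              # sharing it is compared against
```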
arXiv Detail & Related papers (2021-09-23T19:41:02Z)
- Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [59.60483620730437]
We propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention.
Our approach achieves state-of-the-art driving performance while reducing collisions by 76% compared to geometry-based fusion.
arXiv Detail & Related papers (2021-04-19T11:48:13Z)