CoCMT: Communication-Efficient Cross-Modal Transformer for Collaborative Perception
- URL: http://arxiv.org/abs/2503.13504v1
- Date: Thu, 13 Mar 2025 06:41:25 GMT
- Title: CoCMT: Communication-Efficient Cross-Modal Transformer for Collaborative Perception
- Authors: Rujia Wang, Xiangbo Gao, Hao Xiang, Runsheng Xu, Zhengzhong Tu
- Abstract summary: Multi-agent collaborative perception enhances each agent's capabilities by sharing sensing information to cooperatively perform robot perception tasks. Existing representative collaborative perception systems transmit intermediate feature maps, which contain a significant amount of non-critical information. We introduce CoCMT, an object-query-based collaboration framework that optimizes communication bandwidth by selectively extracting and transmitting essential features.
- Score: 14.619784179608361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-agent collaborative perception enhances each agent's perceptual capabilities by sharing sensing information to cooperatively perform robot perception tasks. This approach has proven effective in addressing challenges such as sensor deficiencies, occlusions, and long-range perception. However, existing representative collaborative perception systems transmit intermediate feature maps, such as bird's-eye view (BEV) representations, which contain a significant amount of non-critical information, leading to high communication bandwidth requirements. To enhance communication efficiency while preserving perception capability, we introduce CoCMT, an object-query-based collaboration framework that optimizes communication bandwidth by selectively extracting and transmitting essential features. Within CoCMT, we introduce the Efficient Query Transformer (EQFormer) to effectively fuse multi-agent object queries and implement a synergistic deep supervision to enhance the positive reinforcement between stages, leading to improved overall performance. Experiments on the OPV2V and V2V4Real datasets show that CoCMT outperforms state-of-the-art methods while drastically reducing communication needs. On V2V4Real, our model (Top-50 object queries) requires only 0.416 Mb of bandwidth, 83 times less than SOTA methods, while improving AP70 by 1.1 percent. This efficiency breakthrough enables practical collaborative perception deployment in bandwidth-constrained environments without sacrificing detection accuracy.
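To make the bandwidth arithmetic concrete, below is a minimal sketch of the top-K query-selection idea described in the abstract. The `QuerySelector` module, its confidence head, and all tensor shapes are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of top-K object-query selection for transmission.
# The confidence head and all shapes are assumptions for illustration.
import torch
import torch.nn as nn

class QuerySelector(nn.Module):
    """Score decoder object queries and keep only the top-K for transmission."""

    def __init__(self, dim: int = 256, k: int = 50):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)  # per-query confidence (assumed)
        self.k = k

    def forward(self, queries: torch.Tensor):
        # queries: (N, dim) object queries from one agent's detector
        scores = self.score_head(queries).squeeze(-1)           # (N,)
        topk = torch.topk(scores, k=min(self.k, queries.size(0)))
        return queries[topk.indices], topk.indices              # features to send

selector = QuerySelector()
sent, idx = selector(torch.randn(900, 256))
print(sent.shape)  # torch.Size([50, 256])
```

At fp32, 50 queries of 256 features come to 50 × 256 × 32 bits ≈ 0.41 Mb, the same order as the 0.416 Mb figure reported above.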
Related papers
- Is Intermediate Fusion All You Need for UAV-based Collaborative Perception? [1.8689461238197957]
We propose a novel communication-efficient collaborative perception framework based on late-intermediate fusion, dubbed LIF.
We leverage vision-guided positional embedding (VPE) and box-based virtual augmented feature (BoBEV) to effectively integrate complementary information from various agents.
Experimental results demonstrate that our LIF achieves superior performance with minimal communication bandwidth, proving its effectiveness and practicality.
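As a rough illustration of the late-intermediate idea, the sketch below rasterizes detections received from a peer (late outputs) into an extra channel of the ego agent's BEV feature grid so they can be fused as intermediate features. The abstract does not specify how BoBEV is actually constructed, so this rasterization, the grid size, and the spatial extent are all assumptions.

```python
# Hedged sketch: stamp peer detections back into the ego BEV grid so late
# outputs can be fused as intermediate features. Construction is assumed.
import torch

def boxes_to_bev_channel(boxes_xy: torch.Tensor, grid: int = 128,
                         extent: float = 50.0) -> torch.Tensor:
    """Stamp received box centers (x, y in meters) into a (1, grid, grid) mask."""
    bev = torch.zeros(1, grid, grid)
    ij = ((boxes_xy + extent) / (2 * extent) * grid).long().clamp(0, grid - 1)
    bev[0, ij[:, 1], ij[:, 0]] = 1.0
    return bev

ego_feat = torch.randn(64, 128, 128)                     # ego BEV features
aug = boxes_to_bev_channel(torch.tensor([[3.0, -7.5]]))  # from a peer agent
fused_input = torch.cat([ego_feat, aug], dim=0)          # (65, 128, 128)
```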
arXiv Detail & Related papers (2025-04-30T16:22:14Z)
- Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference [49.77734021302196]
We propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework.
To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features.
Results show that TOFC achieves up to 60% reduction in data transmission overhead and 50% reduction in system latency.
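A toy version of adaptive entropy-model selection might look like the following: per feature group, pick the model that minimizes the estimated bitrate. The candidate set (zero-mean Gaussians of different scales) and the selection criterion are assumptions for illustration; the abstract only states that models are selected from feature characteristics.

```python
# Toy sketch: choose the entropy model minimizing estimated coding cost.
# The Gaussian candidate set and criterion are illustrative assumptions.
import math
import torch

def gaussian_bits(x: torch.Tensor, scale: float) -> float:
    """Estimated bits to code x under a zero-mean Gaussian of given scale."""
    nll = 0.5 * (x / scale) ** 2 + math.log(scale * math.sqrt(2 * math.pi))
    return float(nll.sum() / math.log(2))  # nats -> bits

def pick_entropy_model(feat: torch.Tensor, scales=(0.5, 1.0, 2.0, 4.0)):
    costs = [gaussian_bits(feat, s) for s in scales]
    best = min(range(len(scales)), key=costs.__getitem__)
    return scales[best], costs[best]

scale, bits = pick_entropy_model(torch.randn(256, 16, 16) * 1.3)
print(f"selected scale={scale}, est. bits={bits:.0f}")
```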
arXiv Detail & Related papers (2025-03-17T08:37:22Z)
- CoSDH: Communication-Efficient Collaborative Perception via Supply-Demand Awareness and Intermediate-Late Hybridization [23.958663737034318]
We propose a novel communication-efficient collaborative perception framework based on supply-demand awareness and intermediate-late hybridization.
Experiments on multiple datasets, including both simulated and real-world scenarios, demonstrate that CoSDH achieves state-of-the-art detection accuracy and optimal bandwidth trade-offs.
arXiv Detail & Related papers (2025-03-05T12:02:04Z)
- CoopDETR: A Unified Cooperative Perception Framework for 3D Detection via Object Query [21.010741892266136]
CoopDETR is a novel cooperative perception framework that introduces object-level feature cooperation via object query.
Our framework consists of two key modules: single-agent query generation, which efficiently encodes raw sensor data into object queries, and cross-agent query fusion.
Experiments on the OPV2V and V2XSet datasets demonstrate that CoopDETR achieves state-of-the-art performance and significantly reduces transmission costs to 1/782 of previous methods.
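The object-query route to cooperation can be sketched in a few lines: the ego agent's queries attend over the union of its own and received queries. Standard multi-head attention serves as a stand-in here; the structure of CoopDETR's actual fusion module is not described in the abstract.

```python
# Sketch of cross-agent query fusion: ego queries attend over the union of
# ego and peer queries. MultiheadAttention is a stand-in, not the paper's module.
import torch
import torch.nn as nn

dim = 256
fuse = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

ego_q  = torch.randn(1, 100, dim)   # ego agent's object queries
peer_q = torch.randn(1, 100, dim)   # queries decoded from a peer's message
all_q  = torch.cat([ego_q, peer_q], dim=1)

fused, _ = fuse(query=ego_q, key=all_q, value=all_q)  # (1, 100, dim)
```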
arXiv Detail & Related papers (2025-02-26T17:09:51Z)
- CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
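A hedged sketch of this decoupling: one channel split carries local inductive bias through a depthwise convolution while the other captures long-range dependencies through a linear (O(n)) attention, and a 1x1 projection lets the two interact. The exact CARE block is not specified in the abstract, so the split ratio, the linear-attention variant, and the interaction are assumptions.

```python
# Sketch of decoupled dual-path learning: depthwise conv for local bias,
# linear attention for long-range context. Block structure is assumed.
import torch
import torch.nn as nn

class DecoupledBlock(nn.Module):
    def __init__(self, dim: int = 64, local_ratio: float = 0.5):
        super().__init__()
        self.c_local = int(dim * local_ratio)
        self.local = nn.Conv2d(self.c_local, self.c_local, 3, padding=1,
                               groups=self.c_local)          # local path
        c_glob = dim - self.c_local
        self.qkv = nn.Conv2d(c_glob, 3 * c_glob, 1)          # global path
        self.proj = nn.Conv2d(dim, dim, 1)                   # interaction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        loc, glob = x.split([self.c_local, x.size(1) - self.c_local], dim=1)
        loc = self.local(loc)
        b, c, h, w = glob.shape
        q, k, v = self.qkv(glob).flatten(2).chunk(3, dim=1)  # each (b, c, hw)
        # linear attention: aggregate keys/values once -> O(hw) complexity
        ctx = k.softmax(dim=-1) @ v.transpose(1, 2)          # (b, c, c)
        glob = (q.softmax(dim=1).transpose(1, 2) @ ctx)      # (b, hw, c)
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return self.proj(torch.cat([loc, glob], dim=1))

y = DecoupledBlock()(torch.randn(2, 64, 32, 32))  # (2, 64, 32, 32)
```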
arXiv Detail & Related papers (2024-11-25T07:56:13Z)
- Semantic Communication for Cooperative Perception using HARQ [51.148203799109304]
We leverage an importance map to distill critical semantic information, introducing a cooperative perception semantic communication framework.
To counter the challenges posed by time-varying multipath fading, our approach incorporates orthogonal frequency-division multiplexing (OFDM) along with channel estimation and equalization strategies.
We introduce a novel semantic error detection method that is integrated with our semantic communication framework in the spirit of hybrid automatic repeat request (HARQ).
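The HARQ-flavored loop reduces, in caricature, to: transmit semantic chunks, run an error check on each, and retransmit only the failures. The sketch below simulates the learned semantic error detector with a random stub and omits the OFDM/channel stages entirely; it shows only the retransmission control flow.

```python
# Toy HARQ-style loop: retransmit only chunks that fail a semantic check.
# The random stub replaces the paper's learned error detector.
import random

def semantic_check(chunk_id: int) -> bool:
    """Stand-in for the learned semantic error detector."""
    return random.random() > 0.2  # 20% simulated chunk failures

def harq_send(chunks, max_rounds: int = 4):
    pending, delivered = list(chunks), []
    for _ in range(max_rounds):
        if not pending:
            break
        failed = [c for c in pending if not semantic_check(c)]
        delivered += [c for c in pending if c not in failed]
        pending = failed                 # NACK -> retransmit next round
    return delivered, pending

ok, lost = harq_send(range(10))
print(f"delivered={len(ok)}, undeliverable after retries={len(lost)}")
```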
arXiv Detail & Related papers (2024-08-29T08:53:26Z)
- IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception [9.117534139771738]
Multi-agent collaborative perception has emerged as a widely recognized technology in the field of autonomous driving.
Current collaborative perception predominantly relies on LiDAR point clouds, with significantly less attention given to methods using camera images.
This work proposes an instance-level fusion transformer for visual collaborative perception.
arXiv Detail & Related papers (2024-07-13T11:38:15Z)
- Pragmatic Communication in Multi-Agent Collaborative Perception [80.14322755297788]
Collaborative perception involves a trade-off between perception ability and communication cost.
We propose PragComm, a multi-agent collaborative perception system with two key components.
PragComm consistently outperforms previous methods with more than 32.7K times lower communication volume.
arXiv Detail & Related papers (2024-01-23T11:58:08Z)
- Practical Collaborative Perception: A Framework for Asynchronous and Multi-Agent 3D Object Detection [9.967263440745432]
Occlusion is a major challenge for LiDAR-based object detection methods.
State-of-the-art V2X methods resolve the performance-bandwidth tradeoff using a mid-collaboration approach.
We devise a simple yet effective collaboration method that achieves a better bandwidth-performance tradeoff than prior methods.
arXiv Detail & Related papers (2023-07-04T03:49:42Z)
- Interruption-Aware Cooperative Perception for V2X Communication-Aided Autonomous Driving [49.42873226593071]
We propose V2X communication INterruption-aware COoperative Perception (V2X-INCOP) for V2X communication-aided autonomous driving.
We use historical cooperation information to recover missing information due to the interruptions and alleviate the impact of the interruption issue.
Experiments on three public cooperative perception datasets demonstrate that the proposed method is effective in alleviating the impacts of communication interruption on cooperative perception.
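The recovery mechanism can be caricatured as a per-peer cache: when a peer's message is lost, reuse its last successfully received features. Any motion compensation or learned recovery the paper applies on top of the history is not reproduced in this stub.

```python
# Sketch of history-based recovery: cache each peer's last received features
# and fall back to them when a message is interrupted.
import torch
from typing import Dict, Optional

class PeerBuffer:
    """Caches each peer's last received feature map for interruption recovery."""

    def __init__(self) -> None:
        self.last: Dict[int, torch.Tensor] = {}

    def update(self, peer_id: int,
               feat: Optional[torch.Tensor]) -> Optional[torch.Tensor]:
        if feat is not None:             # message arrived: refresh the cache
            self.last[peer_id] = feat
            return feat
        return self.last.get(peer_id)    # interrupted: fall back to history

buf = PeerBuffer()
buf.update(7, torch.randn(64, 128, 128))   # normal frame from peer 7
recovered = buf.update(7, None)            # interrupted frame -> cached features
```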
arXiv Detail & Related papers (2023-04-24T04:59:13Z)
- ECO-TR: Efficient Correspondences Finding Via Coarse-to-Fine Refinement [80.94378602238432]
We propose an efficient structure named Correspondence Efficient Transformer (ECO-TR) by finding correspondences in a coarse-to-fine manner.
To achieve this, multiple transformer blocks are stage-wisely connected to gradually refine the predicted coordinates.
Experiments on various sparse and dense matching tasks demonstrate the superiority of our method in both efficiency and effectiveness against existing state-of-the-art methods.
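Coarse-to-fine refinement with stage-wise connected blocks can be expressed as a residual update loop over predicted coordinates, as sketched below; plain MLP stages stand in for ECO-TR's transformer blocks, and the descriptor dimension and stage count are arbitrary.

```python
# Sketch of stage-wise coarse-to-fine refinement: each stage predicts an
# offset that residually updates the current correspondence estimate.
import torch
import torch.nn as nn

class RefineStage(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 2, 64), nn.ReLU(),
                                 nn.Linear(64, 2))

    def forward(self, feat: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        return coords + self.mlp(torch.cat([feat, coords], dim=-1))

stages = nn.ModuleList(RefineStage() for _ in range(3))
feat = torch.randn(500, 128)          # per-point descriptors (assumed)
coords = torch.rand(500, 2)           # coarse correspondence estimates
for stage in stages:                  # gradually refine the predictions
    coords = stage(feat, coords)
```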
arXiv Detail & Related papers (2022-09-25T13:05:33Z)