RG-Attn: Radian Glue Attention for Multi-modality Multi-agent Cooperative Perception
- URL: http://arxiv.org/abs/2501.16803v3
- Date: Wed, 24 Sep 2025 07:11:50 GMT
- Title: RG-Attn: Radian Glue Attention for Multi-modality Multi-agent Cooperative Perception
- Authors: Lantao Li, Kang Yang, Wenqi Zhang, Xiaoxue Wang, Chen Sun
- Abstract summary: Radian Glue Attention (RG-Attn) is a lightweight and generalizable cross-modal fusion module. RG-Attn efficiently aligns features through a radian-based attention constraint. Paint-To-Puzzle (PTP) prioritizes communication efficiency but assumes all agents have LiDAR. CoS-CoCo offers maximal flexibility, supporting any sensor setup. Pyramid-RG-Attn Fusion (PRGAF) aims for peak detection accuracy with the highest computational overhead.
- Score: 14.450341173771486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cooperative perception enhances autonomous driving by leveraging Vehicle-to-Everything (V2X) communication for multi-agent sensor fusion. However, most existing methods rely on single-modal data sharing, limiting fusion performance, particularly in heterogeneous sensor settings involving both LiDAR and cameras across vehicles and roadside units (RSUs). To address this, we propose Radian Glue Attention (RG-Attn), a lightweight and generalizable cross-modal fusion module that unifies intra-agent and inter-agent fusion via transformation-based coordinate alignment and a unified sampling/inversion strategy. RG-Attn efficiently aligns features through a radian-based attention constraint, operating column-wise on geometrically consistent regions to reduce overhead and preserve spatial coherence, thereby enabling accurate and robust fusion. Building upon RG-Attn, we propose three cooperative architectures. The first, Paint-To-Puzzle (PTP), prioritizes communication efficiency but assumes all agents have LiDAR, optionally paired with cameras. The second, Co-Sketching-Co-Coloring (CoS-CoCo), offers maximal flexibility, supporting any sensor setup (e.g., LiDAR-only, camera-only, or both) and enabling strong cross-modal generalization for real-world deployment. The third, Pyramid-RG-Attn Fusion (PRGAF), aims for peak detection accuracy with the highest computational overhead. Extensive evaluations on simulated and real-world datasets show our framework delivers state-of-the-art detection accuracy with high flexibility and efficiency. GitHub Link: https://github.com/LantaoLi/RG-Attn
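The following is a minimal, self-contained sketch of the core idea described in the abstract: restricting cross-modal attention between two BEV feature maps to angular (radian) sectors around the map origin, so fusion operates only within geometrically consistent regions. This is not the authors' released code; the module name, the sector count, and the use of standard multi-head attention are assumptions made for illustration only.
```python
# Hedged sketch of radian-constrained cross-modal BEV fusion (assumed names/params).
import math
import torch
import torch.nn as nn

class RadianGlueAttentionSketch(nn.Module):
    def __init__(self, channels: int, num_sectors: int = 36, num_heads: int = 4):
        super().__init__()
        self.num_sectors = num_sectors
        # Standard multi-head attention stands in for the paper's attention operator.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def sector_ids(self, h: int, w: int, device) -> torch.Tensor:
        # Assign each BEV cell an angular sector index relative to the map centre.
        ys, xs = torch.meshgrid(
            torch.arange(h, device=device, dtype=torch.float32) - (h - 1) / 2,
            torch.arange(w, device=device, dtype=torch.float32) - (w - 1) / 2,
            indexing="ij",
        )
        theta = torch.atan2(ys, xs)  # radians in (-pi, pi]
        ids = (theta + math.pi) / (2 * math.pi) * self.num_sectors
        return ids.long().clamp(max=self.num_sectors - 1)

    def forward(self, query_bev: torch.Tensor, key_bev: torch.Tensor) -> torch.Tensor:
        # query_bev, key_bev: (B, C, H, W) BEV features from two modalities or agents,
        # assumed already transformed into a common coordinate frame.
        b, c, h, w = query_bev.shape
        sid = self.sector_ids(h, w, query_bev.device).flatten()   # (H*W,)
        q = query_bev.flatten(2).transpose(1, 2)                   # (B, H*W, C)
        k = key_bev.flatten(2).transpose(1, 2)
        out = q.clone()
        for s in range(self.num_sectors):
            idx = (sid == s).nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            # Attention is confined to cells within the same angular sector.
            fused, _ = self.attn(q[:, idx], k[:, idx], k[:, idx])
            out[:, idx] = fused
        return out.transpose(1, 2).reshape(b, c, h, w)

# Example usage (hypothetical shapes): channels must be divisible by num_heads.
# fuse = RadianGlueAttentionSketch(channels=64)
# fused = fuse(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128))
```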
Related papers
- HeCoFuse: Cross-Modal Complementary V2X Cooperative Perception with Heterogeneous Sensors [10.154689913045447]
HeCoFuse is a unified framework designed for cooperative perception across mixed sensor setups. HeCoFuse can tackle critical challenges such as cross-modality feature misalignment and imbalanced representation quality. Experiments on the real-world TUMTraf-V2X dataset demonstrate that HeCoFuse achieves 43.22% 3D mAP.
arXiv Detail & Related papers (2025-07-18T06:02:22Z) - AnyMAC: Cascading Flexible Multi-Agent Collaboration via Next-Agent Prediction [70.60422261117816]
We propose a new framework that rethinks multi-agent coordination through a sequential structure rather than a graph structure. Our method focuses on two key directions: (1) Next-Agent Prediction, which selects the most suitable agent role at each step, and (2) Next-Context Selection, which enables each agent to selectively access relevant information from any previous step.
arXiv Detail & Related papers (2025-06-21T18:34:43Z) - Is Discretization Fusion All You Need for Collaborative Perception? [5.44403620979893]
This paper proposes a novel Anchor-Centric paradigm for Collaborative Object detection (ACCO).
It avoids grid precision issues and allows more flexible and efficient anchor-centric communication and fusion.
Comprehensive experiments are conducted to evaluate ACCO on the OPV2V and DAIR-V2X datasets.
arXiv Detail & Related papers (2025-03-18T06:25:03Z) - Integrating Extra Modality Helps Segmentor Find Camouflaged Objects Well [23.460400679372714]
MultiCOS is a novel framework that effectively leverages diverse data modalities to improve segmentation performance. BFSer outperforms existing multimodal baselines with both real and pseudo-modal data.
arXiv Detail & Related papers (2025-02-20T11:49:50Z) - AgentAlign: Misalignment-Adapted Multi-Agent Perception for Resilient Inter-Agent Sensor Correlations [8.916036880001734]
Existing research overlooks the fragile multi-sensor correlations in multi-agent settings. AgentAlign is a real-world heterogeneous agent cross-modality feature alignment framework. We present a novel V2XSet-noise dataset that simulates realistic sensor imperfections under diverse environmental conditions.
arXiv Detail & Related papers (2024-12-09T01:51:18Z) - CoMiX: Cross-Modal Fusion with Deformable Convolutions for HSI-X Semantic Segmentation [10.26122715098048]
CoMiX is an asymmetric encoder-decoder architecture with deformable convolutions (DCNs) for HSI-X semantic segmentation.
CoMiX is designed to extract, calibrate, and fuse information from HSI and X data.
arXiv Detail & Related papers (2024-11-13T21:00:28Z) - FlatFusion: Delving into Details of Sparse Transformer-based Camera-LiDAR Fusion for Autonomous Driving [63.96049803915402]
The integration of data from diverse sensor modalities constitutes a prevalent methodology within the ambit of autonomous driving scenarios. Recent advancements in efficient point cloud transformers have underscored the efficacy of integrating information in sparse formats. In this paper, we conduct a comprehensive exploration of design choices for Transformer-based sparse camera-LiDAR fusion.
arXiv Detail & Related papers (2024-08-13T11:46:32Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - What Makes Good Collaborative Views? Contrastive Mutual Information Maximization for Multi-Agent Perception [52.41695608928129]
Multi-agent perception (MAP) allows autonomous systems to understand complex environments by interpreting data from multiple sources.
This paper investigates intermediate collaboration for MAP with a specific focus on exploring "good" properties of collaborative views.
We propose a novel framework named CMiMC for intermediate collaboration.
arXiv Detail & Related papers (2024-03-15T07:18:55Z) - AgentScope: A Flexible yet Robust Multi-Agent Platform [66.64116117163755]
AgentScope is a developer-centric multi-agent platform with message exchange as its core communication mechanism.
Its abundant syntactic tools, built-in agents and service functions, user-friendly interfaces for application demonstration and utility monitoring, zero-code programming workstation, and automatic prompt tuning mechanism significantly lower the barriers to both development and deployment.
arXiv Detail & Related papers (2024-02-21T04:11:28Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance compared with both full fine-tuning methods and prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - MACP: Efficient Model Adaptation for Cooperative Perception [23.308578463976804]
We propose a new framework termed MACP, which equips a single-agent pre-trained model with cooperation capabilities.
We demonstrate in experiments that the proposed framework can effectively utilize cooperative observations and outperform other state-of-the-art approaches.
arXiv Detail & Related papers (2023-10-25T14:24:42Z) - BM2CP: Efficient Collaborative Perception with LiDAR-Camera Modalities [5.034692611033509]
We propose a collaborative perception paradigm, BM2CP, which employs LiDAR and camera to achieve efficient multi-modal perception.
It can cope with the special case where a sensor of any agent, of the same or a different type, is missing.
Our approach outperforms state-of-the-art methods with 50x lower communication volume in both simulated and real-world autonomous driving scenarios.
arXiv Detail & Related papers (2023-10-23T08:45:12Z) - LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment [63.83894701779067]
We propose LCPS, the first LiDAR-Camera Panoptic Segmentation network.
In our approach, we conduct LiDAR-Camera fusion in three stages.
Our fusion strategy improves PQ by about 6.9% over the LiDAR-only baseline on the nuScenes dataset.
arXiv Detail & Related papers (2023-08-03T10:57:58Z) - Plug-and-Play Regulators for Image-Text Matching [76.28522712930668]
Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching.
We develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations.
Experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models.
arXiv Detail & Related papers (2023-03-23T15:42:05Z) - A Generalized Multi-Modal Fusion Detection Framework [7.951044844083936]
LiDAR point clouds have become the most common data source in autonomous driving.
Due to the sparsity of point clouds, accurate and reliable detection cannot be achieved in specific scenarios.
We propose a generic 3D detection framework called MMFusion, using multi-modal features.
arXiv Detail & Related papers (2023-03-13T12:38:07Z) - FusionRCNN: LiDAR-Camera Fusion for Two-stage 3D Object Detection [11.962073589763676]
Existing 3D detectors significantly improve the accuracy by adopting a two-stage paradigm.
The sparsity of point clouds, especially for the points far away, makes it difficult for the LiDAR-only refinement module to accurately recognize and locate objects.
We propose a novel multi-modality two-stage approach named FusionRCNN, which effectively and efficiently fuses point clouds and camera images in the Regions of Interest (RoI).
FusionRCNN significantly improves the strong SECOND baseline by 6.14% mAP and outperforms competing two-stage approaches.
arXiv Detail & Related papers (2022-09-22T02:07:25Z) - TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers [49.689566246504356]
We propose TransFusion, a robust solution to LiDAR-camera fusion with a soft-association mechanism to handle inferior image conditions.
TransFusion achieves state-of-the-art performance on large-scale datasets.
We extend the proposed method to the 3D tracking task and achieve first place on the nuScenes tracking leaderboard.
arXiv Detail & Related papers (2022-03-22T07:15:13Z) - Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that is highly effective for both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z) - Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion [63.72912507445662]
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network.
First, we verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder.
Second, we propose a bidirectional multi-layer fusion scheme, where multimodal features can be exploited progressively.
arXiv Detail & Related papers (2021-08-11T03:42:13Z) - LiDAR-based Panoptic Segmentation via Dynamic Shifting Network [56.71765153629892]
LiDAR-based panoptic segmentation aims to parse both objects and scenes in a unified manner.
We propose the Dynamic Shifting Network (DS-Net), which serves as an effective panoptic segmentation framework in the point cloud realm.
Our proposed DS-Net achieves superior accuracies over current state-of-the-art methods.
arXiv Detail & Related papers (2020-11-24T08:44:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.