RG-Attn: Radian Glue Attention for Multi-modality Multi-agent Cooperative Perception
- URL: http://arxiv.org/abs/2501.16803v1
- Date: Tue, 28 Jan 2025 09:08:31 GMT
- Title: RG-Attn: Radian Glue Attention for Multi-modality Multi-agent Cooperative Perception
- Authors: Lantao Li, Kang Yang, Wenqi Zhang, Xiaoxue Wang, Chen Sun
- Abstract summary: Cooperative perception leverages Vehicle-to-Everything (V2X) communication to overcome the perception limitations of single-agent systems.
We propose two different architectures, named Paint-To-Puzzle (PTP) and Co-Sketching-Co-Coloring (CoS-CoCo), for conducting cooperative perception.
Our approach achieves state-of-the-art (SOTA) performance on both real and simulated cooperative perception datasets.
- Score: 12.90369816793173
- Abstract: Cooperative perception offers an optimal solution to overcome the perception limitations of single-agent systems by leveraging Vehicle-to-Everything (V2X) communication for data sharing and fusion across multiple agents. However, most existing approaches focus on single-modality data exchange, limiting the potential of both homogeneous and heterogeneous fusion across agents. This overlooks the opportunity to utilize multi-modality data per agent, restricting the system's performance. In the automotive industry, manufacturers adopt diverse sensor configurations, resulting in heterogeneous combinations of sensor modalities across agents. To harness the potential of every possible data source for optimal performance, we design a robust LiDAR and camera cross-modality fusion module, Radian-Glue-Attention (RG-Attn), applicable to both intra-agent and inter-agent cross-modality fusion scenarios, owing to the convenient coordinate conversion via transformation matrices and the unified sampling/inversion mechanism. We also propose two different architectures, named Paint-To-Puzzle (PTP) and Co-Sketching-Co-Coloring (CoS-CoCo), for conducting cooperative perception. PTP aims for maximum precision and achieves a smaller data packet size by limiting cross-agent fusion to a single instance, but requires all participants to be equipped with LiDAR. In contrast, CoS-CoCo supports agents with any configuration: LiDAR-only, camera-only, or both LiDAR and camera, offering greater generalization ability. Our approach achieves state-of-the-art (SOTA) performance on both real and simulated cooperative perception datasets. The code will be released at GitHub in early 2025.
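The abstract stays at a high level, so the following is only a minimal PyTorch sketch of the general idea of cross-modality attention fusion on a shared bird's-eye-view (BEV) grid. The module name, tensor shapes, single attention layer, and the assumption that camera features have already been warped into the LiDAR BEV frame via the extrinsic transformation matrix are illustrative simplifications, not the RG-Attn implementation.

```python
# Hypothetical sketch: LiDAR BEV features attend to camera features that were
# previously projected onto the same BEV grid. Not the authors' code.
import torch
import torch.nn as nn

class CrossModalityFusion(nn.Module):
    """Fuse camera features into LiDAR BEV features with cross-attention."""

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, lidar_bev: torch.Tensor, cam_bev: torch.Tensor) -> torch.Tensor:
        # lidar_bev, cam_bev: (B, C, H, W) feature maps on the same BEV grid.
        b, c, h, w = lidar_bev.shape
        q = lidar_bev.flatten(2).transpose(1, 2)   # (B, H*W, C) queries from LiDAR
        kv = cam_bev.flatten(2).transpose(1, 2)    # (B, H*W, C) keys/values from camera
        fused, _ = self.attn(q, kv, kv)            # LiDAR queries attend to camera features
        fused = self.norm(q + fused)               # residual connection + layer norm
        return fused.transpose(1, 2).reshape(b, c, h, w)

# Toy usage: fuse one agent's LiDAR BEV map with its (already warped) camera BEV map.
fusion = CrossModalityFusion(channels=64, num_heads=4)
out = fusion(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```

In an inter-agent setting, the same pattern would apply after transforming the sender's features into the receiver's coordinate frame, which is the role the abstract assigns to the transformation matrix and the unified sampling/inversion mechanism.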
Related papers
- AgentAlign: Misalignment-Adapted Multi-Agent Perception for Resilient Inter-Agent Sensor Correlations [8.916036880001734]
Existing research overlooks the fragile multi-sensor correlations in multi-agent settings.
AgentAlign is a cross-modality feature alignment framework for real-world heterogeneous agents.
We present a novel V2XSet-noise dataset that simulates realistic sensor imperfections under diverse environmental conditions.
arXiv Detail & Related papers (2024-12-09T01:51:18Z)
- CoMiX: Cross-Modal Fusion with Deformable Convolutions for HSI-X Semantic Segmentation [10.26122715098048]
CoMiX is an asymmetric encoder-decoder architecture with deformable convolutions (DCNs) for HSI-X semantic segmentation.
CoMiX is designed to extract, calibrate, and fuse information from HSI and X data.
arXiv Detail & Related papers (2024-11-13T21:00:28Z)
- Cooperative and Asynchronous Transformer-based Mission Planning for Heterogeneous Teams of Mobile Robots [1.1049608786515839]
We propose a Cooperative and Asynchronous Transformer-based Mission Planning (CATMiP) framework to coordinate distributed decision making among agents.
We evaluate CATMiP in a 2D grid-world simulation environment and compare its performance against planning-based exploration methods.
arXiv Detail & Related papers (2024-10-08T21:14:09Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- What Makes Good Collaborative Views? Contrastive Mutual Information Maximization for Multi-Agent Perception [52.41695608928129]
Multi-agent perception (MAP) allows autonomous systems to understand complex environments by interpreting data from multiple sources.
This paper investigates intermediate collaboration for MAP with a specific focus on exploring "good" properties of collaborative view.
We propose a novel framework named CMiMC for intermediate collaboration.
arXiv Detail & Related papers (2024-03-15T07:18:55Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective lightweight feature adapter that transfers modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
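For illustration only, a generic bottleneck adapter that injects one modality's tokens into another's can be sketched as below; the dimensions, the additive injection, and the module name LightAdapter are hypothetical and not taken from the paper.

```python
# Hypothetical bottleneck adapter: compress source-modality tokens and add them
# to the target modality's tokens. Illustrative only.
import torch
import torch.nn as nn

class LightAdapter(nn.Module):
    def __init__(self, dim: int = 256, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # small bottleneck keeps the adapter light
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, src_tokens: torch.Tensor, dst_tokens: torch.Tensor) -> torch.Tensor:
        return dst_tokens + self.up(torch.relu(self.down(src_tokens)))

adapter = LightAdapter()
rgb = torch.randn(1, 196, 256)       # e.g. RGB patch tokens
thermal = torch.randn(1, 196, 256)   # e.g. thermal patch tokens
thermal_prompted = adapter(rgb, thermal)
```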
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- MACP: Efficient Model Adaptation for Cooperative Perception [23.308578463976804]
We propose a new framework termed MACP, which equips a single-agent pre-trained model with cooperation capabilities.
We demonstrate in experiments that the proposed framework can effectively utilize cooperative observations and outperform other state-of-the-art approaches.
arXiv Detail & Related papers (2023-10-25T14:24:42Z)
- BM2CP: Efficient Collaborative Perception with LiDAR-Camera Modalities [5.034692611033509]
We propose a collaborative perception paradigm, BM2CP, which employs LiDAR and camera to achieve efficient multi-modal perception.
It can cope with the special case where any agent is missing one of its sensors, whether of the same or a different type.
Our approach outperforms state-of-the-art methods with 50x lower communication volume in both simulated and real-world autonomous driving scenarios.
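How a fusion front-end might tolerate a missing sensor can be illustrated with a trivial fallback like the one below; this is a generic sketch under the assumption of pre-aligned feature maps, not BM2CP's actual mechanism.

```python
# Generic sketch: fuse whichever modality features are available, so an agent
# with a failed sensor can still contribute. Not BM2CP's implementation.
from typing import Optional
import torch

def fuse_available(lidar_feat: Optional[torch.Tensor],
                   cam_feat: Optional[torch.Tensor]) -> torch.Tensor:
    feats = [f for f in (lidar_feat, cam_feat) if f is not None]
    if not feats:
        raise ValueError("at least one modality must be available")
    return torch.stack(feats, dim=0).mean(dim=0)  # simple average of available features

# An agent with a broken camera still contributes its LiDAR features.
bev = fuse_available(torch.randn(1, 64, 32, 32), None)
```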
arXiv Detail & Related papers (2023-10-23T08:45:12Z)
- Plug-and-Play Regulators for Image-Text Matching [76.28522712930668]
Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching.
We develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations.
Experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models.
arXiv Detail & Related papers (2023-03-23T15:42:05Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that is highly effective for both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion [63.72912507445662]
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network.
We verify that multimodal features can be learned within a single shared network by merely maintaining modality-specific batch normalization layers in the encoder (a minimal sketch of this idea follows this entry).
Secondly, we propose a bidirectional multi-layer fusion scheme in which multimodal features are exploited progressively.
arXiv Detail & Related papers (2021-08-11T03:42:13Z)
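The last entry above describes keeping a single shared network and separating only the batch normalization per modality. A minimal sketch of that idea, with illustrative layer sizes, a two-modality setup, and identical input channels assumed for simplicity (not the authors' architecture), might look like this:

```python
# Illustrative shared encoder: convolution weights are shared across modalities,
# while each modality keeps its own BatchNorm parameters and running statistics.
import torch
import torch.nn as nn

class SharedEncoderModalityBN(nn.Module):
    def __init__(self, in_ch: int = 3, out_ch: int = 64, num_modalities: int = 2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, out_ch, kernel_size=3, padding=1)
        self.bn1 = nn.ModuleList([nn.BatchNorm2d(32) for _ in range(num_modalities)])
        self.bn2 = nn.ModuleList([nn.BatchNorm2d(out_ch) for _ in range(num_modalities)])

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        x = torch.relu(self.bn1[modality](self.conv1(x)))
        x = torch.relu(self.bn2[modality](self.conv2(x)))
        return x

encoder = SharedEncoderModalityBN()
rgb_feat = encoder(torch.randn(2, 3, 64, 64), modality=0)    # modality 0, e.g. RGB
depth_feat = encoder(torch.randn(2, 3, 64, 64), modality=1)  # modality 1, e.g. depth
```

Only the normalization parameters and running statistics differ between the two calls; every convolution weight in the encoder is reused across modalities.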