IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception
- URL: http://arxiv.org/abs/2407.09857v1
- Date: Sat, 13 Jul 2024 11:38:15 GMT
- Title: IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception
- Authors: Shaohong Wang, Lu Bin, Xinyu Xiao, Zhiyu Xiang, Hangguan Shan, Eryun Liu,
- Abstract summary: Multi-agent collaborative perception has emerged as a widely recognized technology in the field of autonomous driving.
Current collaborative perception predominantly relies on LiDAR point clouds, with significantly less attention given to methods using camera images.
This work proposes an instance-level fusion transformer for visual collaborative perception.
- Score: 9.117534139771738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-agent collaborative perception has emerged as a widely recognized technology in the field of autonomous driving in recent years. However, current collaborative perception predominantly relies on LiDAR point clouds, with significantly less attention given to methods using camera images. This severely impedes the development of budget-constrained collaborative systems and the exploitation of the advantages offered by the camera modality. This work proposes an instance-level fusion transformer for visual collaborative perception (IFTR), which enhances the detection performance of camera-only collaborative perception systems through the communication and sharing of visual features. To capture the visual information from multiple agents, we design an instance feature aggregation that interacts with the visual features of individual agents using predefined grid-shaped bird eye view (BEV) queries, generating more comprehensive and accurate BEV features. Additionally, we devise a cross-domain query adaptation as a heuristic to fuse 2D priors, implicitly encoding the candidate positions of targets. Furthermore, IFTR optimizes communication efficiency by sending instance-level features, achieving an optimal performance-bandwidth trade-off. We evaluate the proposed IFTR on a real dataset, DAIR-V2X, and two simulated datasets, OPV2V and V2XSet, achieving performance improvements of 57.96%, 9.23% and 12.99% in AP@70 metrics compared to the previous SOTAs, respectively. Extensive experiments demonstrate the superiority of IFTR and the effectiveness of its key components. The code is available at https://github.com/wangsh0111/IFTR.
Related papers
- UVCPNet: A UAV-Vehicle Collaborative Perception Network for 3D Object Detection [11.60579201022641]
We propose a framework specifically designed for aerial-ground collaboration.
We develop a virtual dataset named V2U-COO for our research.
Second, we design a Cross-Domain Cross-Adaptation (CDCA) module to align the target information.
Third, we introduce a Collaborative Depth Optimization (CDO) module to obtain more precise depth estimation results.
arXiv Detail & Related papers (2024-06-07T05:25:45Z) - BEV$^2$PR: BEV-Enhanced Visual Place Recognition with Structural Cues [44.96177875644304]
We propose a new image-based visual place recognition (VPR) framework by exploiting the structural cues in bird's-eye view (BEV) from a single camera.
The BEV$2$PR framework generates a composite descriptor with both visual cues and spatial awareness based on a single camera.
arXiv Detail & Related papers (2024-03-11T10:46:43Z) - VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x less training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z) - VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by
Visual-Semantic Fusion for Egocentric Action Anticipation [33.41226268323332]
Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions in the first-person view.
Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network.
We propose a novel visual-semantic fusion enhanced and Transformer GRU-based action anticipation framework.
arXiv Detail & Related papers (2023-07-08T06:49:54Z) - Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformer to model the local and global information.
Based on FASA, we develop a family of lightweight vision backbones, Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z) - Transformer-based Context Condensation for Boosting Feature Pyramids in
Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF)
We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results.
In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z) - CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse
Transformers [36.838065731893735]
CoBEVT is the first generic multi-agent perception framework that can cooperatively generate BEV map predictions.
CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation.
arXiv Detail & Related papers (2022-07-05T17:59:28Z) - Masked Transformer for Neighhourhood-aware Click-Through Rate Prediction [74.52904110197004]
We propose Neighbor-Interaction based CTR prediction, which put this task into a Heterogeneous Information Network (HIN) setting.
In order to enhance the representation of the local neighbourhood, we consider four types of topological interaction among the nodes.
We conduct comprehensive experiments on two real world datasets and the experimental results show that our proposed method outperforms state-of-the-art CTR models significantly.
arXiv Detail & Related papers (2022-01-25T12:44:23Z) - EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation for many applications, such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF)
We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
arXiv Detail & Related papers (2021-06-21T10:47:26Z) - Adversarial Feature Augmentation and Normalization for Visual
Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.