Related papers: INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception

INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception

URL: http://arxiv.org/abs/2509.23700v1
Date: Sun, 28 Sep 2025 07:16:32 GMT
Title: INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception
Authors: Yunjiang Xu, Lingzhi Li, Jin Wang, Yupeng Ouyang, Benyuan Yang,
Abstract summary: Collaborative perception systems overcome single-vehicle limitations by integrating multi-agent sensory data, improving accuracy and safety.<n>Previous works proves that query-based instance-level interaction reduces bandwidth demands and manual priors, however, LiDAR-focused implementations in collaborative perception remain underdeveloped.<n>We propose INSTINCT, a novel collaborative perception framework featuring three core components: 1) a quality-aware filtering mechanism for high-quality instance feature selection; 2) a dual-branch detection routing scheme to decouple collaboration-irrelevant and collaboration-relevant instances; and 3) a Cross Agent Local Instance Fusion module to aggregate local hybrid instance features.
Score: 6.018757656052237
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Collaborative perception systems overcome single-vehicle limitations in long-range detection and occlusion scenarios by integrating multi-agent sensory data, improving accuracy and safety. However, frequent cooperative interactions and real-time requirements impose stringent bandwidth constraints. Previous works proves that query-based instance-level interaction reduces bandwidth demands and manual priors, however, LiDAR-focused implementations in collaborative perception remain underdeveloped, with performance still trailing state-of-the-art approaches. To bridge this gap, we propose INSTINCT (INSTance-level INteraCtion ArchiTecture), a novel collaborative perception framework featuring three core components: 1) a quality-aware filtering mechanism for high-quality instance feature selection; 2) a dual-branch detection routing scheme to decouple collaboration-irrelevant and collaboration-relevant instances; and 3) a Cross Agent Local Instance Fusion module to aggregate local hybrid instance features. Additionally, we enhance the ground truth (GT) sampling technique to facilitate training with diverse hybrid instance features. Extensive experiments across multiple datasets demonstrate that INSTINCT achieves superior performance. Specifically, our method achieves an improvement in accuracy 13.23%/33.08% in DAIR-V2X and V2V4Real while reducing the communication bandwidth to 1/281 and 1/264 compared to state-of-the-art methods. The code is available at https://github.com/CrazyShout/INSTINCT.

Related papers

CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception [9.983779569276475]
Collaborative Alignment and Transformation Network (CATNet) is an adaptive compensation framework that resolves temporal latency and noise interference in multi-agent systems.<n>Our key innovations can be summarized in three aspects. First, we introduce a Spatio-Temporal Recurrent Synchronization (STSync) that aligns asynchronous feature streams.<n>Second, we design a Dual-Branch Wavelet Enhanced Denoiser (WTDen) that suppresses global noise and reconstructs localized feature distortions.<n>Third, we construct an Adaptive Feature Selector (AdpSel) that dynamically focuses on critical perceptual features for robust fusion.
arXiv Detail & Related papers (2026-03-05T15:07:36Z)
MCOP: Multi-UAV Collaborative Occupancy Prediction [40.58729551462363]
Current Bird's Eye View (BEV)-based approaches exhibit two main limitations.<n>We propose a novel multi-UAV collaborative occupancy prediction framework.<n>Our method achieves state-of-the-art accuracy, significantly outperforming existing collaborative methods.
arXiv Detail & Related papers (2025-10-14T16:17:42Z)
CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception [13.32869419720427]
We propose CoopTrack, a fully instance-level end-to-end framework for cooperative tracking.<n>CoopTrack features learnable instance association, which fundamentally differs from existing approaches.<n>Experiments on both the V2X-Seq and Griffin datasets demonstrate that CoopTrack achieves excellent performance.
arXiv Detail & Related papers (2025-07-25T13:04:54Z)
Dynamic Cross-Modal Feature Interaction Network for Hyperspectral and LiDAR Data Classification [66.59320112015556]
Hyperspectral image (HSI) and LiDAR data joint classification is a challenging task.<n>We propose a novel Dynamic Cross-Modal Feature Interaction Network (DCMNet)<n>Our approach introduces three feature interaction blocks: Bilinear Spatial Attention Block (BSAB), Bilinear Channel Attention Block (BCAB), and Integration Convolutional Block (ICB)
arXiv Detail & Related papers (2025-03-10T05:50:13Z)
S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR) Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection. In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
Practical Collaborative Perception: A Framework for Asynchronous and Multi-Agent 3D Object Detection [9.967263440745432]
Occlusion is a major challenge for LiDAR-based object detection methods. State-of-the-art V2X methods resolve the performance-bandwidth tradeoff using a mid-collaboration approach. We devise a simple yet effective collaboration method that achieves a better bandwidth-performance tradeoff than prior methods.
arXiv Detail & Related papers (2023-07-04T03:49:42Z)
HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot Action Recognition [51.2715005161475]
We propose a novel Hybrid Relation guided temporal Set Matching approach for few-shot action recognition. The core idea of HyRSM++ is to integrate all videos within the task to learn discriminative representations. We show that our method achieves state-of-the-art performance under various few-shot settings.
arXiv Detail & Related papers (2023-01-09T13:32:50Z)
Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation [103.90033029330527]
Few-Shot Instance (FSIS) requires detecting and segmenting novel classes with limited support examples. We introduce a unified framework, Reference Twice (RefT), to exploit the relationship between support and query features for FSIS.
arXiv Detail & Related papers (2023-01-03T15:33:48Z)
Hybrid Relation Guided Set Matching for Few-shot Action Recognition [51.3308583226322]
We propose a novel Hybrid Relation guided Set Matching (HyRSM) approach that incorporates two key components. The purpose of the hybrid relation module is to learn task-specific embeddings by fully exploiting associated relations within and cross videos in an episode. We evaluate HyRSM on six challenging benchmarks, and the experimental results show its superiority over the state-of-the-art methods by a convincing margin.
arXiv Detail & Related papers (2022-04-28T11:43:41Z)
Asynchronous Interaction Aggregation for Action Detection [43.34864954534389]
We propose the Asynchronous Interaction Aggregation network (AIA) that leverages different interactions to boost action detection. There are two key designs in it: one is the Interaction Aggregation structure (IA) adopting a uniform paradigm to model and integrate multiple types of interaction; the other is the Asynchronous Memory Update algorithm (AMU) that enables us to achieve better performance.
arXiv Detail & Related papers (2020-04-16T07:03:20Z)
Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding. At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network. With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.