Related papers: CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception

URL: http://arxiv.org/abs/2507.19239v1
Date: Fri, 25 Jul 2025 13:04:54 GMT
Title: CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
Authors: Jiaru Zhong, Jiahao Wang, Jiahui Xu, Xiaofan Li, Zaiqing Nie, Haibao Yu
Abstract summary: We propose CoopTrack, a fully instance-level end-to-end framework for cooperative tracking. CoopTrack features learnable instance association, which fundamentally differs from existing approaches. Experiments on both the V2X-Seq and Griffin datasets demonstrate that CoopTrack achieves excellent performance.
Score: 13.32869419720427
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Cooperative perception aims to address the inherent limitations of single-vehicle autonomous driving systems through information exchange among multiple agents. Previous research has primarily focused on single-frame perception tasks. However, the more challenging cooperative sequential perception tasks, such as cooperative 3D multi-object tracking, have not been thoroughly investigated. Therefore, we propose CoopTrack, a fully instance-level end-to-end framework for cooperative tracking, featuring learnable instance association, which fundamentally differs from existing approaches. CoopTrack transmits sparse instance-level features that significantly enhance perception capabilities while maintaining low transmission costs. Furthermore, the framework comprises two key components: Multi-Dimensional Feature Extraction, and Cross-Agent Association and Aggregation, which collectively enable comprehensive instance representation with semantic and motion features, and adaptive cross-agent association and fusion based on a feature graph. Experiments on both the V2X-Seq and Griffin datasets demonstrate that CoopTrack achieves excellent performance. Specifically, it attains state-of-the-art results on V2X-Seq, with 39.0% mAP and 32.8% AMOTA. The project is available at https://github.com/zhongjiaru/CoopTrack.

Related papers

What Makes Good Collaborative Views? Contrastive Mutual Information Maximization for Multi-Agent Perception [52.41695608928129]
Multi-agent perception (MAP) allows autonomous systems to understand complex environments by interpreting data from multiple sources. This paper investigates intermediate collaboration for MAP with a specific focus on exploring "good" properties of collaborative view. We propose a novel framework named CMiMC for intermediate collaboration.
arXiv Detail & Related papers (2024-03-15T07:18:55Z)
V2X-AHD: Vehicle-to-Everything Cooperation Perception via Asymmetric Heterogenous Distillation Network [13.248981195106069]
We propose a multi-view vehicle-road cooperation perception system, vehicle-to-everything cooperative perception (V2X-AHD). According to this study, V2X-AHD can effectively improve the accuracy of 3D object detection and reduce the number of network parameters.
arXiv Detail & Related papers (2023-10-10T13:12:03Z)
M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition. It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making. Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
Practical Collaborative Perception: A Framework for Asynchronous and Multi-Agent 3D Object Detection [9.967263440745432]
Occlusion is a major challenge for LiDAR-based object detection methods. State-of-the-art V2X methods resolve the performance-bandwidth tradeoff using a mid-collaboration approach. We devise a simple yet effective collaboration method that achieves a better bandwidth-performance tradeoff than prior methods.
arXiv Detail & Related papers (2023-07-04T03:49:42Z)
A Dynamic Feature Interaction Framework for Multi-task Visual Perception [100.98434079696268]
We devise an efficient unified framework to solve multiple common perception tasks. These tasks include instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation. Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception.
arXiv Detail & Related papers (2023-06-08T09:24:46Z)
VINet: Lightweight, Scalable, and Heterogeneous Cooperative Perception for 3D Object Detection [15.195933965761645]
Cooperative Perception (CP) has emerged to significantly advance the perception of automated driving. We introduce VINet, a unified deep learning-based CP network for scalable, lightweight, and heterogeneous cooperative 3D object detection. VINet can reduce system-level computational cost by 84% and system-level communication cost by 94% while improving 3D detection accuracy.
arXiv Detail & Related papers (2022-12-14T07:03:23Z)
A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [59.21990697929617]
Humans tend to mine objects by learning from a group of images or several frames of video since we live in a dynamic world. Previous approaches design different networks for similar tasks separately, and these networks are difficult to apply to one another. We introduce a unified framework to tackle these issues, termed UFO (Unified Framework for Co-Object Segmentation).
arXiv Detail & Related papers (2022-03-09T13:35:19Z)
CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video. Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects. We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
arXiv Detail & Related papers (2020-12-07T00:31:42Z)
LiDAR-based Panoptic Segmentation via Dynamic Shifting Network [56.71765153629892]
LiDAR-based panoptic segmentation aims to parse both objects and scenes in a unified manner. We propose the Dynamic Shifting Network (DS-Net), which serves as an effective panoptic segmentation framework in the point cloud realm. Our proposed DS-Net achieves superior accuracies over current state-of-the-art methods.
arXiv Detail & Related papers (2020-11-24T08:44:46Z)
End-to-End 3D Multi-Object Tracking and Trajectory Forecasting [34.68114553744956]
We propose a unified solution for 3D MOT and trajectory forecasting. We employ a feature interaction technique by introducing Graph Neural Networks. We also use a diversity sampling function to improve the quality and diversity of our forecasted trajectories.
arXiv Detail & Related papers (2020-08-25T16:54:46Z)
Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for multi-stage, coarse-to-fine HOI understanding. At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network. With our carefully designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.