CollaMamba: Efficient Collaborative Perception with Cross-Agent Spatial-Temporal State Space Model
- URL: http://arxiv.org/abs/2409.07714v3
- Date: Tue, 5 Nov 2024 02:59:08 GMT
- Title: CollaMamba: Efficient Collaborative Perception with Cross-Agent Spatial-Temporal State Space Model
- Authors: Yang Li, Quan Yuan, Guiyang Luo, Xiaoyuan Fu, Xuanhan Zhu, Yujia Yang, Rui Pan, Jinglin Li
- Abstract summary: Multi-agent collaborative perception fosters a deeper understanding of the environment.
Recent studies on collaborative perception mostly utilize CNNs or Transformers to learn feature representation and fusion in the spatial dimension.
We propose a resource-efficient cross-agent spatial-temporal collaborative state space model (SSM), named CollaMamba.
- Score: 12.461378793357705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: By sharing complementary perceptual information, multi-agent collaborative perception fosters a deeper understanding of the environment. Recent studies on collaborative perception mostly utilize CNNs or Transformers to learn feature representation and fusion in the spatial dimension, which struggle to handle long-range spatial-temporal features under limited computing and communication resources. Holistically modeling the dependencies over extensive spatial areas and extended temporal frames is crucial to enhancing feature quality. To this end, we propose a resource-efficient cross-agent spatial-temporal collaborative state space model (SSM), named CollaMamba. Initially, we construct a foundational backbone network based on spatial SSM. This backbone adeptly captures positional causal dependencies from both single-agent and cross-agent views, yielding compact and comprehensive intermediate features while maintaining linear complexity. Furthermore, we devise a history-aware feature boosting module based on temporal SSM, extracting contextual cues from extended historical frames to refine vague features while preserving low overhead. Extensive experiments across several datasets demonstrate that CollaMamba outperforms state-of-the-art methods, achieving higher model accuracy while reducing computational overhead by up to 71.9% and communication overhead to 1/64, respectively. This work pioneers the exploration of Mamba's potential in collaborative perception. The source code will be made available.
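To make the linear-complexity claim concrete, the following is a minimal NumPy sketch of the selective-scan recurrence that Mamba-style SSM blocks are built on. It is an illustration only, not the authors' implementation: the shapes, the parameterization, and the interpretation of the sequence as a spatial scan over feature tokens are all assumptions.

```python
import numpy as np

def selective_scan(x, A, B, C, delta):
    """Minimal selective state-space scan (the recurrence behind
    Mamba-style SSM blocks). Each step costs O(D*N), so a length-T
    scan is O(T) in time and memory -- the linear complexity the
    abstract contrasts with attention-style fusion.

    x:     (T, D) input sequence, e.g. a spatial scan of BEV tokens
    A:     (D, N) state-transition parameters (negative for stability)
    B, C:  (T, N) input-dependent projections (the "selective" part)
    delta: (T, D) input-dependent step sizes
    """
    T, D = x.shape
    h = np.zeros((D, A.shape[1]))               # hidden state per channel
    y = np.empty_like(x)
    for t in range(T):
        dA = np.exp(delta[t][:, None] * A)      # discretized state decay
        dB = delta[t][:, None] * B[t][None, :]  # discretized input gain
        h = dA * h + dB * x[t][:, None]         # recurrent state update
        y[t] = (h * C[t][None, :]).sum(axis=1)  # project state to output
    return y

# Illustrative usage on random features.
rng = np.random.default_rng(0)
T, D, N = 256, 32, 8
x = rng.standard_normal((T, D))
A = -np.exp(rng.standard_normal((D, N)))        # negative => stable scan
B, C = rng.standard_normal((T, N)), rng.standard_normal((T, N))
delta = 0.1 * np.abs(rng.standard_normal((T, D)))
y = selective_scan(x, A, B, C, delta)           # shape (256, 32)
```

On this reading, the spatial backbone would run such a scan over single-agent and cross-agent feature tokens, and the history-aware boosting module a temporal variant over past frames; the sketch only shows the shared recurrence.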
Related papers
- Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting [16.782154479264126]
Predicting spatio-temporal traffic flow presents challenges due to complex interactions between spatial and temporal factors.
Existing approaches address these dimensions in isolation, neglecting their critical interdependencies.
In this paper, we introduce the Spatio-Temporal Unitized Cell (ASTUC), a unified framework designed to capture both spatial and temporal dependencies.
arXiv Detail & Related papers (2024-11-14T07:34:31Z) - PoseMamba: Monocular 3D Human Pose Estimation with Bidirectional Global-Local Spatio-Temporal State Space Model [7.286873011001679]
We propose a purely SSM-based approach with linear complexity for 3D human pose estimation in monocular video.
Specifically, we propose a bidirectional global-local spatio-temporal block that comprehensively models human joint relations within individual frames as well as across frames.
This design provides a more logical geometric scanning order, resulting in a combined global-local spatial scan.
arXiv Detail & Related papers (2024-08-07T04:38:03Z) - Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z) - Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction [106.06256351200068]
This paper introduces a model learning framework with auxiliary tasks.
In our auxiliary tasks, partial body joints' coordinates are corrupted by either masking or adding noise.
We propose a novel auxiliary-adapted transformer, which can handle incomplete, corrupted motion data.
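As a purely illustrative reading of that corruption scheme, the sketch below masks or perturbs a random subset of joint coordinates; the corruption ratio, noise scale, and mask/noise split are assumptions, not the paper's settings.

```python
import numpy as np

def corrupt_joints(pose, corrupt_ratio=0.2, noise_std=0.05, rng=None):
    """Corrupt part of a pose sequence for auxiliary-task training
    (a sketch of the masking/noising described above).

    pose: (T, J, 3) array of T frames, J joints, 3D coordinates.
    Returns the corrupted copy and the boolean mask of corrupted joints.
    """
    rng = rng or np.random.default_rng()
    T, J, _ = pose.shape
    corrupted = pose.copy()
    target = rng.random((T, J)) < corrupt_ratio   # joints to corrupt
    use_noise = rng.random((T, J)) < 0.5          # noise vs. zero-mask
    corrupted[target & ~use_noise] = 0.0          # masked joints
    noisy = target & use_noise
    corrupted[noisy] += rng.normal(0.0, noise_std, size=(noisy.sum(), 3))
    return corrupted, target
```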
arXiv Detail & Related papers (2023-08-17T12:26:11Z) - Scalable Multi-agent Covering Option Discovery based on Kronecker Graphs [49.71319907864573]
In this paper, we propose a multi-agent skill discovery method built on a decomposition of the joint state space.
Our key idea is to approximate the joint state space as a Kronecker graph, based on which we can directly estimate its Fiedler vector.
Considering that directly computing the Laplacian spectrum is intractable for tasks with infinite-scale state spaces, we further propose a deep learning extension of our method.
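The Kronecker shortcut can be illustrated with a small NumPy example. Under the simplifying assumption that both factor graphs are regular (the general case is what the paper's deep-learning extension addresses), the Laplacian of the Kronecker product graph is d1*d2*I - kron(A1, A2), so its Fiedler vector is a Kronecker product of factor eigenvectors and never requires decomposing the joint space:

```python
import numpy as np

def cycle_adjacency(n):
    """Adjacency matrix of an n-node cycle graph (2-regular)."""
    A = np.zeros((n, n))
    idx = np.arange(n)
    A[idx, (idx + 1) % n] = 1.0
    A[idx, (idx - 1) % n] = 1.0
    return A

# Two small regular "factor" graphs standing in for per-agent state spaces.
A1, A2 = cycle_adjacency(5), cycle_adjacency(4)
d1 = d2 = 2                                     # cycles are 2-regular

# Eigen-decompose only the small factors, never the joint space.
w1, V1 = np.linalg.eigh(A1)
w2, V2 = np.linalg.eigh(A2)

# Every kron(v_i, u_j) is a Laplacian eigenvector of the product graph
# with eigenvalue d1*d2 - w1[i]*w2[j]; the Fiedler vector corresponds to
# the second-largest product of adjacency eigenvalues.
prods = np.outer(w1, w2)
i, j = np.unravel_index(np.argsort(prods, axis=None)[-2], prods.shape)
fiedler = np.kron(V1[:, i], V2[:, j])

# Brute-force check on the full 20-node product graph.
L = d1 * d2 * np.eye(20) - np.kron(A1, A2)
lam2 = np.linalg.eigvalsh(L)[1]                 # 2nd-smallest eigenvalue
assert np.allclose(L @ fiedler, lam2 * fiedler)
```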
arXiv Detail & Related papers (2023-07-21T14:53:12Z) - Spatial-Temporal Graph Convolutional Gated Recurrent Network for Traffic Forecasting [3.9761027576939414]
We propose a novel framework for traffic forecasting, named Spatial-Temporal Graph Convolutional Gated Recurrent Network (STGCGRN).
We design an attention module to capture long-term dependency by mining periodic information in traffic data.
Experiments on four datasets demonstrate the superior performance of our model.
arXiv Detail & Related papers (2022-10-06T08:02:20Z) - LSTA-Net: Long Short-term Spatio-Temporal Aggregation Network for Skeleton-based Action Recognition [14.078419675904446]
LSTA-Net: a novel long/short-term spatio-temporal aggregation network.
Long/short-term temporal information is not well explored in existing works.
Experiments were conducted on three public benchmark datasets.
arXiv Detail & Related papers (2021-11-01T10:53:35Z) - Temporal Memory Relation Network for Workflow Recognition from Surgical Video [53.20825496640025]
We propose a novel end-to-end temporal memory relation network (TMRNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
arXiv Detail & Related papers (2021-03-30T13:20:26Z) - ORDNet: Capturing Omni-Range Dependencies for Scene Parsing [135.11360962062957]
We build an Omni-Range Dependencies Network (ORDNet) which can effectively capture short-, middle- and long-range dependencies.
Our ORDNet is able to extract more comprehensive context information and well adapt to complex spatial variance in scene images.
arXiv Detail & Related papers (2021-01-11T14:51:11Z) - A Spatial-Temporal Attentive Network with Spatial Continuity for Trajectory Prediction [74.00750936752418]
We propose a novel model named spatial-temporal attentive network with spatial continuity (STAN-SC).
First, a spatial-temporal attention mechanism is presented to explore the most useful and important information.
Second, we construct a joint feature sequence from sequence and instant state information so that the generated trajectories maintain spatial continuity.
arXiv Detail & Related papers (2020-03-13T04:35:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.