HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning
- URL: http://arxiv.org/abs/2505.15703v1
- Date: Wed, 21 May 2025 16:16:52 GMT
- Title: HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning
- Authors: Xiaodong Mei, Sheng Wang, Jie Cheng, Yingbing Chen, Dan Xu
- Abstract summary: We propose HAMF, a novel motion forecasting framework that learns future motion representations jointly with the scene context encoding. We show that our hybrid Attention-Mamba model achieves state-of-the-art motion forecasting performance with a simple and lightweight architecture.
- Score: 12.568968115955865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motion forecasting represents a critical challenge in autonomous driving systems, requiring accurate prediction of surrounding agents' future trajectories. While existing approaches predict future motion states from scene context features extracted from historical agent trajectories and road layouts, they suffer from information degradation during scene feature encoding. To address this limitation, we propose HAMF, a novel motion forecasting framework that learns future motion representations jointly with the scene context encoding, coherently combining scene understanding and future motion state prediction. We first embed the observed agent states and map information into 1D token sequences, together with the target multi-modal future motion features as a set of learnable tokens. We then design a unified attention-based encoder, which synergistically combines self-attention and cross-attention mechanisms to jointly model the scene context and aggregate future motion features. Complementing the encoder, we employ a Mamba module in the decoding stage to further preserve the consistency and correlations among the learned future motion representations, generating accurate and diverse final trajectories. Extensive experiments on the Argoverse 2 benchmark demonstrate that our hybrid Attention-Mamba model achieves state-of-the-art motion forecasting performance with a simple and lightweight architecture.
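The abstract's two-stage design can be illustrated with a minimal numpy sketch: learnable motion-mode tokens aggregate scene tokens via cross-attention (the encoder stage), and a simple linear recurrence then mixes information across the mode tokens (the decoder stage). This is a toy illustration of the general idea, not the authors' implementation; the `ssm_scan` recurrence merely stands in for a real Mamba selective state-space block, and all dimensions and names are invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # Scaled dot-product attention: motion queries attend to scene tokens.
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def ssm_scan(tokens, decay=0.9):
    # Toy linear state-space recurrence standing in for a Mamba block:
    # each mode token mixes in a running state of the preceding tokens,
    # preserving correlations across the learned motion representations.
    state = np.zeros(tokens.shape[1])
    out = []
    for t in tokens:
        state = decay * state + (1 - decay) * t
        out.append(t + state)
    return np.stack(out)

rng = np.random.default_rng(0)
d = 16
scene_tokens = rng.standard_normal((10, d))   # agents + map embedded as 1D tokens
motion_tokens = rng.standard_normal((6, d))   # K learnable multi-modal query tokens

fused = cross_attention(motion_tokens, scene_tokens, d)  # encoder stage
refined = ssm_scan(fused)                                # decoder stage
print(refined.shape)  # (6, 16)
```

Each of the six refined tokens would then be decoded into one candidate trajectory by a small output head (omitted here).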
Related papers
- GC-GAT: Multimodal Vehicular Trajectory Prediction using Graph Goal Conditioning and Cross-context Attention
We present a lane graph-based motion prediction model that first predicts graph-based goal proposals and later fuses them with cross-attention over multiple contextual elements. We evaluate our work on the nuScenes motion prediction dataset, achieving state-of-the-art results.
arXiv Detail & Related papers (2025-04-15T12:53:07Z) - Future-Aware Interaction Network For Motion Forecasting [10.211526610529374]
We propose an interaction-based method, named Future-Aware Interaction Network, that introduces potential future trajectories into scene encoding. To adapt Mamba for spatial interaction modeling, we propose an adaptive reordering strategy that transforms unordered data into a structured sequence. Mamba is also employed to refine the generated future trajectories temporally, ensuring more consistent predictions.
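The "adaptive reordering strategy" above addresses a basic mismatch: agents in a scene have no natural order, while a state-space scan consumes a sequence. A minimal sketch of one plausible reordering rule (serialize agents by increasing distance to the ego vehicle) is shown below; the actual strategy in the paper may differ, and the function name and rule here are illustrative assumptions.

```python
import math

def reorder_by_distance(agents, ego):
    # Hypothetical reordering sketch: serialize unordered agent positions
    # into a sequence by increasing distance to the ego vehicle, so that
    # a sequence model (e.g. a Mamba-style scan) can consume them.
    return sorted(agents, key=lambda p: math.dist(p, ego))

agents = [(5.0, 1.0), (1.0, 0.0), (3.0, 4.0)]
ego = (0.0, 0.0)
print(reorder_by_distance(agents, ego))
# → [(1.0, 0.0), (3.0, 4.0), (5.0, 1.0)]
```

Any deterministic, scene-dependent ordering works as long as nearby (strongly interacting) agents end up close together in the sequence.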
arXiv Detail & Related papers (2025-03-09T11:38:34Z) - Flow-guided Motion Prediction with Semantics and Dynamic Occupancy Grid Maps [5.9803668726235575]
Occupancy Grid Maps (OGMs) are commonly employed for scene prediction.
Recent studies have successfully combined OGMs with deep learning methods to predict the evolution of the scene.
We propose a novel multi-task framework that leverages dynamic OGMs and semantic information to predict both future vehicle semantic grids and the future flow of the scene.
arXiv Detail & Related papers (2024-07-22T14:42:34Z) - MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z) - AMP: Autoregressive Motion Prediction Revisited with Next Token Prediction for Autonomous Driving [59.94343412438211]
We introduce the GPT style next token motion prediction into motion prediction.
Different from language data, which is composed of homogeneous units (words), the elements in a driving scene can have complex spatial-temporal and semantic relations.
We propose to adopt three factorized attention modules with different neighbors for information aggregation and different position encoding styles to capture their relations.
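Factorized attention of this kind can be sketched as self-attention restricted to different neighbor sets via masks, applied one factor at a time. The sketch below shows two hypothetical factors (temporal: same agent across time; social: other agents at the same timestep); the paper's actual three modules and position encodings are not reproduced here, and all masks and shapes are invented for illustration.

```python
import numpy as np

def masked_self_attention(x, mask):
    # Self-attention where token i may only attend to token j if mask[i, j] is True.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # block non-neighbors
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))  # tokens 0-1: agent A over time; 2-3: agent B

# Factor 1: temporal neighbors (same agent, causal in time).
temporal = np.array([[1, 0, 0, 0],
                     [1, 1, 0, 0],
                     [0, 0, 1, 0],
                     [0, 0, 1, 1]], dtype=bool)
# Factor 2: social neighbors (same timestep, across agents).
social = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1]], dtype=bool)

h = masked_self_attention(x, temporal)
h = masked_self_attention(h, social)
print(h.shape)  # (4, 8)
```

Each factor aggregates information along one relation only, which keeps the attention pattern sparse and interpretable compared with full attention over all tokens.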
arXiv Detail & Related papers (2024-03-20T06:22:37Z) - FIMP: Future Interaction Modeling for Multi-Agent Motion Prediction [18.10147252674138]
We propose Future Interaction modeling for Motion Prediction (FIMP), which captures potential future interactions in an end-to-end manner.
Experiments show that our future interaction modeling improves performance remarkably, leading to superior results on the Argoverse motion forecasting benchmark.
arXiv Detail & Related papers (2024-01-29T14:41:55Z) - MTR++: Multi-Agent Motion Prediction with Symmetric Scene Modeling and
Guided Intention Querying [110.83590008788745]
Motion prediction is crucial for autonomous driving systems to understand complex driving scenarios and make informed decisions.
In this paper, we propose Motion TRansformer (MTR) frameworks to address these challenges.
The initial MTR framework utilizes a transformer encoder-decoder structure with learnable intention queries.
We introduce an advanced MTR++ framework, extending the capability of MTR to simultaneously predict multimodal motion for multiple agents.
arXiv Detail & Related papers (2023-06-30T16:23:04Z) - Decoder Fusion RNN: Context and Interaction Aware Decoders for
Trajectory Prediction [53.473846742702854]
We propose a recurrent, attention-based approach for motion forecasting.
Decoder Fusion RNN (DF-RNN) is composed of a recurrent behavior encoder, an inter-agent multi-headed attention module, and a context-aware decoder.
We demonstrate the efficacy of our method by testing it on the Argoverse motion forecasting dataset, showing state-of-the-art performance on the public benchmark.
arXiv Detail & Related papers (2021-08-12T15:53:37Z) - Learning Semantic-Aware Dynamics for Video Prediction [68.04359321855702]
We propose an architecture and training scheme to predict video frames by explicitly modeling dis-occlusions.
The appearance of the scene is warped from past frames using the predicted motion in co-visible regions.
arXiv Detail & Related papers (2021-04-20T05:00:24Z) - Implicit Latent Variable Model for Scene-Consistent Motion Forecasting [78.74510891099395]
In this paper, we aim to learn scene-consistent motion forecasts of complex urban traffic directly from sensor data.
We model the scene as an interaction graph and employ powerful graph neural networks to learn a distributed latent representation of the scene.
arXiv Detail & Related papers (2020-07-23T14:31:25Z)