Related papers: MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking

MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking

URL: http://arxiv.org/abs/2411.15459v1
Date: Sat, 23 Nov 2024 05:31:58 GMT
Title: MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking
Authors: Xinqi Liu, Li Zhou, Zikun Zhou, Jianqiu Chen, Zhenyu He,
Abstract summary: We propose a Mamba-based vision-language tracking model to exploit its state space evolving ability in temporal space for robust multimodal tracking. In particular, our approach mainly integrates a time-evolving hybrid state space block and a selective locality enhancement block, to capture contextual information. Our method performs favorably against state-of-the-art trackers across diverse benchmarks.
Score: 8.696516368633143
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The vision-language tracking task aims to perform object tracking based on various modality references. Existing Transformer-based vision-language tracking methods have made remarkable progress by leveraging the global modeling ability of self-attention. However, current approaches still face challenges in effectively exploiting the temporal information and dynamically updating reference features during tracking. Recently, the State Space Model (SSM), known as Mamba, has shown astonishing ability in efficient long-sequence modeling. Particularly, its state space evolving process demonstrates promising capabilities in memorizing multimodal temporal information with linear complexity. Witnessing its success, we propose a Mamba-based vision-language tracking model to exploit its state space evolving ability in temporal space for robust multimodal tracking, dubbed MambaVLT. In particular, our approach mainly integrates a time-evolving hybrid state space block and a selective locality enhancement block, to capture contextual information for multimodal modeling and adaptive reference feature update. Besides, we introduce a modality-selection module that dynamically adjusts the weighting between visual and language references, mitigating potential ambiguities from either reference type. Extensive experimental results show that our method performs favorably against state-of-the-art trackers across diverse benchmarks.

Related papers

GENMO: A GENeralist Model for Human MOtion [64.16188966024542]
We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework.<n>Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals.<n>Our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control.
arXiv Detail & Related papers (2025-05-02T17:59:55Z)
UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce bfUnistage, a unified Transformer-based framework fortemporal modeling. Our work demonstrates that a task-specific vision-text can build a generalizable model fortemporal learning. We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z)
Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking [53.33637391723555]
We propose a unified multimodal spatial-temporal tracking approach named STTrack. In contrast to previous paradigms, we introduced a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information. These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target.
arXiv Detail & Related papers (2024-12-20T09:10:17Z)
Context-Enhanced Multi-View Trajectory Representation Learning: Bridging the Gap through Self-Supervised Models [27.316692263196277]
MVTraj is a novel multi-view modeling method for trajectory representation learning. It integrates diverse contextual knowledge, from GPS to road network and points-of-interest to provide a more comprehensive understanding of trajectory data. Extensive experiments on real-world datasets demonstrate that MVTraj significantly outperforms existing baselines in tasks associated with various spatial views.
arXiv Detail & Related papers (2024-10-17T03:56:12Z)
MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking [51.28485682954006]
We propose a pure Mamba-based framework (MambaVT) to fully exploit intrinsic-temporal contextual modeling for robust visible-thermal tracking. Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations. Experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks.
arXiv Detail & Related papers (2024-08-15T02:29:00Z)
Reliable Object Tracking by Multimodal Hybrid Feature Extraction and Transformer-Based Fusion [18.138433117711177]
We propose a novel multimodal hybrid tracker (MMHT) that utilizes frame-event-based data for reliable single object tracking. The MMHT model employs a hybrid backbone consisting of an artificial neural network (ANN) and a spiking neural network (SNN) to extract dominant features from different visual modalities. Extensive experiments demonstrate that the MMHT model exhibits competitive performance in comparison with other state-of-the-art methods.
arXiv Detail & Related papers (2024-05-28T07:24:56Z)
Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs. Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction. We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
Multi-Modality Spatio-Temporal Forecasting via Self-Supervised Learning [11.19088022423885]
We propose a novel MoST learning framework via Self-Supervised Learning, namely MoSSL. Results on two real-world MoST datasets verify the superiority of our approach compared with the state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-06T08:24:06Z)
Revisiting Multi-modal Emotion Learning with Broad State Space Models and Probability-guidance Fusion [14.14051929942914]
We argue that long-range contextual semantic information should be extracted in the feature disentanglement stage and the inter-modal semantic information consistency should be maximized in the feature fusion stage. Inspired by recent State Space Models (SSMs), we propose a Broad Mamba, which does not rely on a self-attention mechanism for sequence modeling. We show that the proposed method can overcome the computational and memory limitations of Transformer when modeling long-distance contexts.
arXiv Detail & Related papers (2024-04-27T10:22:03Z)
Veagle: Advancements in Multimodal Representation Learning [0.0]
This paper introduces a novel approach to enhance the multimodal capabilities of existing models. Our proposed model Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works. Our results indicate a improvement of 5-6 % in performance, with Veagle outperforming existing models by a notable margin.
arXiv Detail & Related papers (2024-01-18T12:45:25Z)
Spatio-Temporal Branching for Motion Prediction using Motion Increments [55.68088298632865]
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications. Traditional methods rely on hand-crafted features and machine learning techniques. We propose a noveltemporal-temporal branching network using incremental information for HMP.
arXiv Detail & Related papers (2023-08-02T12:04:28Z)
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-RValModal. We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
Learning Temporal Dynamics from Cycles in Narrated Video [85.89096034281694]
We propose a self-supervised solution to the problem of learning to model how the world changes as time elapses. Our model learns modality-agnostic functions to predict forward and backward in time, which must undo each other when composed. We apply the learned dynamics model without further training to various tasks, such as predicting future action and temporally ordering sets of images.
arXiv Detail & Related papers (2021-01-07T02:41:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.