TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking
- URL: http://arxiv.org/abs/2510.07134v1
- Date: Wed, 08 Oct 2025 15:29:17 GMT
- Title: TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking
- Authors: Jiahang Liu, Yunpeng Qi, Jiazhao Zhang, Minghan Li, Shaoan Wang, Kui Wu, Hanjing Ye, Hong Zhang, Zhibo Chen, Fangwei Zhong, Zhizheng Zhang, He Wang
- Abstract summary: We present TrackVLA++, a novel model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings.
- Score: 30.955088934475928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots, and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1 and 12, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.
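The abstract describes Polar-CoT and TIM only in functional terms. As a minimal sketch of what a compact polar-coordinate token and a gated memory update could look like: all function names, bin counts, distance edges, and the hard-gating rule below are illustrative assumptions, not details taken from the paper.

```python
import math

def polar_token(dx, dy, n_angle_bins=12, dist_edges=(0.5, 1.0, 2.0, 4.0)):
    """Encode a relative target position (dx, dy) in the agent frame
    as a single discrete polar-coordinate token id (illustrative)."""
    theta = math.atan2(dy, dx) % (2 * math.pi)                      # bearing in [0, 2*pi)
    a = int(theta / (2 * math.pi) * n_angle_bins) % n_angle_bins    # angle bin
    r = math.hypot(dx, dy)
    d = sum(r > e for e in dist_edges)                              # distance bin
    return a * (len(dist_edges) + 1) + d                            # flat token id

def gated_update(memory, observation, confidence, threshold=0.5):
    """Keep the stored target feature unchanged when the current
    observation is unreliable (e.g. the target is occluded)."""
    g = 1.0 if confidence >= threshold else 0.0
    return [g * o + (1.0 - g) * m for m, o in zip(memory, observation)]
```

With 12 angle bins and 4 distance edges, every relative position maps to one of 60 token ids, small enough to feed to an action head as a single token; the gate freezes the memory whenever recognition confidence drops, which is one plausible way to preserve long-horizon identity through occlusions.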
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z) - DMTrack: Deformable State-Space Modeling for UAV Multi-Object Tracking with Kalman Fusion and Uncertainty-Aware Association [18.68212724411998]
Multi-object tracking (MOT) from unmanned aerial vehicles (UAVs) presents unique challenges due to unpredictable object motion. We propose DMTrack, a deformable motion tracking framework tailored for UAV-based MOT. Our method operates without appearance models and maintains competitive efficiency, highlighting its practicality for robust UAV-based tracking.
arXiv Detail & Related papers (2025-10-15T13:54:25Z) - NoTVLA: Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation [54.87964060934928]
Vision-Language-Action (VLA) models confront critical barriers to real-world deployment, most notably catastrophic forgetting. We propose the Narrowing of Trajectory VLA (NoTVLA) framework: a novel approach that narrows its focus to sparse trajectories. NoTVLA achieves superior performance and generalization compared to pi0 while operating under two critical constraints.
arXiv Detail & Related papers (2025-10-04T18:26:55Z) - SocialTrack: Multi-Object Tracking in Complex Urban Traffic Scenes Inspired by Social Behavior [17.501890320034693]
This paper proposes a novel multi-object tracking framework, SocialTrack, to enhance the tracking accuracy and robustness of small targets in complex urban traffic environments. The specialized small-target detector enhances the detection performance by employing a multi-scale feature enhancement mechanism. Extensive experiments on the UAVDT and MOT17 datasets demonstrate that SocialTrack outperforms existing state-of-the-art (SOTA) methods across several key metrics.
arXiv Detail & Related papers (2025-08-18T09:53:32Z) - Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos [58.156141601478794]
Multi-object tracking in UAV-captured videos (UAVT) aims to track multiple objects while maintaining consistent identities across frames of a given video. Existing methods typically model motion cues and appearance separately, overlooking their interplay and resulting in suboptimal tracking performance. We propose AMOT, which exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module.
arXiv Detail & Related papers (2025-08-03T12:06:47Z) - NOVA: Navigation via Object-Centric Visual Autonomy for High-Speed Target Tracking in Unstructured GPS-Denied Environments [56.35569661650558]
We introduce NOVA, a fully onboard, object-centric framework that enables robust target tracking and collision-aware navigation. Rather than constructing a global map, NOVA formulates perception, estimation, and control entirely in the target's reference frame. We validate NOVA across challenging real-world scenarios, including urban mazes, forest trails, and repeated transitions through buildings with intermittent GPS loss.
arXiv Detail & Related papers (2025-06-23T14:28:30Z) - TrackVLA: Embodied Visual Tracking in the Wild [34.03604806748204]
Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. Existing approaches typically address this challenge through a modular separation of recognition and planning. We propose TrackVLA, a Vision-Language-Action model that learns the synergy between object recognition and trajectory planning.
arXiv Detail & Related papers (2025-05-29T07:28:09Z) - RTracker: Recoverable Tracking via PN Tree Structured Memory [71.05904715104411]
We propose a recoverable tracking framework, RTracker, that uses a tree-structured memory to dynamically associate a tracker and a detector to enable self-recovery.
Specifically, we propose a Positive-Negative Tree-structured memory to chronologically store and maintain positive and negative target samples.
Our core idea is to use the support samples of positive and negative target categories to establish a relative distance-based criterion for a reliable assessment of target loss.
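The relative distance-based criterion described above can be sketched as comparing a candidate feature's distance to its nearest positive support sample against its distance to the nearest negative one; the function names, Euclidean metric, and margin parameter are illustrative assumptions, not RTracker's actual formulation.

```python
def target_lost(candidate, positives, negatives, margin=0.0):
    """Declare target loss when the candidate feature lies closer to the
    negative support set than to the positive one (illustrative)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    d_pos = min(dist(candidate, p) for p in positives)
    d_neg = min(dist(candidate, n) for n in negatives)
    return d_pos > d_neg + margin
```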
arXiv Detail & Related papers (2024-03-28T08:54:40Z) - Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for Vision-Language Tracking [3.416427651955299]
Single object tracking aims to locate one specific target in video sequences, given its initial state. Vision-Language (VL) tracking has emerged as a promising approach.
We present a novel tracker that progressively explores target-centric semantics for VL tracking.
arXiv Detail & Related papers (2023-11-28T02:28:12Z) - BEVTrack: A Simple and Strong Baseline for 3D Single Object Tracking in Bird's-Eye View [54.48052449493636]
3D Single Object Tracking (SOT) is a fundamental task in computer vision and plays a critical role in applications like autonomous driving. We propose BEVTrack, a simple yet effective motion-based tracking method. We show that BEVTrack achieves state-of-the-art results while operating at 200 FPS, enabling real-time applicability.
arXiv Detail & Related papers (2023-09-05T12:42:26Z) - Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
arXiv Detail & Related papers (2023-08-27T13:17:34Z)
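MMTrack's idea of serializing a bounding box into a sequence of discrete tokens can be illustrated with a simple coordinate quantizer; the bin count and the (x1, y1, x2, y2) layout below are assumptions for illustration, not the paper's actual configuration.

```python
def bbox_to_tokens(box, img_w, img_h, n_bins=1000):
    """Quantize a pixel-space bounding box (x1, y1, x2, y2) into four
    discrete coordinate tokens, as in sequence-generation trackers
    (bin count and layout are illustrative)."""
    x1, y1, x2, y2 = box
    def q(v, size):
        # Map a coordinate in [0, size] to a bin id in [0, n_bins - 1].
        return min(n_bins - 1, int(v / size * n_bins))
    return [q(x1, img_w), q(y1, img_h), q(x2, img_w), q(y2, img_h)]
```

Under this scheme a language description and the four coordinate tokens share one discrete vocabulary, so a single autoregressive decoder can emit the target's location token by token.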