TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking
- URL: http://arxiv.org/abs/2510.07134v1
- Date: Wed, 08 Oct 2025 15:29:17 GMT
- Title: TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking
- Authors: Jiahang Liu, Yunpeng Qi, Jiazhao Zhang, Minghan Li, Shaoan Wang, Kui Wu, Hanjing Ye, Hong Zhang, Zhibo Chen, Fangwei Zhong, Zhizheng Zhang, He Wang
- Abstract summary: We present TrackVLA++, a novel model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings.
- Score: 30.955088934475928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots, and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1 and 12, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.
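The abstract describes Polar-CoT and TIM only in functional terms. As a minimal sketch of what a compact polar-coordinate token and a gated memory update could look like: all function names, bin counts, distance edges, and the hard-gating rule below are illustrative assumptions, not details taken from the paper.

```python
import math

def polar_token(dx, dy, n_angle_bins=12, dist_edges=(0.5, 1.0, 2.0, 4.0)):
    """Encode a relative target position (dx, dy) in the agent frame
    as a single discrete polar-coordinate token id (illustrative)."""
    theta = math.atan2(dy, dx) % (2 * math.pi)                      # bearing in [0, 2*pi)
    a = int(theta / (2 * math.pi) * n_angle_bins) % n_angle_bins    # angle bin
    r = math.hypot(dx, dy)
    d = sum(r > e for e in dist_edges)                              # distance bin
    return a * (len(dist_edges) + 1) + d                            # flat token id

def gated_update(memory, observation, confidence, threshold=0.5):
    """Keep the stored target feature unchanged when the current
    observation is unreliable (e.g. the target is occluded)."""
    g = 1.0 if confidence >= threshold else 0.0
    return [g * o + (1.0 - g) * m for m, o in zip(memory, observation)]
```

With 12 angle bins and 4 distance edges, every relative position maps to one of 60 token ids, small enough to feed to an action head as a single token; the gate freezes the memory whenever recognition confidence drops, which is one plausible way to preserve long-horizon identity through occlusions.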
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z) - DMTrack: Deformable State-Space Modeling for UAV Multi-Object Tracking with Kalman Fusion and Uncertainty-Aware Association [18.68212724411998]
Multi-object tracking (MOT) from unmanned aerial vehicles (UAVs) presents unique challenges due to unpredictable object motion. We propose DMTrack, a deformable motion tracking framework tailored for UAV-based MOT. Our method operates without appearance models and maintains competitive efficiency, highlighting its practicality for robust UAV-based tracking.
arXiv Detail & Related papers (2025-10-15T13:54:25Z) - NoTVLA: Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation [54.87964060934928]
Vision-Language-Action (VLA) models confront critical barriers to real-world deployment, most notably catastrophic forgetting. We propose the Narrowing of Trajectory VLA (NoTVLA) framework: a novel approach that narrows its focus to sparse trajectories. NoTVLA achieves superior performance and generalization compared to pi0 while operating under two critical constraints.
arXiv Detail & Related papers (2025-10-04T18:26:55Z) - SocialTrack: Multi-Object Tracking in Complex Urban Traffic Scenes Inspired by Social Behavior [17.501890320034693]
This paper proposes a novel multi-object tracking framework, SocialTrack, to enhance the tracking accuracy and robustness of small targets in complex urban traffic environments. The specialized small-target detector enhances the detection performance by employing a multi-scale feature enhancement mechanism. Extensive experiments on the UAVDT and MOT17 datasets demonstrate that SocialTrack outperforms existing state-of-the-art (SOTA) methods across several key metrics.
arXiv Detail & Related papers (2025-08-18T09:53:32Z) - Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos [58.156141601478794]
Multi-object tracking in UAV-captured videos (UAVT) aims to track multiple objects while maintaining consistent identities across frames of a given video. Existing methods typically model motion cues and appearance separately, overlooking their interplay and resulting in suboptimal tracking performance. We propose AMOT, which exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module.
arXiv Detail & Related papers (2025-08-03T12:06:47Z) - NOVA: Navigation via Object-Centric Visual Autonomy for High-Speed Target Tracking in Unstructured GPS-Denied Environments [56.35569661650558]
We introduce NOVA, a fully onboard, object-centric framework that enables robust target tracking and collision-aware navigation. Rather than constructing a global map, NOVA formulates perception, estimation, and control entirely in the target's reference frame. We validate NOVA across challenging real-world scenarios, including urban mazes, forest trails, and repeated transitions through buildings with intermittent GPS loss.
arXiv Detail & Related papers (2025-06-23T14:28:30Z) - TrackVLA: Embodied Visual Tracking in the Wild [34.03604806748204]
Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. Existing approaches typically address this challenge through a modular separation of recognition and planning. We propose TrackVLA, a Vision-Language-Action model that learns the synergy between object recognition and trajectory planning.
arXiv Detail & Related papers (2025-05-29T07:28:09Z) - RTracker: Recoverable Tracking via PN Tree Structured Memory [71.05904715104411]
We propose a recoverable tracking framework, RTracker, that uses a tree-structured memory to dynamically associate a tracker and a detector to enable self-recovery.
Specifically, we propose a Positive-Negative Tree-structured memory to chronologically store and maintain positive and negative target samples.
Our core idea is to use the support samples of positive and negative target categories to establish a relative distance-based criterion for a reliable assessment of target loss.
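The relative distance-based criterion described above can be sketched as comparing a candidate feature's distance to its nearest positive support sample against its distance to the nearest negative one; the function names, Euclidean metric, and margin parameter are illustrative assumptions, not RTracker's actual formulation.

```python
def target_lost(candidate, positives, negatives, margin=0.0):
    """Declare target loss when the candidate feature lies closer to the
    negative support set than to the positive one (illustrative)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    d_pos = min(dist(candidate, p) for p in positives)
    d_neg = min(dist(candidate, n) for n in negatives)
    return d_pos > d_neg + margin
```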
arXiv Detail & Related papers (2024-03-28T08:54:40Z) - Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for Vision-Language Tracking [3.416427651955299]
Single object tracking aims to locate one specific target in video sequences, given its initial state. Vision-Language (VL) tracking has emerged as a promising approach.
We present a novel tracker that progressively explores target-centric semantics for VL tracking.
arXiv Detail & Related papers (2023-11-28T02:28:12Z) - BEVTrack: A Simple and Strong Baseline for 3D Single Object Tracking in Bird's-Eye View [54.48052449493636]
3D Single Object Tracking (SOT) is a fundamental task in computer vision and plays a critical role in applications like autonomous driving. We propose BEVTrack, a simple yet effective motion-based tracking method. We show that BEVTrack achieves state-of-the-art results while operating at 200 FPS, enabling real-time applicability.
arXiv Detail & Related papers (2023-09-05T12:42:26Z) - Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
arXiv Detail & Related papers (2023-08-27T13:17:34Z)
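MMTrack's idea of serializing a bounding box into a sequence of discrete tokens can be illustrated with a simple coordinate quantizer; the bin count and the (x1, y1, x2, y2) layout below are assumptions for illustration, not the paper's actual configuration.

```python
def bbox_to_tokens(box, img_w, img_h, n_bins=1000):
    """Quantize a pixel-space bounding box (x1, y1, x2, y2) into four
    discrete coordinate tokens, as in sequence-generation trackers
    (bin count and layout are illustrative)."""
    x1, y1, x2, y2 = box
    def q(v, size):
        # Map a coordinate in [0, size] to a bin id in [0, n_bins - 1].
        return min(n_bins - 1, int(v / size * n_bins))
    return [q(x1, img_w), q(y1, img_h), q(x2, img_w), q(y2, img_h)]
```

Under this scheme a language description and the four coordinate tokens share one discrete vocabulary, so a single autoregressive decoder can emit the target's location token by token.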