Related papers: GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

URL: http://arxiv.org/abs/2602.14771v1
Date: Mon, 16 Feb 2026 14:26:07 GMT
Title: GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture
Authors: Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin
Abstract summary: We propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. We further propose OccuSolver to enhance occlusion perception for object tracking.
Score: 27.70912792107499
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.

Related papers

MASAR: Motion-Appearance Synergy Refinement for Joint Detection and Trajectory Forecasting [2.681087131751672]
MASAR is a novel framework for joint 3D detection and trajectory forecasting, compatible with any transformer-based 3D detector. By predicting past trajectories and refining them using guidance from appearance cues, MASAR captures long-term temporal dependencies that enhance future trajectory forecasting.
arXiv Detail & Related papers (2026-02-13T15:11:50Z)
Visual Autoregressive Modelling for Monocular Depth Estimation [69.01449528371916]
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets.
arXiv Detail & Related papers (2025-12-27T17:08:03Z)
Generative Point Tracking with Flow Matching [32.15342097497571]
We introduce Generative Point Tracker (GenPT), a generative framework for modelling multi-modal trajectories. GenPT is trained with a novel flow matching formulation that combines the iterative refinement of discriminative trackers. We show how our model's generative capabilities can be leveraged to improve point trajectory estimates.
arXiv Detail & Related papers (2025-10-23T19:25:14Z)
SynCast: Synergizing Contradictions in Precipitation Nowcasting via Diffusion Sequential Preference Optimization [62.958457694151384]
We introduce preference optimization into precipitation nowcasting for the first time, motivated by the success of reinforcement learning from human feedback in large language models. In the first stage, the framework focuses on reducing FAR, training the model to effectively suppress false alarms.
arXiv Detail & Related papers (2025-10-22T16:11:22Z)
ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving [64.42138266293202]
ResAD is a Normalized Residual Trajectory Modeling framework. It reframes the learning task to predict the residual deviation from an inertial reference. On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy.
arXiv Detail & Related papers (2025-10-09T17:59:36Z)
Head Anchor Enhanced Detection and Association for Crowded Pedestrian Tracking [8.653608112604472]
The proposed method incorporates detection features from both the regression and classification branches of an object detector. In terms of motion modeling, we propose an iterative Kalman filtering approach designed to align with modern detector assumptions.
arXiv Detail & Related papers (2025-08-07T15:47:34Z)
RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z)
Self-supervised Multi-future Occupancy Forecasting for Autonomous Driving [31.995016095663544]
LiDAR-generated occupancy grid maps (L-OGMs) offer a robust bird's-eye view for scene representation. Our proposed framework, Latent Occupancy Prediction (LOPR), performs L-OGM prediction in the latent space of a generative architecture.
arXiv Detail & Related papers (2024-07-30T18:37:59Z)
RTracker: Recoverable Tracking via PN Tree Structured Memory [71.05904715104411]
We propose a recoverable tracking framework, RTracker, that uses a tree-structured memory to dynamically associate a tracker and a detector to enable self-recovery. Specifically, we propose a Positive-Negative Tree-structured memory to chronologically store and maintain positive and negative target samples. Our core idea is to use the support samples of positive and negative target categories to establish a relative distance-based criterion for a reliable assessment of target loss.
arXiv Detail & Related papers (2024-03-28T08:54:40Z)
LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry [53.5449912019877]
We present the Long-term Effective Any Point Tracking (LEAP) module. LEAP innovatively combines visual, inter-track, and temporal cues with mindfully selected anchors for dynamic track estimation. Based on these traits, we develop LEAP-VO, a robust visual odometry system adept at handling occlusions and dynamic scenes.
arXiv Detail & Related papers (2024-01-03T18:57:27Z)
STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model [0.0]
A self-supervised model is proposed that simultaneously predicts a sequence of future frames from video input using a spatial-temporal attention network. The proposed model leverages prior scene knowledge, such as object shape and texture, similar to single-image depth inference methods. It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
arXiv Detail & Related papers (2023-03-02T12:22:51Z)
PnPNet: End-to-End Perception and Prediction with Tracking in the Loop [82.97006521937101]
We tackle the problem of joint perception and motion forecasting in the context of self-driving vehicles. We propose PnPNet, an end-to-end model that takes sensor data as input and outputs, at each time step, object tracks and their future trajectories.
arXiv Detail & Related papers (2020-05-29T17:57:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.