Distractor-Aware Memory-Based Visual Object Tracking
- URL: http://arxiv.org/abs/2509.13864v1
- Date: Wed, 17 Sep 2025 09:54:27 GMT
- Title: Distractor-Aware Memory-Based Visual Object Tracking
- Authors: Jovana Videnovic, Matej Kristan, Alan Lukezic,
- Abstract summary: We propose a distractor-aware drop-in memory module and introspection-based management method for SAM2.<n>Our design effectively reduces the tracking drift toward distractors and improves redetection capability after object occlusion.<n>We show DAM4SAM outperforms SAM2.1 on thirteen benchmarks and sets new state-of-the-art results on ten.
- Score: 17.945503249662675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent emergence of memory-based video segmentation methods such as SAM2 has led to models with excellent performance in segmentation tasks, achieving leading results on numerous benchmarks. However, these modes are not fully adjusted for visual object tracking, where distractors (i.e., objects visually similar to the target) pose a key challenge. In this paper we propose a distractor-aware drop-in memory module and introspection-based management method for SAM2, leading to DAM4SAM. Our design effectively reduces the tracking drift toward distractors and improves redetection capability after object occlusion. To facilitate the analysis of tracking in the presence of distractors, we construct DiDi, a Distractor-Distilled dataset. DAM4SAM outperforms SAM2.1 on thirteen benchmarks and sets new state-of-the-art results on ten. Furthermore, integrating the proposed distractor-aware memory into a real-time tracker EfficientTAM leads to 11% improvement and matches tracking quality of the non-real-time SAM2.1-L on multiple tracking and segmentation benchmarks, while integration with edge-based tracker EdgeTAM delivers 4% performance boost, demonstrating a very good generalization across architectures.
Related papers
- Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval [22.632907736085034]
Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks.<n>We propose Efficient-SAM2, which promotes SAM2 to adaptively focus on object regions while eliminating task-irrelevant computations.<n>With negligible additional parameters and minimal training overhead, Efficient-SAM2 delivers 1.68x speedup on SAM2.1-L model with only 1.0% accuracy drop on SA-V test set.
arXiv Detail & Related papers (2026-02-09T02:58:33Z) - Seg2Track-SAM2: SAM2-based Multi-object Tracking and Segmentation for Zero-shot Generalization [3.108551551357326]
Seg2Track-SAM2 is a framework that integrates pre-trained object detectors with SAM2 and a novel Seg2Track module.<n>Seg2Track-SAM2 achieves state-of-the-art (SOTA) performance, ranking fourth overall in both car and pedestrian classes on KITTI MOTS.<n>Results confirm that Seg2Track-SAM2 advances MOTS by combining robust zero-shot tracking, enhanced identity preservation, and efficient memory utilization.
arXiv Detail & Related papers (2025-09-15T10:52:27Z) - SAMITE: Position Prompted SAM2 with Calibrated Memory for Visual Object Tracking [58.35852822355312]
Visual Object Tracking (VOT) is widely used in applications like autonomous driving to continuously track targets in videos.<n>To address these issues, some methods propose to adapt the video foundation model SAM2 for VOT, where the tracking results of each frame would be encoded as memory for conditioning the rest of frames in an autoregressive manner.<n>We present a SAMITE model, built upon SAM2 with additional modules, to tackle these challenges.
arXiv Detail & Related papers (2025-07-29T12:11:56Z) - HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking [48.52769927193921]
This paper presents enhancements to the SAM2 framework for video object tracking task.<n>We introduce a hierarchical motion estimation strategy, combining lightweight linear prediction with selective non-linear refinement to improve tracking accuracy without requiring additional training.
arXiv Detail & Related papers (2025-07-10T10:05:11Z) - SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation [11.1906749425206]
Segment Anything 2 (SAM2) enables robust single-object tracking using segmentation.<n>We propose SAM2MOT, a novel Tracking by paradigm for multi-object tracking.<n> SAM2MOT directly generates tracking boxes from segmentation masks, reducing reliance on detection accuracy.
arXiv Detail & Related papers (2025-04-06T15:32:08Z) - A Distractor-Aware Memory for Visual Object Tracking with SAM2 [11.864619292028278]
Memory-based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames.<n> SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a solid new state-of-the-art on six of them.
arXiv Detail & Related papers (2024-11-26T16:41:09Z) - SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory [23.547018300192065]
This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking.<n>By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning.<n>In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT$_ext$ and a 3.5% AO gain on GOT-10k.
arXiv Detail & Related papers (2024-11-18T05:59:03Z) - TF-SASM: Training-free Spatial-aware Sparse Memory for Multi-object Tracking [6.91631684487121]
Multi-object tracking (MOT) in computer vision remains a significant challenge, requiring precise localization and continuous tracking of multiple objects in video sequences.
We propose a novel memory-based approach that selectively stores critical features based on object motion and overlapping awareness.
Our approach significantly improves over MOTRv2 in the DanceTrack test set, demonstrating a gain of 2.0% AssA score and 2.1% in IDF1 score.
arXiv Detail & Related papers (2024-07-05T07:55:19Z) - FocSAM: Delving Deeply into Focused Objects in Segmenting Anything [58.042354516491024]
The Segment Anything Model (SAM) marks a notable milestone in segmentation models.
We propose FocSAM with a pipeline redesigned on two pivotal aspects.
First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object.
Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks.
arXiv Detail & Related papers (2024-05-29T02:34:13Z) - RTracker: Recoverable Tracking via PN Tree Structured Memory [71.05904715104411]
We propose a recoverable tracking framework, RTracker, that uses a tree-structured memory to dynamically associate a tracker and a detector to enable self-recovery.
Specifically, we propose a Positive-Negative Tree-structured memory to chronologically store and maintain positive and negative target samples.
Our core idea is to use the support samples of positive and negative target categories to establish a relative distance-based criterion for a reliable assessment of target loss.
arXiv Detail & Related papers (2024-03-28T08:54:40Z) - Segment Anything Meets Point Tracking [116.44931239508578]
This paper presents a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking.
We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark.
Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions.
arXiv Detail & Related papers (2023-07-03T17:58:01Z) - Learning Dynamic Compact Memory Embedding for Deformable Visual Object
Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms the excellent segmentation-based trackers, i.e., D3S and SiamMask on DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z) - Online Multiple Object Tracking with Cross-Task Synergy [120.70085565030628]
We propose a novel unified model with synergy between position prediction and embedding association.
The two tasks are linked by temporal-aware target attention and distractor attention, as well as identity-aware memory aggregation model.
arXiv Detail & Related papers (2021-04-01T10:19:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.