A Distractor-Aware Memory for Visual Object Tracking with SAM2
- URL: http://arxiv.org/abs/2411.17576v1
- Date: Tue, 26 Nov 2024 16:41:09 GMT
- Title: A Distractor-Aware Memory for Visual Object Tracking with SAM2
- Authors: Jovana Videnovic, Alan Lukezic, Matej Kristan
- Abstract summary: Memory-based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames.
SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a solid new state-of-the-art on six of them.
- Score: 11.864619292028278
- Abstract: Memory-based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames. While already achieving top performance on many benchmarks, it was the recent release of SAM2 that placed memory-based trackers into the focus of the visual object tracking community. Nevertheless, modern trackers still struggle in the presence of distractors. We argue that a more sophisticated memory model is required, and propose a new distractor-aware memory model for SAM2 and an introspection-based update strategy that jointly addresses segmentation accuracy as well as tracking robustness. The resulting tracker is denoted SAM2.1++. We also propose a new distractor-distilled DiDi dataset to better study the distractor problem. SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a solid new state-of-the-art on six of them.
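To make the mechanism concrete, the following is a minimal sketch of the memory readout these trackers build on: features of the current frame attend to features buffered from past frames. Shapes and names are illustrative assumptions, not SAM2's actual implementation.

```python
# Minimal sketch of memory readout in a memory-based tracker: the current
# frame's features attend to keys/values buffered from past frames.
# Shapes and names are illustrative, not SAM2's actual code.
import torch
import torch.nn.functional as F

def memory_readout(curr_feats, mem_keys, mem_values):
    """curr_feats: (HW, C) current-frame features;
    mem_keys/mem_values: (N*HW, C) from N buffered memory frames."""
    attn = curr_feats @ mem_keys.T / curr_feats.shape[-1] ** 0.5  # (HW, N*HW)
    attn = F.softmax(attn, dim=-1)
    return attn @ mem_values  # (HW, C) target-conditioned features

# Toy usage: 2 memory frames of an 8x8 feature map with 64 channels.
curr = torch.randn(64, 64)
keys, vals = torch.randn(2 * 64, 64), torch.randn(2 * 64, 64)
assert memory_readout(curr, keys, vals).shape == (64, 64)
```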
Related papers
- SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory [23.547018300192065]
This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking.
By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning.
In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT$_{\text{ext}}$ and a 3.5% AO gain on GOT-10k.
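A hedged sketch of the motion-aware selection idea: among candidate masks, prefer the one whose box agrees with a simple motion forecast. The constant-velocity model and blend weight below are illustrative assumptions, not SAMURAI's exact Kalman-filter formulation.

```python
# Illustrative motion-aware mask selection: blend decoder confidence with
# agreement to a constant-velocity box forecast. Not SAMURAI's actual API.
def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2); returns intersection-over-union."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def select_mask(candidates, prev_box, velocity, w_motion=0.4):
    """candidates: list of (box, mask_confidence); velocity: (vx, vy)."""
    vx, vy = velocity
    pred = (prev_box[0] + vx, prev_box[1] + vy,
            prev_box[2] + vx, prev_box[3] + vy)  # constant-velocity forecast
    def score(cand):
        box, conf = cand
        # blend decoder confidence with agreement to the motion prediction
        return (1 - w_motion) * conf + w_motion * iou(box, pred)
    return max(candidates, key=score)

# Toy usage: the distractor's mask is more confident, but motion favours
# the true target.
target = ((12, 10, 22, 20), 0.70)
distractor = ((40, 40, 50, 50), 0.75)
assert select_mask([target, distractor], (10, 10, 20, 20), (2, 0)) == target
```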
arXiv Detail & Related papers (2024-11-18T05:59:03Z)
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree [79.26409013413003]
We introduce SAM2Long, an improved training-free video object segmentation strategy.
It considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways.
SAM2Long achieves an average improvement of 3.0 points across all 24 head-to-head comparisons.
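The pathway idea can be viewed as a small beam search over per-frame hypotheses; the sketch below assumes a generic `propose` callback, a beam width, and additive scoring for illustration only.

```python
# Illustrative multi-pathway tracking: keep several hypotheses per frame,
# accumulate confidence, and report the best video-level sequence.
def track_with_pathways(frames, propose, beam=3):
    """propose(frame, state) -> list of (mask, new_state, frame_score)."""
    pathways = [([], None, 0.0)]  # (mask_history, state, cumulative_score)
    for frame in frames:
        expanded = [
            (masks + [m], s, total + sc)
            for masks, state, total in pathways
            for m, s, sc in propose(frame, state)
        ]
        # keep only the `beam` highest-scoring hypotheses
        pathways = sorted(expanded, key=lambda p: p[2], reverse=True)[:beam]
    return max(pathways, key=lambda p: p[2])[0]

# Toy usage: each frame offers a confident and a doubtful mask hypothesis.
fake = lambda frame, state: [(f"mask{frame}a", state, 0.9),
                             (f"mask{frame}b", state, 0.4)]
assert track_with_pathways([0, 1, 2], fake) == ["mask0a", "mask1a", "mask2a"]
```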
arXiv Detail & Related papers (2024-10-21T17:59:19Z)
- From SAM to SAM 2: Exploring Improvements in Meta's Segment Anything Model [0.5639904484784127]
The Segment Anything Model (SAM) was introduced to the computer vision community by Meta in April 2023.
SAM excels in zero-shot performance, segmenting unseen objects without additional training, enabled by training on a large dataset of over one billion image masks.
SAM 2 expands this functionality to video, leveraging memory from preceding and subsequent frames to generate accurate segmentation across entire videos.
arXiv Detail & Related papers (2024-08-12T17:17:35Z)
- TF-SASM: Training-free Spatial-aware Sparse Memory for Multi-object Tracking [6.91631684487121]
Multi-object tracking (MOT) in computer vision remains a significant challenge, requiring precise localization and continuous tracking of multiple objects in video sequences.
We propose a novel memory-based approach that selectively stores critical features based on object motion and overlapping awareness.
Our approach significantly improves over MOTRv2 on the DanceTrack test set, with gains of 2.0% in AssA and 2.1% in IDF1.
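A hedged sketch of such selective storage, reusing the `iou` helper from the SAMURAI sketch above: store an object's features only when it has moved appreciably and is not heavily overlapped by other objects. The thresholds are illustrative assumptions, not the paper's values.

```python
# Illustrative motion- and overlap-aware memory storage (reuses `iou`).
def maybe_store(memory, obj_id, feat, box, other_boxes,
                move_thr=0.7, occ_thr=0.5):
    """memory: dict obj_id -> (feat, box); other_boxes: other objects' boxes."""
    last = memory.get(obj_id)
    if last is not None and iou(box, last[1]) > move_thr:
        return False  # barely moved: stored feature is still representative
    if any(iou(box, o) > occ_thr for o in other_boxes):
        return False  # heavy overlap: feature likely contaminated
    memory[obj_id] = (feat, box)
    return True

mem = {}
assert maybe_store(mem, 1, "feat_t0", (0, 0, 10, 10), [])
assert not maybe_store(mem, 1, "feat_t1", (0, 0, 10, 10), [])  # no motion
assert maybe_store(mem, 1, "feat_t2", (20, 0, 30, 10), [])     # moved
```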
arXiv Detail & Related papers (2024-07-05T07:55:19Z)
- FocSAM: Delving Deeply into Focused Objects in Segmenting Anything [58.042354516491024]
The Segment Anything Model (SAM) marks a notable milestone in segmentation models.
We propose FocSAM with a pipeline redesigned on two pivotal aspects.
First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object.
Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks.
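As a rough illustration of the second component, below is the generic pixel-wise dynamic ReLU recipe: activation slopes and intercepts are predicted per pixel from an auxiliary embedding standing in for the click features. Layer sizes and the K=2 pieces are assumptions, not FocSAM's exact module.

```python
# Generic pixel-wise dynamic ReLU: per-pixel (slope, intercept) pairs are
# predicted from a conditioning tensor; output is the max over K pieces.
import torch
import torch.nn as nn

class PixelDynamicReLU(nn.Module):
    def __init__(self, channels, k=2):
        super().__init__()
        self.k = k
        # predict K (slope, intercept) pairs per channel per pixel
        self.coef = nn.Conv2d(channels, 2 * k * channels, kernel_size=1)

    def forward(self, x, cond):
        """x, cond: (B, C, H, W); cond carries the interaction cues."""
        b, c, h, w = x.shape
        ab = self.coef(cond).view(b, 2, self.k, c, h, w)
        a, bias = ab[:, 0], ab[:, 1]      # each (B, K, C, H, W)
        y = a * x.unsqueeze(1) + bias     # K candidate activations
        return y.max(dim=1).values        # pixel-wise max over the pieces

x = torch.randn(1, 8, 16, 16)
act = PixelDynamicReLU(8)
assert act(x, torch.randn(1, 8, 16, 16)).shape == x.shape
```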
arXiv Detail & Related papers (2024-05-29T02:34:13Z)
- Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes the SAM (here, the paper's Spatio-temporal Aggregation Module, not the Segment Anything Model) more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
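A minimal sketch of a constant-size recurrent memory update follows; the GRU-style gate is an assumption standing in for the paper's aggregation module.

```python
# Illustrative constant-size memory: each frame's features are folded into
# a fixed-size embedding via a learned gate, so memory never grows.
import torch
import torch.nn as nn

class RecurrentMemory(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, memory, frame_feat):
        """memory, frame_feat: (N, dim); returns memory of the same size."""
        z = torch.sigmoid(self.gate(torch.cat([memory, frame_feat], dim=-1)))
        return z * memory + (1 - z) * frame_feat  # gated constant-size update

mem = torch.zeros(64, 128)
rm = RecurrentMemory(128)
for _ in range(5):              # memory size is independent of video length
    mem = rm(mem, torch.randn(64, 128))
assert mem.shape == (64, 128)
```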
arXiv Detail & Related papers (2022-05-08T02:24:43Z)
- Learning Dynamic Compact Memory Embedding for Deformable Visual Object Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms strong segmentation-based trackers, i.e., D3S and SiamMask, on the DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z)
- Online Multiple Object Tracking with Cross-Task Synergy [120.70085565030628]
We propose a novel unified model with synergy between position prediction and embedding association.
The two tasks are linked by temporal-aware target attention and distractor attention, as well as an identity-aware memory aggregation model.
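As a hedged illustration of the distractor-attention idea, the sketch below suppresses the target response wherever distractor templates respond strongly; the cosine similarities and subtraction weight are assumptions, not the paper's identity-aware aggregation.

```python
# Illustrative distractor-aware response: suppress regions where distractor
# templates match better than the target template.
import torch
import torch.nn.functional as F

def distractor_aware_response(feat, target_tmpl, distractor_tmpls, alpha=0.5):
    """feat: (HW, C); target_tmpl: (C,); distractor_tmpls: (K, C)."""
    resp = F.cosine_similarity(feat, target_tmpl.unsqueeze(0), dim=-1)  # (HW,)
    if distractor_tmpls.numel():
        dis = F.cosine_similarity(
            feat.unsqueeze(1), distractor_tmpls.unsqueeze(0), dim=-1
        ).max(dim=1).values          # strongest distractor match per location
        resp = resp - alpha * dis    # suppress distractor-dominated regions
    return resp

feat = torch.randn(64, 32)
out = distractor_aware_response(feat, torch.randn(32), torch.randn(2, 32))
assert out.shape == (64,)
```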
arXiv Detail & Related papers (2021-04-01T10:19:40Z)
- DMV: Visual Object Tracking via Part-level Dense Memory and Voting-based Retrieval [61.366644088881735]
We propose a novel memory-based tracker via part-level dense memory and voting-based retrieval, called DMV.
We also propose a novel voting mechanism for memory reading that filters out unreliable information in the memory.
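A hedged sketch of voting-based memory reading: each matched memory part casts a similarity-weighted vote for a target position, and low-consensus positions are discarded as unreliable. The vote map and threshold are illustrative, not DMV's actual scheme.

```python
# Illustrative voting readout: memory parts vote for grid positions;
# positions with insufficient consensus are zeroed out.
import torch

def vote_readout(similarity, vote_pos, grid_hw, min_votes=1.0):
    """similarity: (M,) match score per memory part;
    vote_pos: (M, 2) grid position each part votes for, as (row, col)."""
    h, w = grid_hw
    votes = torch.zeros(h, w)
    for s, (r, c) in zip(similarity, vote_pos):
        votes[int(r), int(c)] += float(s)
    # discard low-consensus positions as unreliable memory matches
    return torch.where(votes >= min_votes, votes, torch.zeros_like(votes))

sim = torch.tensor([0.9, 0.8, 0.2])
pos = torch.tensor([[3, 3], [3, 3], [7, 1]])  # two parts agree on (3, 3)
vmap = vote_readout(sim, pos, (8, 8))
assert vmap[3, 3] > 0 and vmap[7, 1] == 0     # lone weak vote filtered out
```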
arXiv Detail & Related papers (2020-03-20T10:05:30Z)