HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking
- URL: http://arxiv.org/abs/2507.07603v2
- Date: Mon, 14 Jul 2025 18:18:22 GMT
- Title: HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking
- Authors: Ruixiang Chen, Guolei Sun, Yawei Li, Jie Qin, Luca Benini
- Abstract summary: This paper presents enhancements to the SAM2 framework for the video object tracking task. We introduce a hierarchical motion estimation strategy, combining lightweight linear prediction with selective non-linear refinement to improve tracking accuracy without requiring additional training.
- Score: 48.05251729350641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents enhancements to the SAM2 framework for the video object tracking task, addressing challenges such as occlusions, background clutter, and target reappearance. We introduce a hierarchical motion estimation strategy, combining lightweight linear prediction with selective non-linear refinement to improve tracking accuracy without requiring additional training. In addition, we optimize the memory bank by distinguishing long-term and short-term memory frames, enabling more reliable tracking under long-term occlusions and appearance changes. Experimental results show consistent improvements across different model scales. Our method achieves state-of-the-art performance on LaSOT and LaSOText with the large model, with 9.6% and 7.2% relative AUC improvements over the original SAM2, and shows even larger relative gains on smaller models, highlighting the effectiveness of our trainless, low-overhead improvements for boosting long-term tracking performance. The code is available at https://github.com/LouisFinner/HiM2SAM.
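The two ideas in the abstract lend themselves to a compact illustration. Below is a minimal sketch, assuming a constant-velocity linear predictor over box centers and a fixed-size memory history; the names (`LinearMotionPredictor`, `track_step`, `select_memory_frames`, `refine_fn`) and all thresholds are illustrative assumptions, not the authors' actual API.

```python
# Hedged sketch of hierarchical motion estimation plus a long/short-term
# memory split, in the spirit of the abstract. Not the authors' code.
import numpy as np

class LinearMotionPredictor:
    """Constant-velocity prediction over box centers (the lightweight linear stage)."""
    def __init__(self):
        self.prev_center = None
        self.velocity = np.zeros(2)

    def predict(self, center):
        center = np.asarray(center, dtype=float)
        if self.prev_center is not None:
            self.velocity = center - self.prev_center
        self.prev_center = center
        return center + self.velocity  # predicted next-frame center

def track_step(predictor, center, mask_score, refine_fn, score_thresh=0.5):
    """Use the cheap linear prediction by default; invoke the costlier
    non-linear refinement only when the mask score suggests the linear
    guess is unreliable (the 'selective' part)."""
    pred = predictor.predict(center)
    if mask_score < score_thresh:
        pred = refine_fn(pred)  # hypothetical non-linear refinement hook
    return pred

def select_memory_frames(history, n_short=3, n_long=4, conf_thresh=0.8):
    """Split the memory bank: the most recent frames give short-term context,
    while confident frames sampled from further back preserve the target's
    appearance across long occlusions."""
    short_term = history[-n_short:]
    older = [f for f in history[:-n_short] if f["conf"] >= conf_thresh]
    stride = max(1, len(older) // max(1, n_long))
    long_term = older[::stride][:n_long]
    return long_term + short_term
```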
Related papers
- SAM2RL: Towards Reinforcement Learning Memory Control in Segment Anything Model 2 [2.659882635924329]
Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks.
Recent methods augment SAM 2 with hand-crafted update rules to better handle distractors and object motion.
We propose a fundamentally different approach, using reinforcement learning to optimize memory updates in SAM 2.
arXiv Detail & Related papers (2025-07-11T12:53:19Z)
- Progressive Scaling Visual Object Tracking [38.28834233600855]
We propose a progressive scaling training strategy for visual object tracking, systematically analyzing the influence of training data volume, model size, and input resolution on tracking performance.
Our empirical study reveals that while scaling each factor leads to significant improvements in tracking accuracy, naive training suffers from suboptimal optimization and limited iterative refinement.
We introduce DT-Training, a progressive scaling framework that integrates small teacher transfer and dual-branch alignment to maximize model potential.
arXiv Detail & Related papers (2025-05-26T13:45:27Z)
- Efficient Transformer for High Resolution Image Motion Deblurring [0.0]
This paper presents a comprehensive study and improvement of the Restormer architecture for high-resolution image motion deblurring.
We introduce architectural modifications that reduce model complexity by 18.4% while maintaining or improving performance through optimized attention mechanisms.
Our results suggest that thoughtful architectural simplification combined with enhanced training strategies can yield more efficient yet equally capable models for motion deblurring tasks.
arXiv Detail & Related papers (2025-01-30T14:58:33Z)
- SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory [23.547018300192065]
This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking.
By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning.
In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOText and a 3.5% AO gain on GOT-10k.
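As a rough illustration of motion-aware mask selection, the sketch below re-scores candidate masks by blending the model's own confidence with motion consistency (IoU against a motion-predicted box); the weighting `alpha` and the candidate fields are assumptions, not SAMURAI's actual implementation.

```python
# Hedged sketch: pick the candidate mask that best agrees with both the
# segmentation confidence and a motion prediction. Names are illustrative.
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def select_mask(candidates, predicted_box, alpha=0.25):
    """Blend each candidate's own mask confidence with a motion-consistency
    score (IoU against the motion-predicted box) and keep the best one."""
    def score(c):
        return (1 - alpha) * c["confidence"] + alpha * box_iou(c["box"], predicted_box)
    return max(candidates, key=score)
```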
arXiv Detail & Related papers (2024-11-18T05:59:03Z)
- TF-SASM: Training-free Spatial-aware Sparse Memory for Multi-object Tracking [6.91631684487121]
Multi-object tracking (MOT) in computer vision remains a significant challenge, requiring precise localization and continuous tracking of multiple objects in video sequences.
We propose a novel memory-based approach that selectively stores critical features based on object motion and overlapping awareness.
Our approach significantly improves over MOTRv2 in the DanceTrack test set, demonstrating a gain of 2.0% AssA score and 2.1% in IDF1 score.
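A hedged sketch of the spatial-aware sparse storage idea: an observation enters the memory only if it does not heavily overlap existing entries, keeping the fixed-size memory spatially diverse. The function names, fields, and thresholds below are hypothetical.

```python
# Illustrative sketch of selective, overlap-aware memory storage.
def should_store(memory, new_box, iou_fn, max_overlap=0.7):
    """Skip near-duplicate observations so the fixed-size memory keeps
    spatially diverse views of each target."""
    return all(iou_fn(new_box, m["box"]) < max_overlap for m in memory)

def update_memory(memory, new_entry, iou_fn, capacity=8):
    if should_store(memory, new_entry["box"], iou_fn):
        memory.append(new_entry)
        if len(memory) > capacity:
            memory.pop(0)  # evict the oldest entry
    return memory
```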
arXiv Detail & Related papers (2024-07-05T07:55:19Z)
- RTracker: Recoverable Tracking via PN Tree Structured Memory [71.05904715104411]
We propose a recoverable tracking framework, RTracker, that uses a tree-structured memory to dynamically associate a tracker and a detector to enable self-recovery.
Specifically, we propose a Positive-Negative Tree-structured memory to chronologically store and maintain positive and negative target samples.
Our core idea is to use the support samples of positive and negative target categories to establish a relative distance-based criterion for a reliable assessment of target loss.
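The relative distance-based criterion can be sketched as follows: the target is declared lost when its current feature sits closer to the stored negative samples than to the positive ones. The margin and data layout below are assumptions, not RTracker's actual implementation.

```python
# Minimal sketch of a relative distance-based target-loss test.
import numpy as np

def target_lost(feat, positives, negatives, margin=0.0):
    """positives/negatives: lists of support feature vectors kept
    chronologically in the PN memory."""
    d_pos = min(np.linalg.norm(feat - p) for p in positives)
    d_neg = min(np.linalg.norm(feat - n) for n in negatives)
    return d_pos > d_neg + margin  # closer to negatives => likely lost
```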
arXiv Detail & Related papers (2024-03-28T08:54:40Z)
- AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks [76.90477930208982]
Sharpness-aware minimization (SAM) has been extensively explored as it can generalize better when training deep neural networks.
Integrating SAM with adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored.
We conduct experiments on several NLP tasks, which show that AdaSAM achieves superior performance compared with SGD, AMSGrad, and SAM.
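For reference, the sketch below combines a standard SAM ascent/descent step with an Adam base optimizer, which captures the spirit of pairing SAM with an adaptive learning rate and momentum; it is a simplified approximation, not the AdaSAM algorithm verbatim, and `loss_fn(model, batch)` is a hypothetical closure.

```python
# Hedged sketch: SAM perturbation step driven by an adaptive optimizer.
import torch

def sam_adaptive_step(model, loss_fn, batch, opt, rho=0.05):
    # 1) ascent step: perturb weights toward higher loss
    loss_fn(model, batch).backward()
    grads = [p.grad.clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / norm)  # move to the worst-case neighborhood point
    opt.zero_grad()
    # 2) descent step: gradient at the perturbed point, then restore weights
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / norm)  # undo the perturbation
    opt.step()       # opt can be Adam, supplying adaptive lr and momentum
    opt.zero_grad()
```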
arXiv Detail & Related papers (2023-03-01T15:12:42Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA reduces the training-time memory cost by about 70% and accelerates training by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
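The core re-parameterization trick can be illustrated by collapsing parallel linear branches into one equivalent convolution, so only a single conv needs to run. The sketch below (two 3x3 convs plus a scaled identity branch, all assumed to share shape `nn.Conv2d(channels, channels, 3, padding=1, bias=True)`) is a simplified stand-in for OREPA's actual block.

```python
# Illustrative sketch: merging parallel conv branches into one convolution.
import torch
import torch.nn as nn

def merge_branches(conv_a, conv_b, identity_scale, channels=64):
    fused = nn.Conv2d(channels, channels, 3, padding=1, bias=True)
    with torch.no_grad():
        # sum the parallel kernels
        w = conv_a.weight + conv_b.weight
        # an identity branch is a 3x3 kernel with the scale at the center
        eye = torch.zeros_like(w)
        for c in range(channels):
            eye[c, c, 1, 1] = identity_scale
        fused.weight.copy_(w + eye)
        fused.bias.copy_(conv_a.bias + conv_b.bias)
    return fused
```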
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Towards Efficient and Scalable Sharpness-Aware Minimization [81.22779501753695]
We propose a novel algorithm, LookSAM, that only periodically calculates the inner gradient ascent.
LookSAM achieves similar accuracy gains to SAM while being tremendously faster.
We are the first to successfully scale up the batch size when training Vision Transformers (ViTs).
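A simplified sketch of the periodic-ascent idea: the extra forward/backward pass of SAM's inner ascent runs only every `k` steps, and the cached perturbation direction is reused in between. LookSAM actually reuses a decomposed component of the SAM gradient; this version is intentionally minimal and all names are assumptions.

```python
# Hedged sketch of reusing the SAM ascent direction between full updates.
import torch

class PeriodicAscent:
    def __init__(self, model, k=5, rho=0.05):
        self.model, self.k, self.rho = model, k, rho
        self.step_count, self.cached_dir = 0, None

    def perturb(self, loss_fn, batch):
        if self.step_count % self.k == 0:        # full inner ascent this step
            loss_fn(self.model, batch).backward()
            grads = [p.grad.clone() for p in self.model.parameters()]
            norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
            self.cached_dir = [g / norm for g in grads]
            self.model.zero_grad()
        self.step_count += 1
        with torch.no_grad():                     # apply (cached) perturbation
            for p, d in zip(self.model.parameters(), self.cached_dir):
                p.add_(self.rho * d)

    def restore(self):
        with torch.no_grad():
            for p, d in zip(self.model.parameters(), self.cached_dir):
                p.sub_(self.rho * d)
```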
arXiv Detail & Related papers (2022-03-05T11:53:37Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Object-Adaptive LSTM Network for Real-time Visual Tracking with Adversarial Data Augmentation [31.842910084312265]
We propose a novel real-time visual tracking method, which adopts an object-adaptive LSTM network to effectively capture video sequential dependencies and adaptively learn object appearance variations.
Experiments on four visual tracking benchmarks demonstrate the state-of-the-art performance of our method in terms of both tracking accuracy and speed.
arXiv Detail & Related papers (2020-02-07T03:06:07Z)