Related papers: Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval

URL: http://arxiv.org/abs/2602.08224v2
Date: Tue, 10 Feb 2026 08:41:46 GMT
Title: Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval
Authors: Jing Zhang, Zhikai Li, Xuewen Liu, Qingyi Gu
Abstract summary: Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks. We propose Efficient-SAM2, which promotes SAM2 to adaptively focus on object regions while eliminating task-irrelevant computations. With negligible additional parameters and minimal training overhead, Efficient-SAM2 delivers 1.68x speedup on SAM2.1-L model with only 1.0% accuracy drop on SA-V test set.
Score: 22.632907736085034
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks; however, its heavy computational burden hinders its application in real-time video processing. Although there have been efforts to improve the efficiency of SAM2, most of them focus on retraining a lightweight backbone, with little exploration of post-training acceleration. In this paper, we observe that SAM2 exhibits a sparse perception pattern similar to biological vision, which provides opportunities for eliminating redundant computation and accelerating inference: i) In the mask decoder, attention primarily focuses on the foreground objects, whereas the image encoder in the earlier stage exhibits a broad attention span, which results in unnecessary computation on background regions. ii) In the memory bank, only a small subset of tokens in each frame contributes significantly to memory attention, and the salient regions exhibit temporal consistency, making full-token computation redundant. With these insights, we propose Efficient-SAM2, which promotes SAM2 to adaptively focus on object regions while eliminating task-irrelevant computations, thereby significantly improving inference efficiency. Specifically, for the image encoder, we propose object-aware Sparse Window Routing (SWR), a window-level computation allocation mechanism that leverages the consistency and saliency cues from the previous-frame decoder to route background regions into a lightweight shortcut branch. Moreover, for memory attention, we propose object-aware Sparse Memory Retrieval (SMR), which allows only the salient memory tokens in each frame to participate in computation, with the saliency pattern reused from their first recollection. With negligible additional parameters and minimal training overhead, Efficient-SAM2 delivers a 1.68x speedup on the SAM2.1-L model with only a 1.0% accuracy drop on the SA-V test set.
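
Illustrative sketch (not from the paper): the abstract names two mechanisms, window-level routing driven by the previous frame and memory retrieval restricted to salient tokens whose selection is reused. The pure-Python toy below shows one way these ideas could look. Everything in it is an assumption made for illustration: the names `route_windows` and `SparseMemory`, the use of a binary previous-frame mask as the saliency cue, and dot-product top-k scoring as the saliency criterion. The abstract does not specify the actual criteria, and the lightweight shortcut branch is not modeled.

```python
import math


def route_windows(prev_mask, window):
    """Return the (row, col) indices of windows that contain foreground.

    Assumption: a binary previous-frame mask stands in for the decoder's
    saliency cue. Windows not returned here would go to a cheap shortcut.
    """
    active = set()
    for y, row in enumerate(prev_mask):
        for x, value in enumerate(row):
            if value:
                active.add((y // window, x // window))
    return active


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SparseMemory:
    """Attend over only the top-`keep` memory tokens of each frame.

    Assumption: saliency is the dot product with the query that first
    reads the frame; the selected indices are cached and reused later.
    """

    def __init__(self, keep):
        self.keep = keep
        self.cache = {}  # frame_id -> sorted indices of kept tokens

    def salient_indices(self, frame_id, query, keys):
        if frame_id not in self.cache:
            scores = [dot(query, k) for k in keys]
            order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
            self.cache[frame_id] = sorted(order[: self.keep])
        return self.cache[frame_id]

    def attend(self, frame_id, query, keys, values):
        idx = self.salient_indices(frame_id, query, keys)
        weights = softmax([dot(query, keys[i]) for i in idx])
        dim = len(values[0])
        return [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]
```

In this toy, the cost of attention per memory frame scales with `keep` rather than with the full token count, and the selection step runs only once per frame. That is the general kind of saving the abstract attributes to SMR, though the real method may differ in every detail.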

Related papers

Fast SAM2 with Text-Driven Token Pruning [52.8350457627401]
Segment Anything Model 2 (SAM2), a vision computation model, has significantly advanced prompt-driven video object segmentation. SAM2 pipelines propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object. We introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation.
arXiv Detail & Related papers (2025-12-24T18:59:05Z)
Evaluating SAM2 for Video Semantic Segmentation [60.157605818225186]
The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos. This paper explores the extension of SAM2 to dense Video Semantic Segmentation (VSS). Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
arXiv Detail & Related papers (2025-12-01T15:15:16Z)
Distractor-Aware Memory-Based Visual Object Tracking [17.945503249662675]
We propose a distractor-aware drop-in memory module and introspection-based management method for SAM2. Our design effectively reduces the tracking drift toward distractors and improves redetection capability after object occlusion. We show DAM4SAM outperforms SAM2.1 on thirteen benchmarks and sets new state-of-the-art results on ten.
arXiv Detail & Related papers (2025-09-17T09:54:27Z)
EdgeTAM: On-Device Track Anything Model [65.10032957471824]
Segment Anything Model (SAM) 2 further extends its capability from image to video inputs through a memory bank mechanism. We aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performance. We propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost.
arXiv Detail & Related papers (2025-01-13T12:11:07Z)
FocSAM: Delving Deeply into Focused Objects in Segmenting Anything [58.042354516491024]
The Segment Anything Model (SAM) marks a notable milestone in segmentation models. We propose FocSAM with a pipeline redesigned on two pivotal aspects. First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object. Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks.
arXiv Detail & Related papers (2024-05-29T02:34:13Z)
SAM-Lightening: A Lightweight Segment Anything Model with Dilated Flash Attention to Achieve 30 times Acceleration [6.515075311704396]
Segment Anything Model (SAM) has garnered significant attention in segmentation tasks due to its zero-shot generalization ability. We introduce SAM-Lightening, a variant of SAM that features a re-engineered attention mechanism, termed Dilated Flash Attention. Experiments on COCO and LVIS reveal that SAM-Lightening significantly outperforms the state-of-the-art methods in both run-time efficiency and segmentation accuracy.
arXiv Detail & Related papers (2024-03-14T09:07:34Z)
TinySAM: Pushing the Envelope for Efficient Segment Anything Model [73.06322749886483]
We propose a framework to obtain a tiny segment anything model (TinySAM) while maintaining the strong zero-shot performance. With all these proposed methods, our TinySAM leads to orders of magnitude computational reduction and pushes the envelope for efficient segment anything task.
arXiv Detail & Related papers (2023-12-21T12:26:11Z)
EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything [36.553867358541154]
Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. We propose EfficientSAMs, light-weight SAM models that exhibit decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from SAM image encoder for effective visual representation learning.
arXiv Detail & Related papers (2023-12-01T18:31:00Z)
Region Aware Video Object Segmentation with Deep Motion Modeling [56.95836951559529]
Region Aware Video Object Segmentation (RAVOS) is a method that predicts regions of interest for efficient object segmentation and memory storage. For efficient segmentation, object features are extracted according to the ROIs, and an object decoder is designed for object-level segmentation. For efficient memory storage, we propose motion path memory to filter out redundant context by memorizing the features within the motion path of objects between two frames.
arXiv Detail & Related papers (2022-07-21T01:44:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.