Efficient Track Anything
- URL: http://arxiv.org/abs/2411.18933v1
- Date: Thu, 28 Nov 2024 05:52:10 GMT
- Title: Efficient Track Anything
- Authors: Yunyang Xiong, Chong Zhou, Xiaoyu Xiang, Lemeng Wu, Chenchen Zhu, Zechun Liu, Saksham Suri, Balakrishnan Varadarajan, Ramya Akula, Forrest Iandola, Raghuraman Krishnamoorthi, Bilge Soran, Vikas Chandra,
- Abstract summary: Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything.
High complexity of multistage image encoder and memory module has limited its applications in real-world tasks.
We propose EfficientTAMs, lightweight track anything models that produce high-quality results with low latency and model size.
- Score: 32.0480507678127
- License:
- Abstract: Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything. Key components of SAM 2 that drive the impressive video object segmentation performance include a large multistage image encoder for frame feature extraction and a memory mechanism that stores memory contexts from past frames to help current frame segmentation. The high computation complexity of multistage image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight track anything models that produce high-quality results with low latency and model size. Our idea is based on revisiting the plain, nonhierarchical Vision Transformer (ViT) as an image encoder for video object segmentation, and introducing an efficient memory module, which reduces the complexity for both frame feature extraction and memory computation for current frame segmentation. We take vanilla lightweight ViTs and efficient memory module to build EfficientTAMs, and train the models on SA-1B and SA-V datasets for video object segmentation and track anything tasks. We evaluate on multiple video segmentation benchmarks including semi-supervised VOS and promptable video segmentation, and find that our proposed EfficientTAM with vanilla ViT perform comparably to SAM 2 model (HieraB+SAM 2) with ~2x speedup on A100 and ~2.4x parameter reduction. On segment anything image tasks, our EfficientTAMs also perform favorably over original SAM with ~20x speedup on A100 and ~20x parameter reduction. On mobile devices such as iPhone 15 Pro Max, our EfficientTAMs can run at ~10 FPS for performing video object segmentation with reasonable quality, highlighting the capability of small models for on-device video object segmentation applications.
Related papers
- EdgeTAM: On-Device Track Anything Model [65.10032957471824]
Segment Anything Model (SAM) 2 further extends its capability from image to video inputs through a memory bank mechanism.
We aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performance.
We propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost.
arXiv Detail & Related papers (2025-01-13T12:11:07Z) - Multi-Granularity Video Object Segmentation [36.06127939037613]
We propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset.
We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation.
In addition, we present memory-based mask propagation model (MMPM), trained and evaluated on MUG-VOS dataset.
arXiv Detail & Related papers (2024-12-02T13:17:41Z) - Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework.
Our approach significantly reduces the memory footprint compared to standard gradient checkpointing.
By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z) - SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree [79.26409013413003]
We introduce SAM2Long, an improved training-free video object segmentation strategy.
It considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways.
SAM2Long achieves an average improvement of 3.0 points across all 24 head-to-head comparisons.
arXiv Detail & Related papers (2024-10-21T17:59:19Z) - Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track [28.52754012142431]
Segment Anything Model 2 (SAM 2) is a foundation model towards solving promptable visual segmentation in images and videos.
SAM 2 builds a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date.
Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th place for 6th LSVOS Challenge VOS Track.
arXiv Detail & Related papers (2024-08-19T16:13:14Z) - SAM 2: Segment Anything in Images and Videos [63.44869623822368]
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos.
We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date.
Our model is a simple transformer architecture with streaming memory for real-time video processing.
arXiv Detail & Related papers (2024-08-01T17:00:08Z) - Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z) - Region Aware Video Object Segmentation with Deep Motion Modeling [56.95836951559529]
Region Aware Video Object (RAVOS) is a method that predicts regions of interest for efficient object segmentation and memory storage.
For efficient segmentation, object features are extracted according to the ROIs, and an object decoder is designed for object-level segmentation.
For efficient memory storage, we propose motion path memory to filter out redundant context by memorizing the features within the motion path of objects between two frames.
arXiv Detail & Related papers (2022-07-21T01:44:40Z) - Dual Temporal Memory Network for Efficient Video Object Segmentation [42.05305410986511]
One of the fundamental challenges in Video Object (VOS) is how to make the most use of the temporal information to boost the performance.
We present an end-to-end network which stores short- and long-term video sequence information preceding the current frame as the temporal memories.
Our network consists of two temporal sub-networks including a short-term memory sub-network and a long-term memory sub-network.
arXiv Detail & Related papers (2020-03-13T06:07:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.