RepViT-SAM: Towards Real-Time Segmenting Anything
- URL: http://arxiv.org/abs/2312.05760v2
- Date: Thu, 29 Feb 2024 05:09:52 GMT
- Title: RepViT-SAM: Towards Real-Time Segmenting Anything
- Authors: Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
- Abstract summary: Segment Anything Model (SAM) has shown impressive zero-shot transfer performance for various computer vision tasks.
MobileSAM proposes to replace the heavyweight image encoder in SAM with TinyViT by employing distillation.
RepViT-SAM can enjoy significantly better zero-shot transfer capability than MobileSAM, along with nearly $10\times$ faster inference speed.
- Score: 71.94042743317937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Segment Anything Model (SAM) has shown impressive zero-shot transfer
performance for various computer vision tasks recently. However, its heavy
computation costs remain daunting for practical applications. MobileSAM
proposes to replace the heavyweight image encoder in SAM with TinyViT by
employing distillation, which results in a significant reduction in
computational requirements. However, its deployment on resource-constrained
mobile devices still encounters challenges due to the substantial memory and
computational overhead caused by self-attention mechanisms. Recently, RepViT
has achieved a state-of-the-art performance-latency trade-off on mobile
devices by incorporating efficient architectural designs of ViTs into CNNs.
Here, to achieve real-time segmenting anything on mobile devices, following
MobileSAM, we replace the heavyweight image encoder in SAM with the RepViT model,
ending up with the RepViT-SAM model. Extensive experiments show that RepViT-SAM
can enjoy significantly better zero-shot transfer capability than MobileSAM,
along with nearly $10\times$ faster inference speed. The code and models are
available at \url{https://github.com/THU-MIG/RepViT}.
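Concretely, the recipe is architectural substitution: SAM's prompt encoder and mask decoder are kept, and only the heavyweight ViT-H image encoder is swapped for an efficient RepViT backbone that emits image embeddings of the shape the decoder expects. The PyTorch sketch below illustrates that wiring only; the module names, the stand-in encoder, and the 256x64x64 embedding shape are illustrative assumptions, not the API of the released THU-MIG/RepViT code.

```python
# Hedged sketch: swap SAM's heavy ViT image encoder for a lightweight
# RepViT-style CNN backbone while reusing the prompt encoder and mask decoder.
# All module names and shapes are illustrative stand-ins, not the actual
# segment-anything or THU-MIG/RepViT APIs.
import torch
import torch.nn as nn


class TinyRepViTStyleEncoder(nn.Module):
    """Stand-in for a RepViT backbone: a small conv stack mapping a
    1024x1024 RGB image to the 256x64x64 embedding a SAM-style decoder expects."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.GELU(),    # 1024 -> 512
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.GELU(),  # 512 -> 256
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.GELU(), # 256 -> 128
            nn.Conv2d(256, embed_dim, 3, stride=2, padding=1),      # 128 -> 64
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stem(x)  # (B, 256, 64, 64)


class LightweightSAM(nn.Module):
    """SAM-style wrapper: only the image encoder changes; the prompt encoder
    and mask decoder would be reused from the original SAM checkpoint."""

    def __init__(self, image_encoder: nn.Module,
                 prompt_encoder: nn.Module, mask_decoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder
        self.prompt_encoder = prompt_encoder
        self.mask_decoder = mask_decoder

    def forward(self, image, prompts):
        image_embedding = self.image_encoder(image)   # heavy part replaced
        prompt_tokens = self.prompt_encoder(prompts)  # unchanged
        return self.mask_decoder(image_embedding, prompt_tokens)  # unchanged


if __name__ == "__main__":
    encoder = TinyRepViTStyleEncoder()
    out = encoder(torch.randn(1, 3, 1024, 1024))
    print(out.shape)  # torch.Size([1, 256, 64, 64])
```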
Related papers
- From SAM to SAM 2: Exploring Improvements in Meta's Segment Anything Model (arXiv, 2024-08-12)
The Segment Anything Model (SAM) was introduced to the computer vision community by Meta in April 2023.
SAM excels in zero-shot performance, segmenting unseen objects without additional training, enabled by its training on a large dataset of over one billion image masks.
SAM 2 expands this functionality to video, leveraging memory from preceding and subsequent frames to generate accurate segmentation across entire videos.
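As a rough illustration of the memory mechanism described above, the sketch below segments each frame conditioned on a rolling bank of embeddings from already-processed frames. The interfaces are assumed, only preceding frames are used for brevity, and this is not Meta's SAM 2 implementation.

```python
# Schematic memory-conditioned video segmentation loop, in the spirit of the
# SAM 2 description above. The encoder/decoder interfaces are assumptions.
from collections import deque

import torch
import torch.nn as nn


def segment_video(frames, frame_encoder: nn.Module, memory_decoder: nn.Module,
                  max_memory: int = 6):
    """Segment each (3, H, W) frame conditioned on embeddings of recent frames.

    memory_decoder is assumed to take (frame_embedding, list_of_memory_embeddings)
    and return mask logits for that frame.
    """
    memory = deque(maxlen=max_memory)  # rolling bank of past frame embeddings
    masks = []
    for frame in frames:
        emb = frame_encoder(frame.unsqueeze(0))
        mask = memory_decoder(emb, list(memory))  # condition on past frames
        memory.append(emb.detach())               # store for future frames
        masks.append(mask)
    return masks
```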
- TS-SAM: Fine-Tuning Segment-Anything Model for Downstream Tasks (arXiv, 2024-08-03)
There is still a significant performance gap between fine-tuned SAMs and domain-specific models.
We propose Two-Stream SAM (TS-SAM), which integrates the powerful features from SAM into side network training for comprehensive feature fusion.
Extensive experiments on ten public datasets from three tasks demonstrate that TS-SAM not only significantly outperforms the recently proposed SAM-Adapter and SSOM, but also achieves competitive performance with the SOTA domain-specific models.
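The side-network idea can be sketched as a frozen SAM image encoder paired with a small trainable stream whose features are fused with SAM's before a task head. The fusion scheme and module names below are assumptions for illustration, not the TS-SAM code.

```python
# Hedged sketch of a two-stream, side-network fine-tuning setup: SAM features
# stay frozen; only the side stream, fusion layer, and head are trained.
import torch
import torch.nn as nn


class SideNetworkSegmenter(nn.Module):
    def __init__(self, frozen_sam_encoder: nn.Module, feat_dim: int = 256,
                 num_classes: int = 1):
        super().__init__()
        self.sam_encoder = frozen_sam_encoder
        for p in self.sam_encoder.parameters():  # keep SAM weights fixed
            p.requires_grad_(False)
        # Lightweight trainable side stream operating on the raw image.
        self.side = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=16, padding=1), nn.GELU(),
            nn.Conv2d(32, feat_dim, 1),
        )
        # Fuse frozen SAM features with side features, then predict masks.
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, 1)
        self.head = nn.Conv2d(feat_dim, num_classes, 1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            sam_feat = self.sam_encoder(image)  # (B, C, h, w), frozen
        side_feat = self.side(image)
        side_feat = nn.functional.interpolate(
            side_feat, size=sam_feat.shape[-2:], mode="bilinear",
            align_corners=False)
        fused = self.fuse(torch.cat([sam_feat, side_feat], dim=1))
        return self.head(torch.relu(fused))  # mask logits
```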
- EfficientViT-SAM: Accelerated Segment Anything Model Without Accuracy Loss (arXiv, 2024-02-07)
We present EfficientViT-SAM, a new family of accelerated segment anything models.
For training, we begin with knowledge distillation from the SAM-ViT-H image encoder to EfficientViT.
Benefiting from EfficientViT's efficiency and capacity, EfficientViT-SAM delivers a 48.9x measured TensorRT speedup on an A100 GPU over SAM-ViT-H.
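The distillation stage described above can be illustrated as matching the student encoder's image embeddings to those of the frozen SAM-ViT-H teacher. The MSE objective and interfaces in this sketch are assumptions, not the paper's exact training recipe.

```python
# Hedged sketch of encoder-level knowledge distillation: a lightweight student
# image encoder learns to reproduce the frozen teacher's image embeddings.
import torch
import torch.nn as nn


def distillation_step(images: torch.Tensor, teacher: nn.Module,
                      student: nn.Module,
                      optimizer: torch.optim.Optimizer) -> float:
    teacher.eval()
    with torch.no_grad():
        target = teacher(images)  # frozen SAM-ViT-H image embeddings
    pred = student(images)        # lightweight encoder output
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```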
- TinySAM: Pushing the Envelope for Efficient Segment Anything Model (arXiv, 2023-12-21)
We propose a framework to obtain a tiny segment anything model (TinySAM) while maintaining the strong zero-shot performance.
We first propose a full-stage knowledge distillation method with hard prompt sampling and hard mask weighting strategies to distill a lightweight student model.
We also adapt post-training quantization to the promptable segmentation task, further reducing the computational cost.
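TinySAM's exact quantization procedure is not detailed here; as a generic illustration of the post-training quantization step, the snippet below applies PyTorch dynamic int8 quantization to the linear layers of a stand-in student model.

```python
# Generic post-training quantization sketch (not TinySAM's exact procedure):
# dynamically quantize the linear layers of a stand-in student model to int8
# to reduce memory and CPU compute.
import torch
import torch.nn as nn

student = nn.Sequential(  # stand-in for a distilled SAM student
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256),
)

quantized = torch.ao.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(4, 256)
print(quantized(x).shape)  # same interface, int8 weights under the hood
```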
- EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM (arXiv, 2023-12-11)
EdgeSAM is an accelerated variant of the Segment Anything Model (SAM).
Our approach involves distilling the original ViT-based SAM image encoder into a purely CNN-based architecture.
It is the first SAM variant that can run at over 30 FPS on an iPhone 14.
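A hedged sketch of prompt-aware distillation in this spirit: the student is supervised on mask outputs for sampled prompts, and new point prompts are drawn from regions where student and teacher masks disagree. The interfaces, loss, and sampling rule are assumptions, not EdgeSAM's published algorithm.

```python
# Hedged sketch of prompt-in-the-loop style distillation. Both models are
# assumed to map (image, list_of_point_prompts) to single-channel mask logits
# for a batch of one image; prompts is a list of (x, y) points.
import torch
import torch.nn as nn


def prompt_in_the_loop_step(image, teacher_sam: nn.Module,
                            student_sam: nn.Module, optimizer,
                            init_prompts, rounds: int = 3):
    prompts = list(init_prompts)
    total = 0.0
    for _ in range(rounds):
        with torch.no_grad():
            t_mask = teacher_sam(image, prompts)  # teacher mask logits
        s_mask = student_sam(image, prompts)      # student mask logits
        loss = nn.functional.binary_cross_entropy_with_logits(
            s_mask, torch.sigmoid(t_mask))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
        # Sample a new point prompt where the two masks disagree the most
        # (assumes batch size 1 and a single mask channel).
        disagree = (torch.sigmoid(s_mask.detach()) - torch.sigmoid(t_mask)).abs()
        flat_idx = disagree.flatten().argmax().item()
        _, w = disagree.shape[-2:]
        y, x = divmod(flat_idx, w)
        prompts.append((x, y))
    return total / rounds
```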
- RepViT: Revisiting Mobile CNN From ViT Perspective (arXiv, 2023-07-18)
Lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs).
In this study, we revisit the efficient design of lightweight CNNs from a ViT perspective and emphasize their promising prospects for mobile devices.
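The design direction can be illustrated with a schematic mobile block that separates spatial "token mixing" (a residual depthwise 3x3 convolution) from "channel mixing" (a residual 1x1-conv FFN), mirroring the ViT-style macro structure inside a pure CNN. This is an illustrative block, not the exact RepViT block definition.

```python
# Schematic mobile block in the spirit of the RepViT description: depthwise
# conv as token mixer, pointwise-conv MLP as channel mixer, both residual.
import torch
import torch.nn as nn


class MobileViTStyleBlock(nn.Module):
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        # Spatial mixing: depthwise 3x3 convolution.
        self.token_mixer = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim, bias=False),
            nn.BatchNorm2d(dim),
        )
        # Channel mixing: 1x1-conv "FFN" with expansion.
        self.channel_mixer = nn.Sequential(
            nn.Conv2d(dim, dim * expansion, 1, bias=False),
            nn.BatchNorm2d(dim * expansion),
            nn.GELU(),
            nn.Conv2d(dim * expansion, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.token_mixer(x)
        x = x + self.channel_mixer(x)
        return x


if __name__ == "__main__":
    block = MobileViTStyleBlock(64)
    print(block(torch.randn(1, 64, 56, 56)).shape)  # (1, 64, 56, 56)
```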
- Faster Segment Anything: Towards Lightweight SAM for Mobile Applications (arXiv, 2023-06-25)
This work aims to make Segment Anything Model (SAM) mobile-friendly by replacing the heavyweight image encoder with a lightweight one.
We distill the knowledge from the heavy image encoder into a lightweight image encoder, which remains automatically compatible with the mask decoder in the original SAM.
The resulting lightweight SAM, termed MobileSAM, is more than 60 times smaller yet performs on par with the original SAM.
- Towards Efficient and Scalable Sharpness-Aware Minimization (arXiv, 2022-03-05)
We propose a novel algorithm, LookSAM, that only periodically calculates the inner gradient ascent.
LookSAM achieves similar accuracy gains to SAM while being tremendously faster.
We are the first to successfully scale up the batch size when training Vision Transformers (ViTs).
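A simplified sketch of the idea: classic sharpness-aware minimization performs an inner gradient-ascent step every iteration to find a worst-case weight perturbation, whereas the description above recomputes that inner step only periodically and reuses it in between. The loop below follows that simplification; it is not the paper's exact LookSAM algorithm, which reuses a decomposed gradient component.

```python
# Hedged sketch of a LookSAM-like training loop: the inner ascent perturbation
# is recomputed only every k steps and reused otherwise.
import torch


def looksam_like_training(model, loss_fn, data_loader, base_opt,
                          rho: float = 0.05, k: int = 5):
    cached_eps = None  # last computed weight perturbation
    for step, (x, y) in enumerate(data_loader):
        params = [p for p in model.parameters() if p.requires_grad]

        if step % k == 0:  # inner gradient ascent only every k steps
            loss = loss_fn(model(x), y)
            grads = torch.autograd.grad(loss, params)
            norm = torch.norm(torch.stack([g.norm() for g in grads]))
            cached_eps = [rho * g / (norm + 1e-12) for g in grads]

        # Ascend to the (approximately) worst-case nearby weights...
        with torch.no_grad():
            for p, e in zip(params, cached_eps):
                p.add_(e)
        # ...compute the gradient there...
        base_opt.zero_grad()
        loss_fn(model(x), y).backward()
        # ...then restore the weights and take the descent step.
        with torch.no_grad():
            for p, e in zip(params, cached_eps):
                p.sub_(e)
        base_opt.step()
```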
This list is automatically generated from the titles and abstracts of the papers on this site.