RAP-SAM: Towards Real-Time All-Purpose Segment Anything
- URL: http://arxiv.org/abs/2401.10228v1
- Date: Thu, 18 Jan 2024 18:59:30 GMT
- Title: RAP-SAM: Towards Real-Time All-Purpose Segment Anything
- Authors: Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang,
Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan
Yang
- Abstract summary: The Segment Anything Model (SAM) is a remarkable model that achieves generalized segmentation.
Current real-time segmentation methods typically serve a single purpose, such as semantic segmentation of driving scenes.
This work explores a new real-time segmentation setting, named all-purpose segmentation in real time, to transfer VFMs into real-time deployment.
- Score: 120.17175256421622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advanced by the transformer architecture, vision foundation models (VFMs) achieve
remarkable progress in performance and generalization ability. The Segment Anything
Model (SAM) is one such model, capable of generalized segmentation.
However, most VFMs cannot run in real time, which makes them difficult to transfer
into products. On the other hand, current real-time segmentation methods
typically serve a single purpose, such as semantic segmentation of driving scenes. We
argue that diverse outputs are needed for real applications. Thus, this work
explores a new real-time segmentation setting, named all-purpose segmentation
in real time, to transfer VFMs into real-time deployment. It covers three
different tasks: interactive segmentation, panoptic segmentation, and
video segmentation. We aim to achieve all of these tasks in real time with one model.
We first benchmark several strong baselines. Then, we present
Real-Time All-Purpose SAM (RAP-SAM), which contains an efficient encoder and an
efficient decoupled decoder to perform prompt-driven decoding. Moreover, we
explore different training strategies and tuning methods to further boost
co-training performance. Our code and model are available at
https://github.com/xushilin1/RAP-SAM/.
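The abstract describes the architecture only at a high level: a shared efficient encoder plus a decoupled, prompt-driven decoder serving interactive, panoptic, and video segmentation. The following is a minimal PyTorch sketch of that general pattern; the module names, sizes, and toy backbone are illustrative assumptions, not the released RAP-SAM implementation.
```python
# Illustrative sketch only: a shared encoder with a prompt-driven, query-based
# decoder serving interactive / panoptic / video segmentation from one model.
# All names, sizes, and the toy backbone are assumptions, not the RAP-SAM code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyAllPurposeSegmenter(nn.Module):
    def __init__(self, dim=128, num_queries=50):
        super().__init__()
        # Lightweight "efficient encoder" stand-in (a real model would use a
        # mobile-friendly backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Learned object queries for panoptic / video segmentation.
        self.task_queries = nn.Embedding(num_queries, dim)
        # Turns a 2-D click / point prompt into a query for interactive mode.
        self.prompt_proj = nn.Linear(2, dim)
        # "Decoupled decoder" stand-in: a lightweight transformer decoder refines
        # the queries against pixel features; masks are read out by dot product.
        self.decoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, image, point_prompts=None):
        feats = self.encoder(image)                     # (B, C, H, W)
        B, C, H, W = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)       # (B, HW, C)
        if point_prompts is None:                       # panoptic / video path
            queries = self.task_queries.weight.unsqueeze(0).expand(B, -1, -1)
        else:                                           # interactive path
            queries = self.prompt_proj(point_prompts)   # (B, P, C)
        queries = self.decoder(queries, tokens)
        # Mask logits: similarity between each query and each pixel feature.
        masks = torch.einsum("bqc,bchw->bqhw", queries, feats)
        return F.interpolate(masks, scale_factor=8, mode="bilinear")


model = ToyAllPurposeSegmenter()
img = torch.randn(1, 3, 256, 256)
print(model(img).shape)                                 # panoptic-style masks
print(model(img, torch.rand(1, 3, 2)).shape)            # three click prompts
```
In this sketch the only difference between the interactive path and the panoptic/video path is where the queries come from (prompt embeddings versus learned task queries), which is the essence of prompt-driven decoding behind a shared decoder.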
Related papers
- SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation [51.90445260276897]
We prove that the Segment Anything Model 2 (SAM2) can be a strong encoder for U-shaped segmentation models.
We propose a simple but effective framework, termed SAM2-UNet, for versatile image segmentation.
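The summary gives only the overall recipe: reuse a strong pretrained hierarchical encoder and attach a U-shaped decoder with skip connections. A hedged sketch of that pattern follows, with a toy stand-in for the SAM2/Hiera backbone, since the real encoder, its feature shapes, and any adapters are not reproduced here.
```python
# Sketch of the "strong pretrained encoder + U-shaped decoder" recipe.
# `ToyHierarchicalEncoder` is a stand-in for a SAM2/Hiera backbone; a real
# model would load pretrained weights and typically keep them frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyHierarchicalEncoder(nn.Module):
    """Returns multi-scale features at 1/4, 1/8, and 1/16 resolution."""
    def __init__(self, dims=(32, 64, 128)):
        super().__init__()
        self.stage1 = nn.Conv2d(3, dims[0], 3, stride=4, padding=1)
        self.stage2 = nn.Conv2d(dims[0], dims[1], 3, stride=2, padding=1)
        self.stage3 = nn.Conv2d(dims[1], dims[2], 3, stride=2, padding=1)

    def forward(self, x):
        c1 = F.relu(self.stage1(x))
        c2 = F.relu(self.stage2(c1))
        c3 = F.relu(self.stage3(c2))
        return c1, c2, c3


class UShapedSegmenter(nn.Module):
    def __init__(self, num_classes=1, dims=(32, 64, 128)):
        super().__init__()
        self.encoder = ToyHierarchicalEncoder(dims)
        for p in self.encoder.parameters():   # treat the encoder as frozen
            p.requires_grad = False
        self.up2 = nn.Conv2d(dims[2] + dims[1], dims[1], 3, padding=1)
        self.up1 = nn.Conv2d(dims[1] + dims[0], dims[0], 3, padding=1)
        self.head = nn.Conv2d(dims[0], num_classes, 1)

    def forward(self, x):
        c1, c2, c3 = self.encoder(x)
        d2 = F.relu(self.up2(torch.cat(
            [F.interpolate(c3, size=c2.shape[-2:], mode="bilinear"), c2], dim=1)))
        d1 = F.relu(self.up1(torch.cat(
            [F.interpolate(d2, size=c1.shape[-2:], mode="bilinear"), c1], dim=1)))
        return F.interpolate(self.head(d1), size=x.shape[-2:], mode="bilinear")


print(UShapedSegmenter()(torch.randn(1, 3, 256, 256)).shape)  # (1, 1, 256, 256)
```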
arXiv Detail & Related papers (2024-08-16T17:55:38Z)
- SAM 2: Segment Anything in Images and Videos [63.44869623822368]
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos.
We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date.
Our model is a simple transformer architecture with streaming memory for real-time video processing.
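The streaming-memory design is only named here, not detailed. Below is a rough, assumed sketch of the general pattern, a bounded bank of past-frame tokens that each incoming frame cross-attends to; it should not be read as the SAM 2 implementation.
```python
# Rough sketch of a streaming-memory pattern for video segmentation: keep a
# bounded bank of past-frame tokens and let each new frame cross-attend to it.
# Purely illustrative; not the SAM 2 code.
from collections import deque
import torch
import torch.nn as nn


class StreamingMemorySegmenter(nn.Module):
    def __init__(self, dim=64, memory_frames=6):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, 8, stride=8)           # frame -> tokens
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.mask_head = nn.Linear(dim, 1)
        self.memory = deque(maxlen=memory_frames)             # FIFO memory bank

    @torch.no_grad()
    def step(self, frame):
        """Process one incoming square frame; returns a coarse mask logit map."""
        tokens = self.embed(frame).flatten(2).transpose(1, 2)  # (1, N, C)
        if self.memory:
            bank = torch.cat(list(self.memory), dim=1)         # (1, M*N, C)
            tokens, _ = self.cross_attn(tokens, bank, bank)    # condition on past
        self.memory.append(tokens.detach())                    # update memory
        h = w = frame.shape[-1] // 8
        return self.mask_head(tokens).transpose(1, 2).reshape(1, 1, h, w)


seg = StreamingMemorySegmenter()
for t in range(4):                                             # simulated stream
    mask = seg.step(torch.randn(1, 3, 128, 128))
print(mask.shape)                                              # (1, 1, 16, 16)
```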
arXiv Detail & Related papers (2024-08-01T17:00:08Z)
- Moving Object Segmentation: All You Need Is SAM (and Flow) [82.78026782967959]
We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects.
In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt.
These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks.
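The two configurations are described only at a high level. The sketch below illustrates the second one (RGB input, with a point prompt derived from the optical-flow magnitude peak) using the public `segment_anything` API; the checkpoint path and the flow-estimation step are placeholders, and the paper's actual prompting strategy may differ. The first configuration would instead render the flow field as a 3-channel image and pass that to `set_image`.
```python
# Sketch of "flow as a segmentation prompt": pick the pixel with the largest
# flow magnitude as a positive point prompt for SAM. Checkpoint path and the
# flow-estimation step are placeholders; this is not the paper's exact code.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)


def segment_moving_object(rgb_frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """rgb_frame: (H, W, 3) uint8; flow: (H, W, 2) from any flow estimator."""
    magnitude = np.linalg.norm(flow, axis=-1)          # motion strength per pixel
    y, x = np.unravel_index(np.argmax(magnitude), magnitude.shape)
    predictor.set_image(rgb_frame)                     # SAM still sees RGB
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),               # flow peak as the prompt
        point_labels=np.array([1]),                    # 1 = foreground click
        multimask_output=True,
    )
    return masks[np.argmax(scores)]                    # keep the best-scoring mask
```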
arXiv Detail & Related papers (2024-04-18T17:59:53Z)
- The revenge of BiSeNet: Efficient Multi-Task Image Segmentation [6.172605433695617]
BiSeNetFormer is a novel architecture for efficient multi-task image segmentation.
By supporting multiple segmentation tasks within a single efficient architecture, BiSeNetFormer offers a versatile solution for multi-task segmentation.
Our results indicate that BiSeNetFormer represents a significant advancement towards fast, efficient, and multi-task segmentation networks.
arXiv Detail & Related papers (2024-04-15T08:32:18Z)
- OMG-Seg: Is One Model Good Enough For All Segmentation? [83.17068644513144]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs.
We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
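The entry names the key design, task-specific queries on top of a shared transformer encoder-decoder, without further detail. A minimal, assumed sketch of that sharing pattern follows; the names and sizes are illustrative, not the OMG-Seg code.
```python
# Illustrative sketch of "one shared decoder, task-specific queries": each task
# owns its own learned query set, while encoder and decoder weights are shared.
# Module names and sizes are assumptions, not the OMG-Seg implementation.
import torch
import torch.nn as nn


class SharedDecoderMultiTask(nn.Module):
    def __init__(self, tasks=("semantic", "instance", "video"), dim=128, queries=32):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, 16, stride=16)          # toy tokenizer
        self.task_queries = nn.ParameterDict({
            t: nn.Parameter(torch.randn(queries, dim)) for t in tasks
        })
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=2
        )
        self.cls_head = nn.Linear(dim, 100)                      # shared class head

    def forward(self, image, task: str):
        feats = self.encoder(image)                              # (B, C, H, W)
        tokens = feats.flatten(2).transpose(1, 2)                # (B, HW, C)
        q = self.task_queries[task].unsqueeze(0).expand(image.shape[0], -1, -1)
        q = self.decoder(q, tokens)                              # shared weights
        masks = torch.einsum("bqc,bchw->bqhw", q, feats)         # per-query masks
        return masks, self.cls_head(q)                           # masks + classes


model = SharedDecoderMultiTask()
x = torch.randn(1, 3, 256, 256)
for task in ("semantic", "instance", "video"):
    masks, logits = model(x, task)
    print(task, masks.shape, logits.shape)
```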
arXiv Detail & Related papers (2024-01-18T18:59:34Z)
- TinySAM: Pushing the Envelope for Efficient Segment Anything Model [76.21007576954035]
We propose a framework to obtain a tiny segment anything model (TinySAM) while maintaining the strong zero-shot performance.
We first propose a full-stage knowledge distillation method with hard prompt sampling and hard mask weighting strategy to distill a lightweight student model.
We also adapt the post-training quantization to the promptable segmentation task and further reduce the computational cost.
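The summary lists the ingredients (hard prompt sampling, hard mask weighting, post-training quantization) without defining them. The sketch below shows one plausible reading of "hard mask weighting", up-weighting masks on which the student currently disagrees most with the teacher; the actual TinySAM criterion may differ.
```python
# Sketch of a distillation step with "hard mask weighting": masks on which the
# student disagrees most with the teacher get a larger loss weight. This is one
# plausible reading of the summary, not the exact TinySAM objective.
import torch
import torch.nn.functional as F


def distill_step(student_logits, teacher_logits, top_frac=0.5):
    """Both tensors: (num_masks, H, W) mask logits for the same prompts."""
    with torch.no_grad():
        target = torch.sigmoid(teacher_logits)                  # soft teacher masks
        # Per-mask "hardness": current student/teacher disagreement.
        hardness = (torch.sigmoid(student_logits) - target).abs().mean(dim=(1, 2))
        k = max(1, int(top_frac * hardness.numel()))
        weights = torch.zeros_like(hardness)
        weights[hardness.topk(k).indices] = 1.0                 # keep hardest masks
    per_mask = F.binary_cross_entropy_with_logits(
        student_logits, target, reduction="none").mean(dim=(1, 2))
    return (weights * per_mask).sum() / weights.sum()


student = torch.randn(8, 64, 64, requires_grad=True)            # stand-in logits
teacher = torch.randn(8, 64, 64)
loss = distill_step(student, teacher)
loss.backward()
print(float(loss))
```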
arXiv Detail & Related papers (2023-12-21T12:26:11Z)
- You Only Look at Once for Real-time and Generic Multi-Task [20.61477620156465]
A-YOLOM is an adaptive, real-time, and lightweight multi-task model.
We develop an end-to-end multi-task model with a unified and streamlined segmentation structure.
We achieve competitive results on the BDD100k dataset.
arXiv Detail & Related papers (2023-10-02T21:09:43Z)
- Fast Segment Anything [46.130784421779865]
The recently proposed segment anything model (SAM) has had a significant influence on many computer vision tasks.
However, its huge computation cost prevents wider application in industry scenarios.
We propose a faster alternative method for this fundamental task with comparable performance.
arXiv Detail & Related papers (2023-06-21T10:08:29Z)
- DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer [58.95404214273222]
Most state-of-the-art instance segmentation methods rely on large amounts of pixel-precise ground-truth for training.
We introduce a more efficient approach, called DynaMITe, in which we represent user interactions as spatio-temporal queries.
Our architecture also alleviates the need to re-compute image features during refinement, and requires fewer interactions for segmenting multiple instances in a single image.
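The summary's two claims (interactions become queries; image features are never re-computed) can be sketched as follows. The class, its names, and its sizes are assumptions for illustration, not the DynaMITe architecture.
```python
# Sketch of the "clicks as queries, no feature re-computation" idea: image
# features are extracted once and cached; each new user click only adds a query
# and re-runs a lightweight decoder. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class ClickRefiner(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, 8, stride=8)          # run exactly once
        self.click_embed = nn.Linear(3, dim)                    # (x, y, pos/neg)
        self.decoder = nn.TransformerDecoderLayer(dim, 4, batch_first=True)
        self.cached_tokens = None
        self.click_queries = []

    def set_image(self, image):
        feats = self.backbone(image)                            # (1, C, H, W)
        self.cached_tokens = feats.flatten(2).transpose(1, 2)   # cache features
        self.feat_shape = feats.shape[-2:]
        self.click_queries.clear()

    def add_click(self, x, y, positive=True):
        """Each interaction becomes one more query; only the decoder re-runs."""
        c = torch.tensor([[x, y, 1.0 if positive else 0.0]])
        self.click_queries.append(self.click_embed(c))
        q = torch.cat(self.click_queries).unsqueeze(0)          # (1, n_clicks, C)
        q = self.decoder(q, self.cached_tokens)
        h, w = self.feat_shape
        feats = self.cached_tokens.transpose(1, 2).reshape(1, -1, h, w)
        return torch.einsum("bqc,bchw->bqhw", q, feats).max(dim=1).values


refiner = ClickRefiner()
refiner.set_image(torch.randn(1, 3, 128, 128))
mask = refiner.add_click(0.3, 0.6)                              # first click
mask = refiner.add_click(0.8, 0.2, positive=False)              # refinement click
print(mask.shape)                                               # (1, 16, 16)
```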
arXiv Detail & Related papers (2023-04-13T16:57:02Z)
- BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video [58.71785546245467]
Multiple existing benchmarks involve tracking and segmenting objects in video.
There is little interaction between them due to the use of disparate benchmark datasets and metrics.
We propose BURST, a dataset which contains thousands of diverse videos with high-quality object masks.
All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison.
arXiv Detail & Related papers (2022-09-25T01:27:35Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
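The temporal aggregation and dynamic template matching are only named here. A toy sketch of a time-evolving template-matching step (correlate a template feature with the frame's feature map, then update the template with a moving average) is shown below; it is an assumed illustration, not the paper's method.
```python
# Sketch of a "time-evolving template matching" idea: keep a template feature
# for the target, correlate it with each frame's feature map, and update the
# template with an exponential moving average. Illustrative only.
import torch
import torch.nn.functional as F


def match_and_update(frame_feats, template, momentum=0.9):
    """frame_feats: (C, H, W); template: (C,). Returns (score_map, new_template)."""
    score = torch.einsum("c,chw->hw", template, F.normalize(frame_feats, dim=0))
    # Pool the feature at the best-matching location to evolve the template.
    idx = torch.argmax(score)
    y, x = divmod(int(idx), score.shape[1])
    new_template = momentum * template + (1 - momentum) * frame_feats[:, y, x]
    return score, F.normalize(new_template, dim=0)


template = F.normalize(torch.randn(64), dim=0)
for _ in range(5):                                   # simulated video stream
    feats = torch.randn(64, 32, 32)
    score, template = match_and_update(feats, template)
print(score.shape)                                   # (32, 32)
```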
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.