Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts
- URL: http://arxiv.org/abs/2404.00741v1
- Date: Sun, 31 Mar 2024 17:02:24 GMT
- Title: Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts
- Authors: Qin Liu, Jaemin Cho, Mohit Bansal, Marc Niethammer,
- Abstract summary: Low-latency and high-quality interactive segmentation with diverse prompts are challenging for specialist and generalist models.
We propose SegNext, a next-generation interactive segmentation approach offering low latency, high quality, and diverse prompt support.
Our method outperforms current state-of-the-art methods on HQSeg-44K and DAVIS, both quantitatively and qualitatively.
- Score: 68.86537322287474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of interactive image segmentation is to delineate specific regions within an image via visual or language prompts. Low-latency and high-quality interactive segmentation with diverse prompts remain challenging for existing specialist and generalist models. Specialist models, with their limited prompts and task-specific designs, experience high latency because the image must be recomputed every time the prompt is updated, due to the joint encoding of image and visual prompts. Generalist models, exemplified by the Segment Anything Model (SAM), have recently excelled in prompt diversity and efficiency, lifting image segmentation to the foundation model era. However, for high-quality segmentations, SAM still lags behind state-of-the-art specialist models despite SAM being trained with x100 more segmentation masks. In this work, we delve deep into the architectural differences between the two types of models. We observe that dense representation and fusion of visual prompts are the key design choices contributing to the high segmentation quality of specialist models. In light of this, we reintroduce this dense design into the generalist models, to facilitate the development of generalist models with high segmentation quality. To densely represent diverse visual prompts, we propose to use a dense map to capture five types: clicks, boxes, polygons, scribbles, and masks. Thus, we propose SegNext, a next-generation interactive segmentation approach offering low latency, high quality, and diverse prompt support. Our method outperforms current state-of-the-art methods on HQSeg-44K and DAVIS, both quantitatively and qualitatively.
Related papers
- Few-Shot Medical Image Segmentation with High-Fidelity Prototypes [38.073371773707514]
We propose a novel Detail Self-refined Prototype Network (DSPNet) to construct high-fidelity prototypes representing the object foreground and the background more comprehensively.
To construct global semantics while maintaining the captured detail semantics, we learn the foreground prototypes by modelling the multi-modal structures with clustering and then fusing each in a channel-wise manner.
arXiv Detail & Related papers (2024-06-26T05:06:14Z) - VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model [76.02314305164595]
This work presents a novel image outpainting framework that is capable of customizing the results according to the requirement of users.
We take advantage of a Multimodal Large Language Model (MLLM) that automatically extracts and organizes the corresponding textual descriptions of the masked and unmasked part of a given image.
In addition, a special Cross-Attention module, namely Center-Total-Surrounding (CTS), is elaborately designed to enhance further the the interaction between specific space regions of the image and corresponding parts of the text prompts.
arXiv Detail & Related papers (2024-06-03T07:14:19Z) - IFSENet : Harnessing Sparse Iterations for Interactive Few-shot Segmentation Excellence [2.822194296769473]
Few-shot segmentation techniques reduce the required number of images to learn to segment a new class.
interactive segmentation techniques only focus on incrementally improving the segmentation of one object at a time.
We combine the two concepts to drastically reduce the effort required to train segmentation models for novel classes.
arXiv Detail & Related papers (2024-03-22T10:15:53Z) - Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-a segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z) - A Simple and Robust Framework for Cross-Modality Medical Image
Segmentation applied to Vision Transformers [0.0]
We propose a simple framework to achieve fair image segmentation of multiple modalities using a single conditional model.
We show that our framework outperforms other cross-modality segmentation methods on the Multi-Modality Whole Heart Conditional Challenge.
arXiv Detail & Related papers (2023-10-09T09:51:44Z) - Self-Prompting Large Vision Models for Few-Shot Medical Image
Segmentation [14.135249795318591]
We propose a novel perspective on self-prompting in medical vision applications.
We harness the embedding space of the Segment Anything Model to prompt itself through a simple yet effective linear pixel-wise classifier.
We achieve competitive results on multiple datasets.
arXiv Detail & Related papers (2023-08-15T08:20:07Z) - Segment Everything Everywhere All at Once [124.90835636901096]
We present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image.
We propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks.
We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks.
arXiv Detail & Related papers (2023-04-13T17:59:40Z) - Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge
Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism extending CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z) - Prototype Mixture Models for Few-shot Semantic Segmentation [50.866870384596446]
Few-shot segmentation is challenging because objects within the support and query images could significantly differ in appearance and pose.
We propose prototype mixture models (PMMs), which correlate diverse image regions with multiple prototypes to enforce the prototype-based semantic representation.
PMMs improve 5-shot segmentation performance on MS-COCO by up to 5.82% with only a moderate cost for model size and inference speed.
arXiv Detail & Related papers (2020-08-10T04:33:17Z) - Evolution of Image Segmentation using Deep Convolutional Neural Network:
A Survey [0.0]
We take a glance at the evolution of both semantic and instance segmentation work based on CNN.
We have given a glimpse of some state-of-the-art panoptic segmentation models.
arXiv Detail & Related papers (2020-01-13T06:07:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.