Related papers: OMG-Seg: Is One Model Good Enough For All Segmentation?

URL: http://arxiv.org/abs/2401.10229v2
Date: Tue, 01 Oct 2024 05:56:05 GMT
Title: OMG-Seg: Is One Model Good Enough For All Segmentation?
Authors: Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, Chen Change Loy
Abstract summary: OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs. We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
Score: 83.17068644513144
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open vocabulary settings, prompt-driven, interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve satisfactory performance. We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg.

Related papers

CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models [2.331828779757202]
We present CALICO, the first Large Vision-Language Model (LVLM) designed for multi-image part-level reasoning segmentation. CALICO features two key components: a novel Correspondence Extraction Module that identifies semantic part-level correspondences, and Adaptation Correspondence Modules that embed this information into the LVLM. We show that CALICO, with just 0.3% of its parameters finetuned, achieves strong performance on this challenging task.
arXiv Detail & Related papers (2024-12-26T18:59:37Z)
ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation [14.534308478766476]
We introduce ViCaS, a new dataset containing thousands of challenging videos. Our benchmark evaluates models on holistic/high-level understanding and language-guided, pixel-precise segmentation.
arXiv Detail & Related papers (2024-12-12T23:10:54Z)
RAP-SAM: Towards Real-Time All-Purpose Segment Anything [120.17175256421622]
The Segment Anything Model (SAM) is a remarkable model that achieves generalized segmentation. Current real-time segmentation methods mainly serve a single purpose, such as semantic segmentation of driving scenes. This work explores a new real-time segmentation setting, named all-purpose segmentation in real time, to transfer vision foundation models (VFMs) to real-time deployment.
arXiv Detail & Related papers (2024-01-18T18:59:30Z)
You Only Look at Once for Real-time and Generic Multi-Task [20.61477620156465]
A-YOLOM is an adaptive, real-time, and lightweight multi-task model. We develop an end-to-end multi-task model with a unified and streamlined segmentation structure. We achieve competitive results on the BDD100k dataset.
arXiv Detail & Related papers (2023-10-02T21:09:43Z)
Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA). It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z)
AIMS: All-Inclusive Multi-Level Segmentation [93.5041381700744]
We propose a new task, All-Inclusive Multi-Level Segmentation (AIMS), which segments visual regions into three levels: part, entity, and relation. We also build a unified AIMS model through multi-dataset multi-task training to address the two major challenges of annotation inconsistency and task correlation.
arXiv Detail & Related papers (2023-05-28T16:28:49Z)
BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video [58.71785546245467]
Multiple existing benchmarks involve tracking and segmenting objects in video. There is little interaction between them due to the use of disparate benchmark datasets and metrics. We propose BURST, a dataset which contains thousands of diverse videos with high-quality object masks. All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison.
arXiv Detail & Related papers (2022-09-25T01:27:35Z)
Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation. We introduce a scalable pipeline for generating synthetic training data with multiple objects. We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
Evolution of Image Segmentation using Deep Convolutional Neural Network: A Survey [0.0]
We take a glance at the evolution of both semantic and instance segmentation work based on CNN. We have given a glimpse of some state-of-the-art panoptic segmentation models.
arXiv Detail & Related papers (2020-01-13T06:07:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.