Related papers: General and Task-Oriented Video Segmentation

URL: http://arxiv.org/abs/2407.06540v1
Date: Tue, 9 Jul 2024 04:21:38 GMT
Title: General and Task-Oriented Video Segmentation
Authors: Mu Chen, Liulei Li, Wenguan Wang, Ruijie Quan, Yi Yang
Abstract summary: We present GvSeg, a general video segmentation framework for addressing four different video segmentation tasks. GvSeg provides a holistic disentanglement and modeling for segment targets, thoroughly examining them from the perspective of appearance, position, and shape. Extensive experiments on seven gold-standard benchmark datasets demonstrate that GvSeg surpasses all existing specialized/general solutions.
Score: 60.58054218592606
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present GvSeg, a general video segmentation framework for addressing four different video segmentation tasks (i.e., instance, semantic, panoptic, and exemplar-guided) while maintaining an identical architectural design. Currently, there is a trend towards developing general video segmentation solutions that can be applied across multiple tasks. This streamlines research endeavors and simplifies deployment. However, such a highly homogenized framework in current design, where each element maintains uniformity, could overlook the inherent diversity among different tasks and lead to suboptimal performance. To tackle this, GvSeg: i) provides a holistic disentanglement and modeling for segment targets, thoroughly examining them from the perspective of appearance, position, and shape, and on this basis, ii) reformulates the query initialization, matching and sampling strategies in alignment with the task-specific requirement. These architecture-agnostic innovations empower GvSeg to effectively address each unique task by accommodating the specific properties that characterize them. Extensive experiments on seven gold-standard benchmark datasets demonstrate that GvSeg surpasses all existing specialized/general solutions by a significant margin on four different video segmentation tasks.

Related papers

Tracking and Segmenting Anything in Any Modality [75.32774085793498]
We propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.
arXiv Detail & Related papers (2025-11-22T09:09:22Z)
Improving Generalized Visual Grounding with Instance-aware Joint Learning [45.53531162436934]
Generalized visual grounding tasks are designed to accommodate multi-target and non-target scenarios. We propose InstanceVG, a framework equipped with instance-aware capabilities to tackle both GREC and GRES. To instantiate the framework, we assign each instance query a prior reference point, which also serves as an additional basis for target matching.
arXiv Detail & Related papers (2025-09-17T07:00:51Z)
GiT: Towards Generalist Vision Transformer through Universal Language Interface [94.33443158125186]
This paper proposes a simple, yet effective framework, called GiT, simultaneously applicable for various vision tasks only with a vanilla ViT. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-03-14T13:47:41Z)
OMG-Seg: Is One Model Good Enough For All Segmentation? [83.17068644513144]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs. We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
arXiv Detail & Related papers (2024-01-18T18:59:34Z)
General Object Foundation Model for Images and Videos at Scale [99.2806103051613]
We present GLEE, an object-level foundation model for locating and identifying objects in images and videos. GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open-world scenario. We employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling it to simultaneously solve various object-centric downstream tasks.
arXiv Detail & Related papers (2023-12-14T17:26:00Z)
AIMS: All-Inclusive Multi-Level Segmentation [93.5041381700744]
We propose a new task, All-Inclusive Multi-Level Segmentation (AIMS), which segments visual regions into three levels: part, entity, and relation. We also build a unified AIMS model through multi-dataset multi-task training to address the two major challenges of annotation inconsistency and task correlation.
arXiv Detail & Related papers (2023-05-28T16:28:49Z)
Segment Everything Everywhere All at Once [124.90835636901096]
We present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image. We propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks. We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks.
arXiv Detail & Related papers (2023-04-13T17:59:40Z)
FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation [42.89720785573885]
FreeSeg is a generic framework to accomplish Unified, Universal and Open-Vocabulary Image Segmentation. We show that FreeSeg establishes new state-of-the-art results in performance and generalization on three segmentation tasks.
arXiv Detail & Related papers (2023-03-30T08:42:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.