Slot-VPS: Object-centric Representation Learning for Video Panoptic
Segmentation
- URL: http://arxiv.org/abs/2112.08949v1
- Date: Thu, 16 Dec 2021 15:12:22 GMT
- Title: Slot-VPS: Object-centric Representation Learning for Video Panoptic
Segmentation
- Authors: Yi Zhou, Hui Zhang, Hana Lee, Shuyang Sun, Pingjun Li, Yangguang Zhu,
ByungIn Yoo, Xiaojuan Qi, Jae-Joon Han
- Abstract summary: Video Panoptic Segmentation (VPS) aims to assign a class label to each pixel and to uniquely segment and identify all object instances consistently across all frames.
We present Slot-VPS, the first end-to-end framework for this task.
We encode all panoptic entities in a video, including instances and background semantics, with a unified representation called panoptic slots.
Coherent spatio-temporal object information is retrieved and encoded into the panoptic slots by the proposed Video Panoptic Retriever, enabling it to localize, segment, differentiate, and associate objects in a unified manner.
- Score: 29.454785969084384
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Panoptic Segmentation (VPS) aims at assigning a class label to each
pixel, uniquely segmenting and identifying all object instances consistently
across all frames. Classic solutions usually decompose the VPS task into
several sub-tasks and utilize multiple surrogates (e.g. boxes and masks,
centres and offsets) to represent objects. However, this divide-and-conquer
strategy requires complex post-processing in both spatial and temporal domains
and is vulnerable to failures from surrogate tasks. In this paper, inspired by
object-centric learning which learns compact and robust object representations,
we present Slot-VPS, the first end-to-end framework for this task. We encode
all panoptic entities in a video, including both foreground instances and
background semantics, with a unified representation called panoptic slots. The
coherent spatio-temporal object information is retrieved and encoded into the
panoptic slots by the proposed Video Panoptic Retriever, enabling it to
localize, segment, differentiate, and associate objects in a unified manner.
Finally, the output panoptic slots can be directly converted into the class,
mask, and object ID of panoptic objects in the video. We conduct extensive
ablation studies and demonstrate the effectiveness of our approach on two
benchmark datasets, Cityscapes-VPS (val and test sets) and VIPER
(val set), achieving new state-of-the-art performance of 63.7, 63.3
and 56.2 VPQ, respectively.
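To make the mechanism concrete, here is a minimal sketch of how a set of learned panoptic slots could cross-attend to per-frame features and be decoded into classes, masks, and IDs. It is an illustration under assumed shapes and module choices (slot count, feature dimension, attention layout), not the authors' implementation.

```python
import torch
import torch.nn as nn

class VideoPanopticRetriever(nn.Module):
    """Hypothetical sketch: panoptic slots cross-attend to frame features."""
    def __init__(self, num_slots=100, dim=256, num_classes=19):
        super().__init__()
        self.slots = nn.Embedding(num_slots, dim)  # learned panoptic slots
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(),
                                 nn.Linear(dim * 4, dim))
        self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_proj = nn.Linear(dim, dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T, H*W, D) backbone features for T frames
        B, T, N, D = frame_feats.shape
        slots = self.slots.weight.unsqueeze(0).expand(B, -1, -1)  # (B, S, D)
        outputs = []
        for t in range(T):
            # retrieve this frame's object information into the slots
            upd, _ = self.attn(slots, frame_feats[:, t], frame_feats[:, t])
            slots = slots + upd
            slots = slots + self.ffn(slots)
            cls_logits = self.cls_head(slots)                 # (B, S, C+1)
            mask_logits = torch.einsum('bsd,bnd->bsn',
                                       self.mask_proj(slots),
                                       frame_feats[:, t])     # (B, S, H*W)
            outputs.append((cls_logits, mask_logits))
        return outputs
```

Because the same slot carries one entity through time in this sketch, the slot index itself can serve as the tracking ID, which is what removes the separate association step.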
Related papers
- 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We train our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Segmentation challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z)
- OW-VISCap: Open-World Video Instance Segmentation and Captioning [95.6696714640357]
We propose an approach to jointly segment, track, and caption previously seen or unseen objects in a video.
We generate rich descriptive and object-centric captions for each detected object via a masked attention augmented LLM input.
Our approach matches or surpasses state-of-the-art on three tasks.
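As a rough sketch of the masked-attention idea (the function name, shapes, and the non-empty-mask assumption are mine, not the paper's code), attention scores outside an object's mask can be set to -inf so that each pooled token describes only that object:

```python
import torch

def masked_object_tokens(queries, feats, masks):
    """Attention pooling restricted to each object's mask (hypothetical sketch).

    queries: (S, D) one query per detected object
    feats:   (N, D) flattened image features (N = H*W)
    masks:   (S, N) binary object masks; assumes each mask is non-empty,
             otherwise the softmax over all -inf scores yields NaN
    returns: (S, D) object-centric tokens, e.g. for an LLM caption prefix
    """
    scores = queries @ feats.t() / feats.shape[-1] ** 0.5   # (S, N)
    scores = scores.masked_fill(masks == 0, float('-inf')) # attend inside the mask only
    attn = scores.softmax(dim=-1)
    return attn @ feats                                     # (S, D)
```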
arXiv Detail & Related papers (2024-04-04T17:59:58Z)
- Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation [47.39455910191075]
Video amodal segmentation is a challenging task in computer vision.
Recent studies have achieved promising performance by using motion flow to integrate information across frames under a self-supervised setting.
This paper presents a rethinking of previous works; in particular, we leverage supervised signals with an object-centric representation.
arXiv Detail & Related papers (2023-09-23T04:12:02Z)
- Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation [76.40565872257709]
We develop a unified framework that couples mask embedding with cross-frame dense correspondence for locally discriminative feature learning.
It is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos.
Our algorithm sets a new state of the art on two standard benchmarks (DAVIS17 and YouTube-VOS).
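The correspondence half can be pictured as affinity-based label propagation; below is a minimal sketch under my own assumptions (L2-normalized features and a softmax temperature), not the paper's implementation:

```python
import torch

def propagate_mask(feat_ref, feat_tgt, mask_ref, temperature=0.07):
    """Propagate a reference-frame mask to a target frame via dense correspondence.

    feat_ref: (N, D) L2-normalized features of the reference frame (N = H*W)
    feat_tgt: (M, D) L2-normalized features of the target frame
    mask_ref: (N, K) soft one-hot labels for K objects in the reference frame
    returns:  (M, K) predicted soft labels in the target frame
    """
    affinity = feat_tgt @ feat_ref.t() / temperature  # (M, N) cross-frame similarity
    weights = affinity.softmax(dim=-1)  # each target pixel picks reference pixels
    return weights @ mask_ref           # transfer labels along correspondences
```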
arXiv Detail & Related papers (2023-03-17T16:23:36Z)
- Towards Robust Video Object Segmentation with Adaptive Object Calibration [18.094698623128146]
Video object segmentation (VOS) aims at segmenting objects in all target frames of a video, given annotated object masks of reference frames.
We propose a new deep network, which can adaptively construct object representations and calibrate object masks to achieve stronger robustness.
Our model achieves the state-of-the-art performance among existing published works, and also exhibits superior robustness against perturbations.
arXiv Detail & Related papers (2022-07-02T17:51:29Z)
- The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment object instances in a given video referred by a language expression in all video frames.
This work proposes several tricks to boost performance further, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference.
The improved ReferFormer ranks 2nd place on CVPR2022 Referring Youtube-VOS Challenge.
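Of these tricks, cyclical learning rates are the most mechanical to reproduce; a minimal PyTorch sketch follows, where the stand-in model and all hyperparameter values are illustrative choices of mine, not the solution's:

```python
import torch

model = torch.nn.Linear(256, 1)  # stand-in for the actual ReferFormer model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
# the learning rate oscillates between base_lr and max_lr, which can help
# training escape plateaus compared to a monotone decay
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-5, max_lr=1e-3, step_size_up=2000, mode='triangular2')

for step in range(1000):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 256)).pow(2).mean()  # dummy objective
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the cyclical schedule every iteration
```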
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
- Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach that achieves instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, with the lowest run-time among contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
- Merging Tasks for Video Panoptic Segmentation [0.0]
Video panoptic segmentation (VPS) is a recently introduced computer vision task that requires classifying and tracking every pixel in a given video.
To understand video panoptic segmentation, the earlier-introduced constituent tasks, which focus on semantics and tracking separately, are first studied.
Two data-driven approaches that do not require training on a tailored dataset are then selected to solve it.
arXiv Detail & Related papers (2021-07-10T08:46:42Z)
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
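A hedged sketch of that second stage: encode tracklet and word embeddings jointly with a Transformer and score each tracklet against the expression. All module names, layer counts, and sizes here are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TrackletLanguageGrounding(nn.Module):
    """Hypothetical sketch of Transformer-based tracklet-language grounding."""
    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score = nn.Linear(dim, 1)

    def forward(self, tracklet_emb, text_emb):
        # tracklet_emb: (B, K, D) one embedding per object tracklet
        # text_emb:     (B, L, D) token embeddings of the referring expression
        x = torch.cat([tracklet_emb, text_emb], dim=1)  # one joint sequence
        x = self.encoder(x)  # models instance-level and cross-modal relations
        scores = self.score(x[:, :tracklet_emb.size(1)])  # score each tracklet
        return scores.squeeze(-1).argmax(dim=-1)  # index of the referred tracklet
```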
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
- Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of panoptic segmentation, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.