Mask2Former for Video Instance Segmentation
- URL: http://arxiv.org/abs/2112.10764v1
- Date: Mon, 20 Dec 2021 18:59:59 GMT
- Title: Mask2Former for Video Instance Segmentation
- Authors: Bowen Cheng and Anwesa Choudhuri and Ishan Misra and Alexander
Kirillov and Rohit Girdhar and Alexander G. Schwing
- Abstract summary: Mask2Former achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline.
We show universal image segmentation architectures trivially generalize to video segmentation by directly predicting 3D segmentation volumes.
- Score: 172.10001340104515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We find Mask2Former also achieves state-of-the-art performance on video
instance segmentation without modifying the architecture, the loss or even the
training pipeline. In this report, we show universal image segmentation
architectures trivially generalize to video segmentation by directly predicting
3D segmentation volumes. Specifically, Mask2Former sets a new state-of-the-art
of 60.4 AP on YouTubeVIS-2019 and 52.6 AP on YouTubeVIS-2021. We believe
Mask2Former is also capable of handling video semantic and panoptic
segmentation, given its versatility in image segmentation. We hope this will
make state-of-the-art video segmentation research more accessible and bring
more attention to designing universal image and video segmentation
architectures.
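The phrase "directly predicting 3D segmentation volumes" can be illustrated with a small sketch. It assumes the usual mask-classification formulation, in which each query embedding is dot-multiplied with per-pixel embeddings to give mask logits; the abstract itself does not spell this out. The function name, shapes, and random inputs below are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def predict_mask_volumes(query_embed, pixel_embed):
    """Hypothetical helper: one mask volume per query.

    query_embed: (N, C) array, one embedding per instance query.
    pixel_embed: (T, C, H, W) array, per-pixel embeddings for T frames.
    Returns mask logits of shape (N, T, H, W), i.e. each query yields
    a single spatio-temporal (3D) volume covering the whole clip.
    """
    # Dot product over the channel axis C, for every frame and pixel.
    return np.einsum("nc,tchw->nthw", query_embed, pixel_embed)

# Toy sizes chosen only for illustration (assumption).
rng = np.random.default_rng(0)
N, C, T, H, W = 3, 8, 4, 5, 6
queries = rng.standard_normal((N, C))
pixels = rng.standard_normal((T, C, H, W))

logits = predict_mask_volumes(queries, pixels)
# A logit above 0 corresponds to a sigmoid probability above 0.5.
masks = logits > 0
print(logits.shape)
```

Because the same query produces the mask in every frame, the instance identity is shared across time in this sketch, so no separate frame-to-frame association step appears in it.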
Related papers
- Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose a training strategy Masked Video Consistency, which enhances spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z)
- 2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation [12.274092278786966]
Video Panoptic Segmentation (VPS) aims to simultaneously classify, track, and segment all objects in a video.
We propose a robust integrated video panoptic segmentation solution.
Our method achieves state-of-the-art performance with a VPQ score of 56.36 and 57.12 in the development and test phases.
arXiv Detail & Related papers (2024-06-01T17:03:16Z)
- RefineVIS: Video Instance Segmentation with Temporal Attention Refinement [23.720986152136785]
RefineVIS learns two separate representations on top of an off-the-shelf frame-level image instance segmentation model.
A Temporal Attention Refinement (TAR) module learns discriminative segmentation representations by exploiting temporal relationships.
It achieves state-of-the-art video instance segmentation accuracy on the YouTube-VIS 2019 (64.4 AP), YouTube-VIS 2021 (61.4 AP), and OVIS (46.1 AP) datasets.
arXiv Detail & Related papers (2023-06-07T20:45:15Z)
- Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation [104.27219170531059]
Video Panoptic Segmentation (VPS) aims to achieve comprehensive pixel-level scene understanding by segmenting all pixels and associating objects in a video.
Current solutions can be categorized into online and near-online approaches.
We propose a unified approach for online and near-online VPS.
arXiv Detail & Related papers (2023-04-10T16:17:25Z)
- Robust Online Video Instance Segmentation with Track Queries [15.834703258232002]
We propose a fully online transformer-based video instance segmentation model that performs comparably to top offline methods on the YouTube-VIS 2019 benchmark.
We show that, when combined with a strong enough image segmentation architecture, track queries can exhibit impressive accuracy while not being constrained to short videos.
arXiv Detail & Related papers (2022-11-16T18:50:14Z)
- Guess What Moves: Unsupervised Video and Image Segmentation by Anticipating Motion [92.80981308407098]
We propose an approach that combines the strengths of motion-based and appearance-based segmentation.
We propose to supervise an image segmentation network, tasking it with predicting regions that are likely to contain simple motion patterns.
In the unsupervised video segmentation mode, the network is trained on a collection of unlabelled videos, using the learning process itself as an algorithm to segment these videos.
arXiv Detail & Related papers (2022-05-16T17:55:34Z)
- Masked-attention Mask Transformer for Universal Image Segmentation [180.73009259614494]
We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic).
Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions.
In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets.
arXiv Detail & Related papers (2021-12-02T18:59:58Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
- VideoClick: Video Object Segmentation with a Single Click [93.7733828038616]
We propose a bottom-up approach where, given a single click for each object in a video, we obtain the segmentation masks of these objects in the full video.
In particular, we construct a correlation volume that assigns each pixel in a target frame to either one of the objects in the reference frame or the background.
Results on this new CityscapesVideo dataset show that our approach outperforms all the baselines in this challenging setting.
arXiv Detail & Related papers (2021-01-16T23:07:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.