MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training
- URL: http://arxiv.org/abs/2208.02245v1
- Date: Wed, 3 Aug 2022 17:50:42 GMT
- Title: MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training
- Authors: De-An Huang, Zhiding Yu, Anima Anandkumar
- Abstract summary: MinVIS is a minimal video instance segmentation framework.
It achieves state-of-the-art VIS performance with neither video-based architectures nor training procedures.
- Score: 84.81566912372328
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose MinVIS, a minimal video instance segmentation (VIS) framework that
achieves state-of-the-art VIS performance with neither video-based
architectures nor training procedures. By only training a query-based image
instance segmentation model, MinVIS outperforms the previous best result on the
challenging Occluded VIS dataset by over 10% AP. Since MinVIS treats frames in
training videos as independent images, we can drastically sub-sample the
annotated frames in training videos without any modifications. With only 1% of
labeled frames, MinVIS outperforms or is comparable to fully-supervised
state-of-the-art approaches on YouTube-VIS 2019/2021. Our key observation is
that queries trained to be discriminative between intra-frame object instances
are temporally consistent and can be used to track instances without any
manually designed heuristics. MinVIS thus has the following inference pipeline:
we first apply the trained query-based image instance segmentation to video
frames independently. The segmented instances are then tracked by bipartite
matching of the corresponding queries. This inference is done in an online
fashion and does not need to process the whole video at once. MinVIS thus has
the practical advantages of reducing both the labeling costs and the memory
requirements, while not sacrificing the VIS performance. Code is available at:
https://github.com/NVlabs/MinVIS
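The two-step inference pipeline in the abstract (per-frame query-based segmentation, then tracking by bipartite matching of queries) is simple enough to sketch. Below is a minimal illustration, not the released implementation: `segment_frame` is a hypothetical stand-in for the trained query-based image instance segmentation model, and the matching uses scipy's Hungarian solver on query cosine similarity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def track_by_query_matching(frames, segment_frame):
    """MinVIS-style online inference sketch.

    segment_frame(frame) -> (queries [N, D], masks [N, H, W]) is a
    stand-in for a trained query-based image instance segmentation
    model; N is assumed fixed across frames, as in query-based models.
    """
    prev_queries, track_ids, results = None, None, []
    for frame in frames:
        queries, masks = segment_frame(frame)
        # L2-normalize so the dot product below is cosine similarity.
        queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        if prev_queries is None:
            track_ids = np.arange(len(queries))       # one track per query
        else:
            # Bipartite matching between consecutive frames' queries:
            # maximizing similarity == minimizing negative similarity.
            sim = queries @ prev_queries.T            # [N, N]
            row, col = linear_sum_assignment(-sim)
            new_ids = np.empty(len(queries), dtype=int)
            new_ids[row] = track_ids[col]             # inherit matched ids
            track_ids = new_ids
        results.append((track_ids.copy(), masks))
        prev_queries = queries    # online: only the last frame is kept
    return results
```

Since only the previous frame's queries are retained, memory use is constant in the video length, which is the online property the abstract highlights.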
Related papers
- UVIS: Unsupervised Video Instance Segmentation [65.46196594721545]
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames.
We propose UVIS, a novel Unsupervised Video Instance Segmentation framework that can perform video instance segmentation without any video annotations or dense label-based pretraining.
Our framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking.
arXiv Detail & Related papers (2024-06-11T03:05:50Z)
- Mask-Free Video Instance Segmentation [102.50936366583106]
Video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets.
We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state.
Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection.
arXiv Detail & Related papers (2023-03-28T11:48:07Z)
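The one-to-many temporal matching described in the MaskFreeVIS entry above can also be sketched. The following rough PyTorch illustration makes several assumptions: `emb_t` and `emb_t1` are hypothetical per-location appearance embeddings standing in for the paper's patch features, matches are the K nearest neighbors by embedding distance, and matched locations are simply encouraged to agree in mask probability; the actual TK-Loss has patch-matching details not reproduced here.

```python
import torch
import torch.nn.functional as F

def tk_consistency_sketch(mask_t, mask_t1, emb_t, emb_t1, k=5):
    """Rough sketch of a temporal KNN mask-consistency loss.

    mask_t, mask_t1: [P] predicted mask probabilities per frame
                     (P = number of locations, downsampled, since
                     the pairwise distances below are O(P^2)).
    emb_t, emb_t1:   [P, D] per-location appearance embeddings,
                     hypothetical stand-ins for TK-Loss patch features.
    """
    # Pairwise embedding distances between the two frames.
    dist = torch.cdist(emb_t, emb_t1)                  # [P, P]
    # One-to-many: each location in frame t keeps its K nearest
    # neighbors in frame t+1.
    knn = dist.topk(k, dim=1, largest=False).indices   # [P, K]
    matched = mask_t1[knn]                             # [P, K]
    # Penalize mask disagreement across matched locations.
    return F.l1_loss(mask_t.unsqueeze(1).expand_as(matched), matched)
```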
- BoxVIS: Video Instance Segmentation with Box Annotations [15.082477136581153]
We adapt the state-of-the-art pixel-supervised VIS models to a box-supervised VIS baseline and observe slight performance degradation.
We propose a box-center guided spatial-temporal pairwise affinity loss to predict instance masks for better spatial and temporal consistency.
It exhibits instance mask prediction performance comparable to, and better generalization ability than, state-of-the-art pixel-supervised VIS models, while using only 16% of their annotation time and cost.
arXiv Detail & Related papers (2023-03-26T04:04:58Z)
- A Generalized Framework for Video Instance Segmentation [49.41441806931224]
The handling of long videos with complex and occluded sequences has emerged as a new challenge in the video instance segmentation (VIS) community.
We propose a Generalized framework for VIS, namely GenVIS, that achieves state-of-the-art performance on challenging benchmarks.
We evaluate our approach on popular VIS benchmarks, achieving state-of-the-art results on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS).
arXiv Detail & Related papers (2022-11-16T11:17:19Z)
- DeVIS: Making Deformable Transformers Work for Video Instance Segmentation [4.3012765978447565]
Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences.
Transformers have recently made it possible to cast the entire VIS task as a single set-prediction problem.
Deformable attention provides a more efficient alternative, but its application to the temporal domain and to the segmentation task has not yet been explored.
arXiv Detail & Related papers (2022-07-22T14:27:45Z)
- Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video Instance Segmentation aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z)
- Crossover Learning for Fast Online Video Instance Segmentation [53.5613957875507]
We present a novel crossover learning scheme that uses the instance feature in the current frame to pixel-wisely localize the same instance in other frames.
To our knowledge, CrossVIS achieves state-of-the-art performance among all online VIS methods and shows a decent trade-off between latency and accuracy.
arXiv Detail & Related papers (2021-04-13T06:47:40Z)
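As a final illustration, the crossover idea in the CrossVIS entry above amounts to using an instance's feature from one frame as a dynamic filter on another frame's feature map. A minimal sketch under assumed shapes, with a single 1x1 dynamic convolution standing in for the paper's more elaborate head:

```python
import torch
import torch.nn.functional as F

def crossover_mask(inst_feat, other_frame_feats):
    """Localize an instance from frame t in another frame t'.

    inst_feat:         [D] instance feature from frame t, a
                       hypothetical stand-in for the dynamic filter
                       parameters CrossVIS generates.
    other_frame_feats: [D, H, W] feature map of frame t'.
    Returns [H, W] mask logits for the same instance in frame t'.
    """
    # Treat the instance feature as a 1x1 convolution kernel, i.e.
    # correlate it with every spatial location of the other frame.
    weight = inst_feat.view(1, -1, 1, 1)                   # [1, D, 1, 1]
    logits = F.conv2d(other_frame_feats.unsqueeze(0), weight)
    return logits[0, 0]                                    # [H, W]
```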