Crossover Learning for Fast Online Video Instance Segmentation
- URL: http://arxiv.org/abs/2104.05970v1
- Date: Tue, 13 Apr 2021 06:47:40 GMT
- Title: Crossover Learning for Fast Online Video Instance Segmentation
- Authors: Shusheng Yang, Yuxin Fang, Xinggang Wang, Yu Li, Chen Fang, Ying Shan,
Bin Feng, Wenyu Liu
- Abstract summary: We present a novel crossover learning scheme that uses the instance feature in the current frame to pixel-wisely localize the same instance in other frames.
To our knowledge, CrossVIS achieves state-of-the-art performance among all online VIS methods and shows a decent trade-off between latency and accuracy.
- Score: 53.5613957875507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling temporal visual context across frames is critical for video instance
segmentation (VIS) and other video understanding tasks. In this paper, we
propose a fast online VIS model named CrossVIS. For temporal information
modeling in VIS, we present a novel crossover learning scheme that uses the
instance feature in the current frame to pixel-wisely localize the same
instance in other frames. Different from previous schemes, crossover learning
does not require any additional network parameters for feature enhancement. By
integrating with the instance segmentation loss, crossover learning enables
efficient cross-frame instance-to-pixel relation learning and brings cost-free
improvement during inference. Besides, a global balanced instance embedding
branch is proposed for more accurate and more stable online instance
association. We conduct extensive experiments on three challenging VIS
benchmarks, i.e., YouTube-VIS-2019, OVIS, and YouTube-VIS-2021, to evaluate our
methods. To our knowledge, CrossVIS achieves state-of-the-art performance among
all online VIS methods and shows a decent trade-off between latency and
accuracy. Code will be available to facilitate future research.
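The crossover scheme can be pictured with a CondInst-style dynamic mask head: filters predicted for an instance in frame t are applied, unchanged, to the mask features of frame t' and supervised by the same instance's ground-truth mask there. Below is a minimal PyTorch sketch of that idea; the single-layer dynamic head, the tensor shapes, and the `crossover_loss` helper are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of crossover learning (assumptions noted above):
# instance-conditioned filters from frame t are reused on frame t',
# so cross-frame supervision adds no new network parameters.
import torch
import torch.nn.functional as F

def dynamic_mask_head(mask_feats, weight, bias):
    """Hypothetical one-layer dynamic head; CondInst stacks several
    small dynamic convs, but a single 1x1 conv shows the mechanism."""
    # mask_feats: (C, H, W); weight: (C,); bias: (1,)
    return F.conv2d(mask_feats.unsqueeze(0),
                    weight.view(1, -1, 1, 1), bias)[0, 0]  # (H, W) logits

def crossover_loss(weight_t, bias_t, mask_feats_tp, gt_mask_tp):
    """Filters predicted for an instance in frame t segment the same
    instance in frame t'; the ordinary mask loss supervises the result."""
    logits = dynamic_mask_head(mask_feats_tp, weight_t, bias_t)
    return F.binary_cross_entropy_with_logits(logits, gt_mask_tp)

# Toy usage with random tensors standing in for network outputs.
C, H, W = 8, 32, 32
weight_t, bias_t = torch.randn(C), torch.randn(1)   # from frame t
mask_feats_tp = torch.randn(C, H, W)                # from frame t'
gt_mask_tp = (torch.rand(H, W) > 0.5).float()       # instance mask in t'
loss = crossover_loss(weight_t, bias_t, mask_feats_tp, gt_mask_tp)
```

Because the same head and mask loss are simply reused across frames, the cross-frame term adds no parameters and nothing to the inference path, which matches the abstract's claim of cost-free improvement at test time.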
Related papers
- UVIS: Unsupervised Video Instance Segmentation [65.46196594721545]
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames.
We propose UVIS, a novel Unsupervised Video Instance Segmentation framework that can perform video instance segmentation without any video annotations or dense label-based pretraining.
Our framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking.
arXiv Detail & Related papers (2024-06-11T03:05:50Z)
- NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation [22.200700685751826]
The Video Instance Segmentation (VIS) community operated under the common belief that offline methods are generally superior to frame-by-frame online processing.
We present a detailed analysis of different processing paradigms and a new end-to-end Video Instance Segmentation method.
Our NOVIS represents the first near-online VIS approach which avoids any handcrafted tracking heuristics.
arXiv Detail & Related papers (2023-08-29T12:51:04Z)
- CTVIS: Consistent Training for Online Video Instance Segmentation [62.957370691452844]
Discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS).
Recent online VIS methods leverage contrastive items (CIs) sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings.
We propose a simple yet effective training strategy, called Consistent Training for Online VIS (CTVIS), which is devoted to aligning the training and inference pipelines.
arXiv Detail & Related papers (2023-07-24T08:44:25Z)
- A Generalized Framework for Video Instance Segmentation [49.41441806931224]
The handling of long videos with complex and occluded sequences has emerged as a new challenge in the video instance segmentation (VIS) community.
We propose a Generalized framework for VIS, namely GenVIS, that achieves state-of-the-art performance on challenging benchmarks.
We evaluate our approach on popular VIS benchmarks, achieving state-of-the-art results on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS).
arXiv Detail & Related papers (2022-11-16T11:17:19Z)
- DeVIS: Making Deformable Transformers Work for Video Instance Segmentation [4.3012765978447565]
Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences.
Transformers recently made it possible to cast the entire VIS task as a single set-prediction problem.
Deformable attention provides a more efficient alternative, but its application to the temporal domain or the segmentation task has not yet been explored.
arXiv Detail & Related papers (2022-07-22T14:27:45Z)
- Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video Instance Segmentation aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z)
- STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation [47.28515170195206]
Video Instance Segmentation (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video.
Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions.
We present a simple and efficient single-stage VIS framework based on the instance segmentation method CondInst.
arXiv Detail & Related papers (2022-02-08T09:34:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.