Instance As Identity: A Generic Online Paradigm for Video Instance
Segmentation
- URL: http://arxiv.org/abs/2208.03079v1
- Date: Fri, 5 Aug 2022 10:29:30 GMT
- Title: Instance As Identity: A Generic Online Paradigm for Video Instance
Segmentation
- Authors: Feng Zhu and Zongxin Yang and Xin Yu and Yi Yang and Yunchao Wei
- Abstract summary: We propose a new online VIS paradigm named Instance As Identity (IAI)
IAI models temporal information for both detection and tracking in an efficient way.
We conduct extensive experiments on three VIS benchmarks.
- Score: 84.3695480773597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling temporal information for both detection and tracking in a unified
framework has proven a promising solution to video instance segmentation
(VIS). However, how to effectively incorporate temporal information into an
online model remains an open problem. In this work, we propose a new online VIS
paradigm named Instance As Identity (IAI), which models temporal information
for both detection and tracking in an efficient way. In detail, IAI employs a
novel identification module to explicitly predict identification numbers for
tracking instances. To pass temporal information across frames, IAI
utilizes an association module which combines current features and past
embeddings. Notably, IAI can be integrated with different image models. We
conduct extensive experiments on three VIS benchmarks. IAI outperforms all the
online competitors on YouTube-VIS-2019 (ResNet-101 41.9 mAP) and
YouTube-VIS-2021 (ResNet-50 37.7 mAP). Surprisingly, on the more challenging
OVIS, IAI achieves SOTA performance (20.3 mAP). Code is available at
https://github.com/zfonemore/IAI
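The association step described in the abstract (combining current-frame features with past embeddings, and giving each tracked instance an explicit identification number) can be sketched in a simplified form. This is a hypothetical illustration only, not the authors' implementation: the function `assign_ids`, the greedy cosine-similarity matching, and the `sim_thresh` threshold are all assumptions made for the example.

```python
import numpy as np

def assign_ids(curr_emb, past_emb, past_ids, next_id, sim_thresh=0.5):
    """Greedily match current-frame instance embeddings to past identity
    embeddings by cosine similarity; any unmatched instance is given a
    fresh identification number."""
    if len(past_emb) == 0:
        # No history yet: every instance starts a new identity.
        return list(range(next_id, next_id + len(curr_emb))), next_id + len(curr_emb)

    def normalize(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    # Cosine similarity between every current and every past embedding.
    sim = normalize(curr_emb) @ normalize(past_emb).T
    ids = [-1] * len(curr_emb)
    used = set()
    # Visit candidate (current, past) pairs from highest similarity down.
    for flat in np.argsort(-sim, axis=None):
        i, j = divmod(int(flat), sim.shape[1])
        if ids[i] != -1 or j in used or sim[i, j] < sim_thresh:
            continue
        ids[i] = past_ids[j]
        used.add(j)
    # Unmatched instances start new identities.
    for i, v in enumerate(ids):
        if v == -1:
            ids[i] = next_id
            next_id += 1
    return ids, next_id
```

A real online VIS model would of course learn these embeddings and predict identities with a network head rather than a hand-written matcher; the sketch only conveys the bookkeeping of carrying identities across frames.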
Related papers
- UVIS: Unsupervised Video Instance Segmentation [65.46196594721545]
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames.
We propose UVIS, a novel Unsupervised Video Instance Segmentation framework that can perform video instance segmentation without any video annotations or dense label-based pretraining.
Our framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking.
arXiv Detail & Related papers (2024-06-11T03:05:50Z)
- CTVIS: Consistent Training for Online Video Instance Segmentation [62.957370691452844]
Discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS).
Recent online VIS methods leverage contrastive items (CIs) sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings.
We propose a simple yet effective training strategy, called Consistent Training for Online VIS (CTVIS), which aligns the training and inference pipelines.
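The contrastive items (CIs) mentioned above can be illustrated with a minimal InfoNCE-style loss over instance embeddings: the anchor should score high against embeddings of the same instance in reference frames and low against other instances. This sketch is an assumption for illustration, not CTVIS itself; the function name, temperature, and the use of plain NumPy are all hypothetical.

```python
import numpy as np

def contrastive_loss(anchor, positives, negatives, tau=0.1):
    """InfoNCE-style loss over instance embeddings: the anchor is pulled
    toward positive contrastive items (the same instance seen in
    reference frames) and pushed away from negatives (other instances)."""
    anchor = np.asarray(anchor, dtype=float)

    def sim(b):
        b = np.asarray(b, dtype=float)
        # Cosine similarity between the anchor and one contrastive item.
        return anchor @ b / (np.linalg.norm(anchor) * np.linalg.norm(b))

    pos = sum(np.exp(sim(p) / tau) for p in positives)
    neg = sum(np.exp(sim(n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))
```

CTVIS's argument is then about where the positives and negatives come from: drawing them from several reference frames (rather than one) exposes the embedding to more appearance variation during training.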
arXiv Detail & Related papers (2023-07-24T08:44:25Z)
- Offline-to-Online Knowledge Distillation for Video Instance Segmentation [13.270872063217022]
We present offline-to-online knowledge distillation (OOKD) for video instance segmentation (VIS).
Our method transfers a wealth of video knowledge from an offline model to an online model for consistent prediction.
Our method also achieves state-of-the-art performance on YTVIS-21, YTVIS-22, and OVIS datasets, with mAP scores of 46.1%, 43.6%, and 31.1%, respectively.
arXiv Detail & Related papers (2023-02-15T08:24:37Z)
- Two-Level Temporal Relation Model for Online Video Instance Segmentation [3.9349485816629888]
We propose an online method whose performance is on par with that of its offline counterparts.
We introduce a message-passing graph neural network that encodes objects and relates them through time.
Our model is trained end-to-end and achieves state-of-the-art performance on the YouTube-VIS dataset.
arXiv Detail & Related papers (2022-10-30T10:01:01Z)
- STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation [47.28515170195206]
Video Instance Segmentation (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video.
Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions.
We present a simple and efficient single-stage VIS framework based on the instance segmentation method CondInst.
arXiv Detail & Related papers (2022-02-08T09:34:26Z)
- 1st Place Solution for YouTubeVOS Challenge 2021: Video Instance Segmentation [0.39146761527401414]
Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously.
We propose two modules, named Temporally Correlated Instance Segmentation (TCIS) and Bidirectional Tracking (BiTrack).
By combining these techniques with a bag of tricks, the network performance is significantly boosted compared to the baseline.
arXiv Detail & Related papers (2021-06-12T00:20:38Z)
- Crossover Learning for Fast Online Video Instance Segmentation [53.5613957875507]
We present a novel crossover learning scheme that uses the instance feature in the current frame to localize the same instance, pixel-wise, in other frames.
To our knowledge, CrossVIS achieves state-of-the-art performance among all online VIS methods and shows a decent trade-off between latency and accuracy.
arXiv Detail & Related papers (2021-04-13T06:47:40Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark, without complicated bells and whistles, in both speed and accuracy: 0.14 seconds per frame and a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.