1st Place Solution for YouTubeVOS Challenge 2021:Video Instance
Segmentation
- URL: http://arxiv.org/abs/2106.06649v1
- Date: Sat, 12 Jun 2021 00:20:38 GMT
- Title: 1st Place Solution for YouTubeVOS Challenge 2021:Video Instance
Segmentation
- Authors: Thuy C. Nguyen, Tuan N. Tang, Nam LH. Phan, Chuong H. Nguyen, Masayuki
Yamazaki, Masao Yamanaka
- Abstract summary: Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously.
We propose two modules, named Temporally Correlated Instance Segmentation (TCIS) and Bidirectional Tracking (BiTrack).
By combining these techniques with a bag of tricks, the network performance is significantly boosted compared to the baseline.
- Score: 0.39146761527401414
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video Instance Segmentation (VIS) is a multi-task problem performing
detection, segmentation, and tracking simultaneously. Extended from image set
applications, video data additionally induces the temporal information, which,
if handled appropriately, is very useful to identify and predict object
motions. In this work, we design a unified model to mutually learn these tasks.
Specifically, we propose two modules, named Temporally Correlated Instance
Segmentation (TCIS) and Bidirectional Tracking (BiTrack), to take the benefit
of the temporal correlation between the object's instance masks across adjacent
frames. On the other hand, video data is often redundant due to the frame's
overlap. Our analysis shows that this problem is particularly severe for the
YoutubeVOS-VIS2021 data. Therefore, we propose a Multi-Source Data (MSD)
training mechanism to compensate for the data deficiency. By combining these
techniques with a bag of tricks, the network performance is significantly
boosted compared to the baseline, and outperforms other methods by a
considerable margin on the YoutubeVOS-VIS 2019 and 2021 datasets.
Related papers
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse point and box tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z)
- Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA).
It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z)
- DVIS: Decoupled Video Instance Segmentation Framework [15.571072365208872]
Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing.
Existing methods often underperform on complex and long videos in the real world, primarily due to two factors.
We propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement.
arXiv Detail & Related papers (2023-06-06T05:24:15Z)
- Human Instance Segmentation and Tracking via Data Association and Single-stage Detector [17.46922710432633]
Human video instance segmentation plays an important role in computer understanding of human activities.
Most current VIS methods are based on the Mask R-CNN framework.
We develop a new method for human video instance segmentation based on a single-stage detector.
arXiv Detail & Related papers (2022-03-31T11:36:09Z)
- Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency [28.352140544936198]
Weakly supervised instance segmentation reduces the cost of annotations required to train models.
We show that these issues can be better addressed by training with weakly labeled videos instead of images.
We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation.
arXiv Detail & Related papers (2021-03-23T23:20:46Z)
- CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video.
Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects.
We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
arXiv Detail & Related papers (2020-12-07T00:31:42Z)
- Hybrid-S2S: Video Object Segmentation with Recurrent Networks and Correspondence Matching [3.9053553775979086]
One-shot Video Object Segmentation (VOS) is the task of tracking an object of interest within a video sequence.
We study an RNN-based architecture and address some of these issues by proposing a hybrid sequence-to-sequence architecture named HS2S.
Our experiments show that augmenting the RNN with correspondence matching is a highly effective solution to reduce the drift problem.
arXiv Detail & Related papers (2020-10-10T19:00:43Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy without complicated bells and whistles, with a speed of 0.14 seconds per frame and a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)