SG-Net: Spatial Granularity Network for One-Stage Video Instance Segmentation
- URL: http://arxiv.org/abs/2103.10284v1
- Date: Thu, 18 Mar 2021 14:31:15 GMT
- Title: SG-Net: Spatial Granularity Network for One-Stage Video Instance Segmentation
- Authors: Dongfang Liu, Yiming Cui, Wenbo Tan, Yingjie Chen
- Abstract summary: Video instance segmentation (VIS) is a new and critical task in computer vision.
We propose a one-stage spatial granularity network (SG-Net) for VIS.
We show that our method can achieve improved performance in both accuracy and inference speed.
- Score: 7.544917072241684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video instance segmentation (VIS) is a new and critical task in computer
vision. To date, top-performing VIS methods extend the two-stage Mask R-CNN by
adding a tracking branch, leaving plenty of room for improvement. In contrast,
we approach the VIS task from a new perspective and propose a one-stage spatial
granularity network (SG-Net). Compared to the conventional two-stage methods,
SG-Net demonstrates four advantages: 1) Our method has a one-stage compact
architecture and each task head (detection, segmentation, and tracking) is
crafted interdependently, so they can effectively share features and benefit
from joint optimization; 2) Our mask prediction is dynamically performed on the
sub-regions of each detected instance, leading to high-quality masks of fine
granularity; 3) Each of our task predictions avoids using expensive
proposal-based RoI features, resulting in much reduced runtime complexity per
instance; 4) Our tracking head models the movements of object centers for
tracking, which effectively enhances tracking robustness to varying
object appearances. In evaluation, we present state-of-the-art comparisons on
the YouTube-VIS dataset. Extensive experiments demonstrate that our compact
one-stage method can achieve improved performance in both accuracy and
inference speed. We hope our SG-Net could serve as a strong and flexible
baseline for the VIS task. Our code will be available.
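As an illustration of advantage 4, the following is a minimal sketch of center-movement tracking, assuming the tracking head outputs a 2-D movement vector per detected instance. This is not the released SG-Net code; the function name, the offset convention, and the max_dist threshold are hypothetical.

```python
# Hedged sketch of center-movement association (hypothetical, not SG-Net's code).
import numpy as np

def link_by_center_movement(prev_centers, curr_centers, curr_offsets,
                            max_dist=32.0):
    """Associate detections in frame t with detections in frame t-1.

    prev_centers: (M, 2) instance centers detected in frame t-1.
    curr_centers: (N, 2) instance centers detected in frame t.
    curr_offsets: (N, 2) predicted movement of each center since frame t-1.
    Returns, for each current instance, the matched previous index or -1.
    """
    if len(prev_centers) == 0:
        return [-1] * len(curr_centers)
    matches = []
    for center, offset in zip(curr_centers, curr_offsets):
        predicted_prev = center - offset  # estimated position at frame t-1
        dists = np.linalg.norm(prev_centers - predicted_prev, axis=1)
        j = int(np.argmin(dists))
        matches.append(j if dists[j] <= max_dist else -1)
    return matches
```

Matching here is greedy nearest-neighbor per instance; the paper may use a different assignment scheme, but the key point is that predicted center movement, rather than appearance features alone, drives the association.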
Related papers
- DynaSeg: A Deep Dynamic Fusion Method for Unsupervised Image Segmentation Incorporating Feature Similarity and Spatial Continuity [0.5755004576310334]
We introduce DynaSeg, an innovative unsupervised image segmentation approach.
Unlike traditional methods, DynaSeg employs a dynamic weighting scheme that adapts flexibly to image characteristics.
DynaSeg prevents undersegmentation failures where the number of predicted clusters might converge to one.
arXiv Detail & Related papers (2024-05-09T00:30:45Z)
- Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration [13.284951215948052]
We present a novel weakly supervised framework with spatio-temporal collaboration for instance segmentation in videos.
Our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN.
arXiv Detail & Related papers (2022-12-15T02:44:13Z)
- Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on HQ-YTVIS, as well as the YouTube-VIS, OVIS, and BDD100K MOTS benchmarks.
arXiv Detail & Related papers (2022-07-28T11:13:37Z)
- A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [59.21990697929617]
Humans tend to mine objects by learning from a group of images or several frames of video since we live in a dynamic world.
Previous approaches design separate networks for such similar tasks, which makes them difficult to apply to one another.
We introduce a unified framework, termed UFO (UniFied framework for Co-Object segmentation), to tackle these issues.
arXiv Detail & Related papers (2022-03-09T13:35:19Z)
- STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation [47.28515170195206]
Video Instance Segmentation (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video.
Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions.
We present a simple and efficient single-stage VIS framework based on the instance segmentation method CondInst.
arXiv Detail & Related papers (2022-02-08T09:34:26Z)
- Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously, before the fusion and decoding stage; a sketch of this idea appears after this list.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
- Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS-17 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods in both segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)
- Learning to Track Instances without Video Annotations [85.9865889886669]
We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
arXiv Detail & Related papers (2021-04-01T06:47:41Z)
- Visual Object Tracking by Segmentation with Graph Convolutional Network [7.729569666460712]
We propose to utilize a graph convolutional network (GCN) model for superpixel-based object tracking.
The proposed model provides a general end-to-end framework which integrates i) label linear prediction and ii) structure-aware feature information of each superpixel.
arXiv Detail & Related papers (2020-09-05T12:43:21Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
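For the Full-Duplex Strategy entry above, the following is a minimal sketch of the full-duplex idea under simplified assumptions: an appearance (RGB) stream and a motion (optical flow) stream exchange features in both directions at once before a shared fusion step. The module name and layer shapes are hypothetical; this is not the authors' implementation.

```python
# Hedged sketch of a bidirectional ("full-duplex") cross-modal feature exchange.
import torch
import torch.nn as nn

class FullDuplexExchange(nn.Module):
    """Toy exchange between appearance and motion feature maps."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # One lightweight projection per direction of the exchange.
        self.flow_to_rgb = nn.Conv2d(channels, channels, 3, padding=1)
        self.rgb_to_flow = nn.Conv2d(channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb_feat: torch.Tensor, flow_feat: torch.Tensor):
        # Both directions read the *original* inputs, so transmission and
        # receiving happen simultaneously (full duplex), not sequentially.
        rgb_updated = rgb_feat + self.flow_to_rgb(flow_feat)
        flow_updated = flow_feat + self.rgb_to_flow(rgb_feat)
        return self.fuse(torch.cat([rgb_updated, flow_updated], dim=1))
```

The design point is symmetry: neither stream is updated before the other, which is what distinguishes a full-duplex exchange from the more common one-directional feature injection.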