SG-Net: Spatial Granularity Network for One-Stage Video Instance
Segmentation
- URL: http://arxiv.org/abs/2103.10284v1
- Date: Thu, 18 Mar 2021 14:31:15 GMT
- Title: SG-Net: Spatial Granularity Network for One-Stage Video Instance
Segmentation
- Authors: Dongfang Liu, Yiming Cui, Wenbo Tan, Yingjie Chen
- Abstract summary: Video instance segmentation (VIS) is a new and critical task in computer vision.
We propose a one-stage spatial granularity network (SG-Net) for VIS.
We show that our method can achieve improved performance in both accuracy and inference speed.
- Score: 7.544917072241684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video instance segmentation (VIS) is a new and critical task in computer
vision. To date, top-performing VIS methods extend the two-stage Mask R-CNN by
adding a tracking branch, leaving plenty of room for improvement. In contrast,
we approach the VIS task from a new perspective and propose a one-stage spatial
granularity network (SG-Net). Compared to the conventional two-stage methods,
SG-Net demonstrates four advantages: 1) Our method has a compact one-stage
architecture, and each task head (detection, segmentation, and tracking) is
crafted interdependently so that the heads can effectively share features and
benefit from joint optimization; 2) Our mask prediction is performed
dynamically on the sub-regions of each detected instance, leading to
high-quality masks of fine granularity; 3) Each of our task predictions avoids
using expensive proposal-based RoI features, substantially reducing the
runtime complexity per instance; 4) Our tracking head models objects'
centerness movements for tracking, which effectively enhances tracking
robustness to varying object appearances. In evaluation, we present
state-of-the-art comparisons on
the YouTube-VIS dataset. Extensive experiments demonstrate that our compact
one-stage method can achieve improved performance in both accuracy and
inference speed. We hope our SG-Net could serve as a strong and flexible
baseline for the VIS task. Our code will be available.
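To make advantage 4) concrete, the sketch below illustrates the general idea
of associating instances across frames by a predicted center movement. This is
a minimal illustration under assumptions, not SG-Net's actual code: the names
(Instance, associate_by_center_offset, max_dist) are hypothetical, and the
assumed setup is that the tracking head outputs, for each detected instance,
an offset from its current center back to its center in the previous frame.

```python
# Hypothetical sketch: greedy cross-frame association by predicted center
# movement. Not SG-Net's implementation; names and threshold are illustrative.
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Instance:
    center: Tuple[float, float]     # (x, y) center in the current frame
    offset: Tuple[float, float]     # predicted movement back to previous frame
    track_id: Optional[int] = None  # identity assigned by association

def associate_by_center_offset(prev: List[Instance], curr: List[Instance],
                               max_dist: float = 32.0) -> None:
    """Match each current instance to the previous instance whose center is
    nearest to the current center shifted by the predicted offset."""
    used = set()
    next_id = 1 + max((p.track_id for p in prev if p.track_id is not None),
                      default=-1)
    for inst in curr:
        # Predicted location of this instance's center in the previous frame.
        px = inst.center[0] + inst.offset[0]
        py = inst.center[1] + inst.offset[1]
        best, best_d = None, max_dist
        for j, p in enumerate(prev):
            if j in used:
                continue
            d = math.hypot(px - p.center[0], py - p.center[1])
            if d < best_d:
                best, best_d = j, d
        if best is not None:            # inherit the matched identity
            used.add(best)
            inst.track_id = prev[best].track_id
        else:                           # no close match: start a new track
            inst.track_id = next_id
            next_id += 1
```

The distance gate (max_dist) keeps distant detections from stealing
identities; a real tracker would typically combine this geometric cue with
appearance similarity.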
Related papers
- DynaSeg: A Deep Dynamic Fusion Method for Unsupervised Image Segmentation Incorporating Feature Similarity and Spatial Continuity [0.5755004576310334]
We introduce DynaSeg, an innovative unsupervised image segmentation approach.
Unlike traditional methods, DynaSeg employs a dynamic weighting scheme.
It adapts flexibly to image characteristics and facilitates easy integration with other segmentation networks.
arXiv Detail & Related papers (2024-05-09T00:30:45Z) - Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised
Framework with Spatio-Temporal Collaboration [13.284951215948052]
We present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos.
Our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN.
arXiv Detail & Related papers (2022-12-15T02:44:13Z) - Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS, YouTube-VIS, OVIS, and BDD100K MOTS benchmarks.
arXiv Detail & Related papers (2022-07-28T11:13:37Z) - A Unified Transformer Framework for Group-based Segmentation:
Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [59.21990697929617]
Humans tend to mine objects by learning from a group of images or several frames of video since we live in a dynamic world.
Previous approaches design different networks for these similar tasks separately, making them difficult to apply to each other.
We introduce a unified framework, termed UFO (UniFied Object Framework for Co-Object Segmentation), to tackle these issues.
arXiv Detail & Related papers (2022-03-09T13:35:19Z) - STC: Spatio-Temporal Contrastive Learning for Video Instance
Segmentation [47.28515170195206]
Video Instance Segmentation (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video.
Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions.
We present a simple and efficient single-stage VIS framework based on the instance segmentation method CondInst.
arXiv Detail & Related papers (2022-02-08T09:34:26Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously, before the fusion and decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z) - Target-Aware Object Discovery and Association for Unsupervised Video
Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS 2017 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods in both segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z) - Learning to Track Instances without Video Annotations [85.9865889886669]
We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
arXiv Detail & Related papers (2021-04-01T06:47:41Z) - Visual Object Tracking by Segmentation with Graph Convolutional Network [7.729569666460712]
We propose to utilize graph convolutional network (GCN) model for superpixel based object tracking.
The proposed model provides a general end-to-end framework which integrates i) linear label prediction and ii) structure-aware feature information of each superpixel.
arXiv Detail & Related papers (2020-09-05T12:43:21Z) - Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%; a generic sketch of the template-matching idea follows this entry.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)