Spatiotemporal Graph Neural Network based Mask Reconstruction for Video
Object Segmentation
- URL: http://arxiv.org/abs/2012.05499v1
- Date: Thu, 10 Dec 2020 07:57:44 GMT
- Title: Spatiotemporal Graph Neural Network based Mask Reconstruction for Video
Object Segmentation
- Authors: Daizong Liu, Shuangjie Xu, Xiao-Yang Liu, Zichuan Xu, Wei Wei, Pan
Zhou
- Abstract summary: This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting.
We propose a novel spatiotemporal graph neural network (STG-Net) which captures the local contexts by utilizing all proposals.
- Score: 70.97625552643493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the task of segmenting class-agnostic objects in a
semi-supervised setting. Although previous detection-based methods achieve
relatively good performance, these approaches extract the best proposal by a
greedy strategy, which may lose the local patch details outside the chosen
candidate. In this paper, we propose a novel spatiotemporal graph neural
network (STG-Net) to reconstruct more accurate masks for video object
segmentation, which captures the local contexts by utilizing all proposals. In
the spatial graph, we treat object proposals of a frame as nodes and represent
their correlations with an edge weight strategy for mask context aggregation.
To capture temporal information from previous frames, we use a memory network
to refine the mask of current frame by retrieving historic masks in a temporal
graph. The joint use of both local patch details and temporal relationships
allows us to better address challenges such as object occlusion and disappearance.
Without online learning and fine-tuning, our STG-Net achieves state-of-the-art
performance on four large benchmarks (DAVIS, YouTube-VOS, SegTrack-v2, and
YouTube-Objects), demonstrating the effectiveness of the proposed approach.
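The two mechanisms the abstract describes (a spatial graph over per-frame proposals with similarity-based edge weights for mask context aggregation, and a temporal memory that retrieves historic masks to refine the current frame) can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the paper's implementation: the function names, the dot-product similarity, and the fixed blending weight `alpha` are all assumptions made for illustration.

```python
import numpy as np

def spatial_aggregate(features, masks):
    """Spatial graph step (illustrative sketch): proposals in a frame are
    nodes; edge weights come from softmax-normalized pairwise feature
    similarity; each node's mask context is a weighted blend of all masks."""
    sim = features @ features.T                  # (N, N) pairwise similarity
    sim -= sim.max(axis=1, keepdims=True)        # softmax numerical stability
    w = np.exp(sim)
    w /= w.sum(axis=1, keepdims=True)            # row-normalized edge weights
    return (w[:, :, None, None] * masks[None]).sum(axis=1)

def temporal_refine(query, mem_keys, mem_masks, current_mask, alpha=0.5):
    """Temporal graph step (illustrative sketch): attend over a memory of
    past-frame keys to retrieve historic masks, then blend the retrieval
    with the current frame's mask."""
    scores = mem_keys @ query                    # (T,) similarity to memory
    scores -= scores.max()
    a = np.exp(scores)
    a /= a.sum()                                 # attention over past frames
    retrieved = (a[:, None, None] * mem_masks).sum(axis=0)
    return alpha * current_mask + (1 - alpha) * retrieved

# Toy example: 3 proposals with 8-d features and 4x4 masks; 5 memory frames.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8))
masks = rng.random(size=(3, 4, 4))
ctx = spatial_aggregate(feats, masks)            # (3, 4, 4) aggregated masks

mem_keys = rng.normal(size=(5, 8))
mem_masks = rng.random(size=(5, 4, 4))
refined = temporal_refine(feats[0], mem_keys, mem_masks, ctx[0])  # (4, 4)
print(ctx.shape, refined.shape)
```

Because both steps are convex combinations (the edge weights and attention weights each sum to one), the aggregated masks stay within the value range of the input masks, which is the intuition behind using all proposals instead of greedily picking one.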
Related papers
- Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend SAM to Few-shot Semantic Segmentation (FSS).
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z)
- End-to-end video instance segmentation via spatial-temporal graph neural
networks [30.748756362692184]
Video instance segmentation is a challenging task that extends image instance segmentation to the video domain.
Existing methods either rely only on single-frame information for the detection and segmentation subproblems or handle tracking as a separate post-processing step.
We propose a novel graph-neural-network (GNN) based method to handle the aforementioned limitation.
arXiv Detail & Related papers (2022-03-07T05:38:08Z)
- GP-S3Net: Graph-based Panoptic Sparse Semantic Segmentation Network [1.9949920338542213]
GP-S3Net is a proposal-free approach in which no object proposals are needed to identify the objects.
Our new design consists of a novel instance-level network to process the semantic results.
Extensive experiments demonstrate that GP-S3Net outperforms the current state-of-the-art approaches.
arXiv Detail & Related papers (2021-08-18T21:49:58Z)
- Target-Aware Object Discovery and Association for Unsupervised Video
Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS$_17$ and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in
Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)
- Learning Spatio-Appearance Memory Network for High-Performance Visual
Tracking [79.80401607146987]
Existing object trackers usually learn a bounding-box based template to match visual targets across frames, which cannot accurately learn a pixel-wise representation.
This paper presents a novel segmentation-based tracking architecture, which is equipped with a spatio-appearance memory network to learn accurate spatio-temporal correspondence.
arXiv Detail & Related papers (2020-09-21T08:12:02Z)
- Towards Accurate Pixel-wise Object Tracking by Attention Retrieval [50.06436600343181]
We propose an attention retrieval network (ARN) to perform soft spatial constraints on backbone features.
We set a new state-of-the-art on the recent pixel-wise object tracking benchmark VOT 2020 while running at 40 fps.
arXiv Detail & Related papers (2020-08-06T16:25:23Z)
- Dual Temporal Memory Network for Efficient Video Object Segmentation [42.05305410986511]
One of the fundamental challenges in Video Object Segmentation (VOS) is how to make the best use of temporal information to boost performance.
We present an end-to-end network which stores short- and long-term video sequence information preceding the current frame as the temporal memories.
Our network consists of two temporal sub-networks including a short-term memory sub-network and a long-term memory sub-network.
arXiv Detail & Related papers (2020-03-13T06:07:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.