Object Propagation via Inter-Frame Attentions for Temporally Stable
Video Instance Segmentation
- URL: http://arxiv.org/abs/2111.07529v1
- Date: Mon, 15 Nov 2021 04:15:57 GMT
- Title: Object Propagation via Inter-Frame Attentions for Temporally Stable
Video Instance Segmentation
- Authors: Anirudh S Chakravarthy, Won-Dong Jang, Zudi Lin, Donglai Wei, Song
Bai, Hanspeter Pfister
- Abstract summary: Video instance segmentation aims to detect, segment, and track objects in a video.
Current approaches extend image-level segmentation algorithms to the temporal domain.
We propose a video instance segmentation method that alleviates the problem of missing detections.
- Score: 51.68840525174265
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Video instance segmentation aims to detect, segment, and track objects in a
video. Current approaches extend image-level segmentation algorithms to the
temporal domain. However, this results in temporally inconsistent masks. In
this work, we identify mask quality, limited by temporal instability, as a
performance bottleneck. Motivated by this, we propose a video instance
segmentation method that alleviates the problem of missing detections.
Since this cannot be solved simply using spatial information, we leverage
temporal context using inter-frame attentions. This allows our network to
refocus on missing objects using box predictions from the neighbouring frame,
thereby overcoming missing detections. Our method significantly outperforms
previous state-of-the-art algorithms using the Mask R-CNN backbone, by
achieving 35.1% mAP on the YouTube-VIS benchmark. Additionally, our method is
completely online and requires no future frames. Our code is publicly available
at https://github.com/anirudh-chakravarthy/ObjProp.
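As a rough illustration of the inter-frame attention described in the abstract (a minimal sketch, not the authors' ObjProp implementation; the function name, tensor shapes, and scaled dot-product form are all assumptions), an object feature pooled from the box predicted in the neighbouring frame can attend over the current frame's feature map to re-localize an instance the detector missed:

```python
import torch
import torch.nn.functional as F


def inter_frame_attention(prev_obj_feat, curr_feat_map):
    """Attend over the current frame with an object feature from frame t-1.

    prev_obj_feat: (C,) feature pooled (e.g. via RoIAlign) from the box
        predicted in the neighbouring frame.
    curr_feat_map: (C, H, W) backbone feature map of the current frame.
    """
    C, H, W = curr_feat_map.shape
    keys = curr_feat_map.view(C, H * W)            # (C, HW)
    # Scaled dot-product similarity between the object query and every
    # spatial location of the current frame.
    logits = (prev_obj_feat @ keys) / (C ** 0.5)   # (HW,)
    attn = F.softmax(logits, dim=0)
    # Attention-weighted sum: a propagated object feature that re-focuses
    # on the object even when the current-frame detector missed it.
    propagated = keys @ attn                       # (C,)
    return attn.view(H, W), propagated


# Toy usage with random tensors.
obj = torch.randn(256)               # object feature from frame t-1
fmap = torch.randn(256, 32, 32)      # current-frame features
attn_map, feat = inter_frame_attention(obj, fmap)
print(attn_map.shape, feat.shape)    # torch.Size([32, 32]) torch.Size([256])
```

In the full method the propagated feature would feed the detection and segmentation heads; the sketch shows only the attention step.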
Related papers
- Space-time Reinforcement Network for Video Object Segmentation [16.67780344875854]
Video object segmentation (VOS) networks typically use memory-based methods.
These methods suffer from two issues: 1) challenging data can disrupt the space-time coherence between adjacent video frames, and 2) pixel-level matching can produce undesired mismatches.
In this paper, we propose generating an auxiliary frame between adjacent frames to serve as an implicit short-term temporal reference for the query frame.
arXiv Detail & Related papers (2024-05-07T06:26:30Z) - Detect Any Shadow: Segment Anything for Video Shadow Detection [105.19693622157462]
We propose ShadowSAM, a framework for fine-tuning the Segment Anything Model (SAM) to detect shadows.
By combining it with a long short-term attention mechanism, we extend its capability to efficient video shadow detection.
Our method achieves faster inference than previous video shadow detection approaches.
arXiv Detail & Related papers (2023-05-26T07:39:10Z) - Mask-Free Video Instance Segmentation [102.50936366583106]
Video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets.
We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state.
Our TK-Loss finds one-to-many matches across frames through an efficient patch-matching step followed by a K-nearest-neighbour selection (see the matching sketch after this list).
arXiv Detail & Related papers (2023-03-28T11:48:07Z) - End-to-end video instance segmentation via spatial-temporal graph neural
networks [30.748756362692184]
Video instance segmentation is a challenging task that extends image instance segmentation to the video domain.
Existing methods either rely only on single-frame information for the detection and segmentation subproblems or handle tracking as a separate post-processing step.
We propose a novel graph-neural-network (GNN) based method to address the aforementioned limitations.
arXiv Detail & Related papers (2022-03-07T05:38:08Z) - Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in
Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance, leading to state-of-the-art results in both the VOS and the more challenging tracking domains.
arXiv Detail & Related papers (2021-01-06T18:56:24Z) - Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised
Video Object Segmentation [27.559093073097483]
Current approaches for semi-supervised video object segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame.
We observe that consecutive frames often change very little, and exploit this by using temporal information to quickly identify frames with minimal change.
We propose a novel dynamic network that estimates the change across frames and decides which path to take: running the full network or reusing the previous frame's features (see the gate sketch after this list).
arXiv Detail & Related papers (2020-12-21T19:40:17Z) - Spatiotemporal Graph Neural Network based Mask Reconstruction for Video
Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting.
We propose a novel spatiotemporal graph neural network (STG-Net) that captures local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z) - Towards Accurate Pixel-wise Object Tracking by Attention Retrieval [50.06436600343181]
We propose an attention retrieval network (ARN) to apply soft spatial constraints to backbone features.
We set a new state-of-the-art on the recent pixel-wise object tracking benchmark VOT2020 while running at 40 fps.
arXiv Detail & Related papers (2020-08-06T16:25:23Z)