Consistent Video Instance Segmentation with Inter-Frame Recurrent
Attention
- URL: http://arxiv.org/abs/2206.07011v1
- Date: Tue, 14 Jun 2022 17:22:55 GMT
- Title: Consistent Video Instance Segmentation with Inter-Frame Recurrent
Attention
- Authors: Quanzeng You, Jiang Wang, Peng Chu, Andre Abrantes, Zicheng Liu
- Abstract summary: Video instance segmentation aims at predicting object segmentation masks for each frame, as well as associating the instances across multiple frames.
Recent end-to-end video instance segmentation methods are capable of performing object segmentation and instance association together in a direct parallel sequence decoding/prediction framework.
We propose a consistent end-to-end video instance segmentation framework with Inter-Frame Recurrent Attention to model both the temporal instance consistency for adjacent frames and the global temporal context.
- Score: 23.72098615213679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video instance segmentation aims at predicting object segmentation masks for
each frame, as well as associating the instances across multiple frames. Recent
end-to-end video instance segmentation methods are capable of performing object
segmentation and instance association together in a direct parallel sequence
decoding/prediction framework. Although these methods generally predict higher
quality object segmentation masks, they can fail to associate instances in
challenging cases because they do not explicitly model the temporal instance
consistency for adjacent frames. We propose a consistent end-to-end video
instance segmentation framework with Inter-Frame Recurrent Attention to model
both the temporal instance consistency for adjacent frames and the global
temporal context. Our extensive experiments demonstrate that the Inter-Frame
Recurrent Attention significantly improves temporal instance consistency while
maintaining the quality of the object segmentation masks. Our model achieves
state-of-the-art accuracy on both YouTubeVIS-2019 (62.1%) and YouTubeVIS-2021
(54.7%) datasets. In addition, quantitative and qualitative results show that
the proposed methods predict more temporally consistent instance segmentation
masks.
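The abstract describes recurrent attention that carries instance-level information from one frame to the next so the same query slot keeps tracking the same object. The paper's exact architecture is not given here, so the following is only a minimal NumPy sketch of that idea under assumed shapes: `frame_features`, `init_queries`, and the residual mixing weight are all illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_frame_recurrent_attention(frame_features, init_queries):
    """Process frames in order; each frame's instance queries start from the
    previous frame's output queries (the recurrent link), so slot i keeps
    representing the same instance across frames.

    frame_features: list of (num_pixels, d) arrays, one per frame.
    init_queries:   (num_instances, d) learned initial instance queries.
    Returns one (num_instances, d) query array per frame.
    """
    queries = init_queries
    per_frame_queries = []
    for feats in frame_features:
        # Cross-attention: queries attend to the current frame's features.
        attn = softmax(queries @ feats.T / np.sqrt(feats.shape[-1]))
        updated = attn @ feats  # aggregate evidence from this frame
        # Recurrent residual mix with the previous frame's queries,
        # encouraging temporal instance consistency between adjacent frames.
        queries = 0.5 * queries + 0.5 * updated
        per_frame_queries.append(queries)
    return per_frame_queries
```

Because the queries are updated in place rather than re-initialized per frame, instance association falls out of the slot identity instead of a separate matching step, which is the consistency property the abstract emphasizes.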
Related papers
- Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose a training strategy Masked Video Consistency, which enhances spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z)
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z)
- RefineVIS: Video Instance Segmentation with Temporal Attention Refinement [23.720986152136785]
RefineVIS learns two separate representations on top of an off-the-shelf frame-level image instance segmentation model.
A Temporal Attention Refinement (TAR) module learns discriminative segmentation representations by exploiting temporal relationships.
It achieves state-of-the-art video instance segmentation accuracy on YouTube-VIS 2019 (64.4 AP), YouTube-VIS 2021 (61.4 AP), and OVIS (46.1 AP) datasets.
arXiv Detail & Related papers (2023-06-07T20:45:15Z)
- Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach that produces instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on YouTube-VIS and DAVIS-19 datasets, with minimal run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
- Contextual Guided Segmentation Framework for Semi-supervised Video Instance Segmentation [20.174393465900156]
We propose Contextual Guided (CGS) framework for video instance segmentation in three passes.
In the first pass, i.e., preview segmentation, we propose Instance Re-Identification Flow to estimate main properties of each instance.
In the second pass, i.e., contextual segmentation, we introduce multiple contextual segmentation schemes.
Experiments conducted on the DAVIS Test-Challenge dataset demonstrate the effectiveness of our proposed framework.
arXiv Detail & Related papers (2021-06-07T04:16:50Z)
- Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient video multi-object segmentation.
We evaluate the proposed approach on DAVIS$_17$ and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)
- Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency [28.352140544936198]
Weakly supervised instance segmentation reduces the cost of annotations required to train models.
We show that these issues can be better addressed by training with weakly labeled videos instead of images.
We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation.
arXiv Detail & Related papers (2021-03-23T23:20:46Z)
- End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion at inference time.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)
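The last entry above narrows the compact-vs-large-model gap with knowledge distillation. The paper's specific distillation methods are not detailed here, so the following is only a generic sketch of soft-label distillation over per-pixel segmentation logits; the function name, shapes, and temperature are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Per-pixel soft-label distillation: KL divergence between the
    teacher's and student's temperature-softened class distributions,
    averaged over pixels. Logits: (num_pixels, num_classes)."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = np.sum(t * (np.log(t + 1e-12) - np.log(s + 1e-12)), axis=-1)
    # Scale by T^2 so gradients keep a comparable magnitude across temperatures.
    return float(kl.mean() * temperature ** 2)
```

In training, a loss like this would be added to the student's usual cross-entropy term, letting the compact per-frame model mimic the large model's class distributions.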
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.