RefineVIS: Video Instance Segmentation with Temporal Attention
Refinement
- URL: http://arxiv.org/abs/2306.04774v1
- Date: Wed, 7 Jun 2023 20:45:15 GMT
- Title: RefineVIS: Video Instance Segmentation with Temporal Attention
Refinement
- Authors: Andre Abrantes, Jiang Wang, Peng Chu, Quanzeng You, Zicheng Liu
- Abstract summary: RefineVIS learns two separate representations on top of an off-the-shelf frame-level image instance segmentation model.
A Temporal Attention Refinement (TAR) module learns discriminative segmentation representations by exploiting temporal relationships.
It achieves state-of-the-art video instance segmentation accuracy on YouTube-VIS 2019 (64.4 AP), Youtube-VIS 2021 (61.4 AP), and OVIS (46.1 AP) datasets.
- Score: 23.720986152136785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel framework called RefineVIS for Video Instance
Segmentation (VIS) that achieves good object association between frames and
accurate segmentation masks by iteratively refining the representations using
sequence context. RefineVIS learns two separate representations on top of an
off-the-shelf frame-level image instance segmentation model: an association
representation responsible for associating objects across frames and a
segmentation representation that produces accurate segmentation masks.
Contrastive learning is utilized to learn temporally stable association
representations. A Temporal Attention Refinement (TAR) module learns
discriminative segmentation representations by exploiting temporal
relationships and a novel temporal contrastive denoising technique. Our method
supports both online and offline inference. It achieves state-of-the-art video
instance segmentation accuracy on YouTube-VIS 2019 (64.4 AP), Youtube-VIS 2021
(61.4 AP), and OVIS (46.1 AP) datasets. The visualization shows that the TAR
module can generate more accurate instance segmentation masks, particularly for
challenging cases such as highly occluded objects.
Related papers
- Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose a training strategy Masked Video Consistency, which enhances spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z) - Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z) - DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
We present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z) - Multi-grained Temporal Prototype Learning for Few-shot Video Object
Segmentation [156.4142424784322]
Few-Shot Video Object (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z) - Consistent Video Instance Segmentation with Inter-Frame Recurrent
Attention [23.72098615213679]
Video instance segmentation aims at predicting object segmentation masks for each frame, as well as associating the instances across multiple frames.
Recent end-to-end video instance segmentation methods are capable of performing object segmentation and instance association together in a direct parallel sequence decoding/prediction framework.
We propose a consistent end-to-end video instance segmentation framework with Inter-Frame Recurrent Attention to model both the temporal instance consistency for adjacent frames and the global temporal context.
arXiv Detail & Related papers (2022-06-14T17:22:55Z) - Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z) - Target-Aware Object Discovery and Association for Unsupervised Video
Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient unseen-temporal segmentation.
We evaluate the proposed approach on DAVIS$_17$ and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z) - Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm -- Propose-Reduce, to generate complete sequences for input videos by a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z) - End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.