Hybrid Instance-aware Temporal Fusion for Online Video Instance
Segmentation
- URL: http://arxiv.org/abs/2112.01695v1
- Date: Fri, 3 Dec 2021 03:37:57 GMT
- Title: Hybrid Instance-aware Temporal Fusion for Online Video Instance
Segmentation
- Authors: Xiang Li, Jinglu Wang, Xiao Li, Yan Lu
- Abstract summary: We propose an online video instance segmentation framework with a novel instance-aware temporal fusion method.
Our model achieves the best performance among all online VIS methods.
- Score: 23.001856276175506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, transformer-based image segmentation methods have achieved notable
success against previous solutions. For the video domain, however, how to effectively
model temporal context with attention across object instances in different frames
remains an open problem. In this paper, we propose an online video instance
segmentation framework with a novel instance-aware temporal fusion method. We
first leverage a hybrid representation, i.e., a latent code in the global context
(the instance code) and CNN feature maps, to capture instance- and pixel-level
features respectively. Based on this representation, we introduce a cropping-free temporal
fusion approach to model the temporal consistency between video frames.
Specifically, we encode global instance-specific information in the instance
code and build up inter-frame contextual fusion with hybrid attentions between
the instance codes and CNN feature maps. Inter-frame consistency between the
instance codes is further enforced with order constraints. By leveraging the
learned hybrid temporal consistency, we are able to directly retrieve and
maintain instance identities across frames, eliminating the complicated
frame-wise instance matching in prior methods. Extensive experiments have been
conducted on popular VIS datasets, i.e., YouTube-VIS-19/21. Our model achieves
the best performance among all online VIS methods. Notably, our model also
eclipses all offline methods when using the ResNet-50 backbone.
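To make the hybrid attention idea concrete, here is a minimal PyTorch sketch of cross-attention between latent instance codes and flattened CNN feature maps. All module and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HybridTemporalFusion(nn.Module):
    """Illustrative cross-attention between per-frame instance codes
    and flattened CNN feature maps (not the paper's exact design)."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # instance codes attend to pixel features of an adjacent frame
        self.inst_to_pix = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # pixel features attend back to the fused instance codes
        self.pix_to_inst = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_inst = nn.LayerNorm(dim)
        self.norm_pix = nn.LayerNorm(dim)

    def forward(self, inst_codes, feat_map):
        # inst_codes: (B, N, C) latent codes, one per instance slot
        # feat_map:   (B, C, H, W) CNN features of an adjacent frame
        B, C, H, W = feat_map.shape
        pix = feat_map.flatten(2).transpose(1, 2)           # (B, H*W, C)
        # instance-level fusion: codes query pixel features
        fused, _ = self.inst_to_pix(inst_codes, pix, pix)
        inst_codes = self.norm_inst(inst_codes + fused)
        # pixel-level fusion: pixels query the updated codes
        ctx, _ = self.pix_to_inst(pix, inst_codes, inst_codes)
        pix = self.norm_pix(pix + ctx)
        return inst_codes, pix.transpose(1, 2).reshape(B, C, H, W)

fusion = HybridTemporalFusion()
codes, feats = torch.randn(2, 10, 256), torch.randn(2, 256, 32, 32)
codes, feats = fusion(codes, feats)
```

Because the instance codes keep their slot order across frames, the index of a code can serve directly as the instance identity, which is what removes the need for frame-wise matching.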
Related papers
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
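The propagation step can be pictured as a space-time memory readout in the style of STM-based VOS. The following sketch is a generic formulation under that assumption, not this paper's exact hybrid memory.

```python
import torch
import torch.nn.functional as F

def memory_readout(query_feat, mem_feats, mem_masks):
    """Generic space-time memory readout sketch: propagate mask
    information from reference frames to the query frame.
    query_feat: (C, H, W); mem_feats: (T, C, H, W); mem_masks: (T, 1, H, W)."""
    C, H, W = query_feat.shape
    q = query_feat.flatten(1)                                  # (C, HW)
    k = mem_feats.flatten(2).permute(1, 0, 2).reshape(C, -1)   # (C, T*HW)
    v = mem_masks.flatten(2).permute(1, 0, 2).reshape(1, -1)   # (1, T*HW)
    # each query pixel attends over all memory pixels
    attn = F.softmax(k.t() @ q / C ** 0.5, dim=0)              # (T*HW, HW)
    return (v @ attn).view(1, H, W)   # soft mask for the query frame
```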
arXiv Detail & Related papers (2024-03-28T13:32:49Z) - InsPro: Propagating Instance Query and Proposal for Online Video
Instance Segmentation [41.85216306978024]
Video instance segmentation (VIS) aims at segmenting and tracking objects in videos.
Prior methods generate frame-level or clip-level object instances first and then associate them by either additional tracking heads or complex instance matching algorithms.
In this paper, we design a simple, fast and yet effective query-based framework for online VIS.
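A minimal sketch of the query-propagation idea, assuming a standard transformer decoder: the decoded queries of one frame seed the next frame's decoder, so each query index keeps tracking the same instance without extra association.

```python
import torch
import torch.nn as nn

# Hedged sketch of query propagation for online VIS: queries decoded at
# frame t become the input queries at frame t+1, so identity is kept by
# query index rather than by an explicit matching step.
decoder_layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)

num_queries, dim = 10, 256
queries = torch.randn(1, num_queries, dim)       # learned initial queries
for frame_feats in torch.randn(5, 1, 400, dim):  # 5 frames of flattened features
    queries = decoder(tgt=queries, memory=frame_feats)
    # per-frame heads (class/mask) would read `queries` here;
    # query i refers to the same instance in every frame
```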
arXiv Detail & Related papers (2023-01-05T02:41:20Z) - Two-Level Temporal Relation Model for Online Video Instance Segmentation [3.9349485816629888]
We propose an online method whose performance is on par with that of offline counterparts.
We introduce a message-passing graph neural network that encodes objects and relates them through time.
Our model is trained end-to-end and achieves state-of-the-art performance on the YouTube-VIS dataset.
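A hedged sketch of one message-passing round between instance nodes of adjacent frames; the MLP structure and mean aggregation are assumptions for illustration, not the paper's exact graph.

```python
import torch
import torch.nn as nn

class TemporalMessagePassing(nn.Module):
    """Sketch of one message-passing round between instance nodes of
    two frames (names illustrative)."""
    def __init__(self, dim=128):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, nodes_t, nodes_prev):
        # nodes_t: (N, C) instances at frame t; nodes_prev: (M, C) at t-1
        N, M = nodes_t.size(0), nodes_prev.size(0)
        # build all temporal edges (t-1 -> t) and compute messages
        pairs = torch.cat([nodes_t.unsqueeze(1).expand(N, M, -1),
                           nodes_prev.unsqueeze(0).expand(N, M, -1)], dim=-1)
        messages = self.edge_mlp(pairs).mean(dim=1)  # aggregate over t-1 nodes
        return self.node_mlp(torch.cat([nodes_t, messages], dim=-1))

mp = TemporalMessagePassing()
updated = mp(torch.randn(6, 128), torch.randn(5, 128))
```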
arXiv Detail & Related papers (2022-10-30T10:01:01Z) - InstanceFormer: An Online Video Instance Segmentation Framework [21.760243214387987]
We propose a single-stage transformer-based efficient online VIS framework named InstanceFormer.
We propose three novel components to model short-term and long-term dependency and temporal coherence.
The proposed InstanceFormer outperforms previous online benchmark methods by a large margin across multiple datasets.
arXiv Detail & Related papers (2022-08-22T18:54:18Z) - Consistent Video Instance Segmentation with Inter-Frame Recurrent
Attention [23.72098615213679]
Video instance segmentation aims at predicting object segmentation masks for each frame, as well as associating the instances across multiple frames.
Recent end-to-end video instance segmentation methods are capable of performing object segmentation and instance association together in a direct parallel sequence decoding/prediction framework.
We propose a consistent end-to-end video instance segmentation framework with Inter-Frame Recurrent Attention to model both the temporal instance consistency for adjacent frames and the global temporal context.
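The recurrent-attention idea can be sketched as attending a carried instance state to the current frame's features, then updating it with a recurrent cell; the GRU update here is an assumption for illustration.

```python
import torch
import torch.nn as nn

class InterFrameRecurrentAttention(nn.Module):
    """Sketch: instance embeddings are carried frame to frame and
    refined by attending to the current frame, keeping adjacent-frame
    instance representations consistent."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gru = nn.GRUCell(dim, dim)

    def step(self, inst_state, frame_feats):
        # inst_state: (N, C) carried over from frame t-1
        # frame_feats: (1, L, C) flattened features of frame t
        q = inst_state.unsqueeze(0)                      # (1, N, C)
        ctx, _ = self.attn(q, frame_feats, frame_feats)  # attend to frame t
        # recurrent update smooths the per-instance temporal state
        return self.gru(ctx.squeeze(0), inst_state)

model = InterFrameRecurrentAttention()
state = torch.randn(10, 256)
for feats in torch.randn(4, 1, 300, 256):  # 4 frames
    state = model.step(state, feats)
```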
arXiv Detail & Related papers (2022-06-14T17:22:55Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple, end-to-end trainable bottom-up approach that produces instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has minimal run-time compared to other contemporary state-of-the-art methods.
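Tag-based bottom-up grouping is commonly trained with a pull/push associative-embedding loss. The generic formulation below illustrates the grouping principle only; it is not claimed to be this paper's exact loss.

```python
import torch

def associative_embedding_loss(tags, inst_ids):
    """Generic pull/push tag loss for bottom-up instance grouping.
    tags: (P,) predicted tag per pixel; inst_ids: (P,) instance labels."""
    means, pull = [], 0.0
    for i in inst_ids.unique():
        t = tags[inst_ids == i]
        m = t.mean()
        pull = pull + ((t - m) ** 2).mean()   # pull pixels toward their mean
        means.append(m)
    means = torch.stack(means)
    diff = means[:, None] - means[None, :]    # push instance means apart
    push = torch.exp(-diff ** 2).sum() - len(means)  # exclude self-pairs
    return pull / len(means) + push / max(len(means) * (len(means) - 1), 1)

loss = associative_embedding_loss(torch.randn(100), torch.randint(0, 4, (100,)))
```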
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
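A common form of temporal consistency loss warps the previous frame's prediction to the current frame with optical flow and penalizes disagreement. This sketch assumes backward flow is available and omits occlusion masking for brevity.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(pred_t, pred_prev, flow):
    """Sketch of a flow-based temporal consistency loss.
    pred_*: (B, C, H, W) logits; flow: (B, 2, H, W) backward flow t -> t-1."""
    B, _, H, W = flow.shape
    # base sampling grid in [-1, 1] (grid_sample convention, x first)
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2)
    # pixel displacements normalized to the [-1, 1] grid
    disp = torch.stack((flow[:, 0] * 2 / (W - 1),
                        flow[:, 1] * 2 / (H - 1)), dim=-1)
    warped = F.grid_sample(pred_prev, base + disp, align_corners=True)
    return F.mse_loss(pred_t, warped)
```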
arXiv Detail & Related papers (2022-02-24T23:51:36Z) - Video Instance Segmentation by Instance Flow Assembly [23.001856276175506]
Bottom-up methods dealing with box-free features could offer accurate spatial correlations across frames.
We propose our framework equipped with a temporal context fusion module to better encode inter-frame correlations.
Experiments demonstrate that the proposed method outperforms the state-of-the-art online methods (taking image-level input) on the challenging Youtube-VIS dataset.
arXiv Detail & Related papers (2021-10-20T14:49:28Z) - Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
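A simplified stand-in for the routing step: a learned per-pixel gate mixes features of adjacent frames at each pyramid level. The dynamic alignment that TPR also performs is omitted here; the previous frame's features are assumed pre-aligned.

```python
import torch
import torch.nn as nn

class PyramidGatedFusion(nn.Module):
    """Per pyramid level, a 1x1-conv gate decides how much of the
    (assumed pre-aligned) previous frame's features to mix in."""
    def __init__(self, channels=(256, 256, 256)):
        super().__init__()
        self.gates = nn.ModuleList(
            nn.Conv2d(2 * c, 1, kernel_size=1) for c in channels)

    def forward(self, pyr_t, pyr_prev):
        fused = []
        for gate, f_t, f_p in zip(self.gates, pyr_t, pyr_prev):
            g = torch.sigmoid(gate(torch.cat([f_t, f_p], dim=1)))
            fused.append(g * f_p + (1 - g) * f_t)  # pixel-wise routing weight
        return fused

fpn = PyramidGatedFusion()
pyr = [torch.randn(1, 256, s, s) for s in (64, 32, 16)]
out = fpn(pyr, [f.clone() for f in pyr])
```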
arXiv Detail & Related papers (2021-07-28T03:57:12Z) - End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
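VisTR's parallel decoding can be pictured as decoding one query per (instance, frame) slot jointly over clip-level features; the sketch below uses a plain nn.TransformerDecoder as a stand-in for the paper's architecture.

```python
import torch
import torch.nn as nn

# Sketch of VisTR-style parallel decoding: all instance-frame slots are
# decoded at once over features of the whole clip, so instance sequences
# come out ordered and need no post-hoc matching.
T, N, dim = 6, 10, 256                    # frames, instances per frame, width
layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)

clip_feats = torch.randn(1, T * 400, dim)      # flattened clip features
queries = torch.randn(1, T * N, dim)           # all instance-frame slots
out = decoder(tgt=queries, memory=clip_feats)  # (1, T*N, dim)
out = out.reshape(1, T, N, dim)  # slot n in every frame is the same instance
```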
arXiv Detail & Related papers (2020-11-30T02:03:50Z) - Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion during inference.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
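As a generic illustration of distilling a large teacher into a compact student, here is the standard soft-label KL loss at temperature tau; the paper's distillation terms are more specialized than this.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, tau=2.0):
    """Standard soft-label distillation: KL divergence between
    temperature-softened teacher and student distributions."""
    s = F.log_softmax(student_logits / tau, dim=1)
    t = F.softmax(teacher_logits / tau, dim=1)
    # batchmean KL, scaled by tau^2 as in Hinton et al.
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2

loss = distillation_loss(torch.randn(4, 19, 64, 64), torch.randn(4, 19, 64, 64))
```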
arXiv Detail & Related papers (2020-02-26T12:24:32Z)