Online Video Instance Segmentation via Robust Context Fusion
- URL: http://arxiv.org/abs/2207.05580v1
- Date: Tue, 12 Jul 2022 15:04:50 GMT
- Title: Online Video Instance Segmentation via Robust Context Fusion
- Authors: Xiang Li, Jinglu Wang, Xiaohao Xu, Bhiksha Raj, Yan Lu
- Abstract summary: Video instance segmentation (VIS) aims at classifying, segmenting and tracking object instances in video sequences.
Recent transformer-based neural networks have demonstrated their powerful capability of modeling spatio-temporal correlations for the VIS task.
We propose a robust context fusion network to tackle VIS in an online fashion, which predicts instance segmentation frame-by-frame with a few preceding frames.
- Score: 36.376900904288966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video instance segmentation (VIS) aims at classifying, segmenting and
tracking object instances in video sequences. Recent transformer-based neural
networks have demonstrated their powerful capability of modeling
spatio-temporal correlations for the VIS task. Relying on video- or clip-level
input, they suffer from high latency and computational cost. We propose a
robust context fusion network to tackle VIS in an online fashion, which
predicts instance segmentation frame-by-frame with a few preceding frames. To
acquire precise and temporally consistent predictions for each frame
efficiently, the key idea is to fuse effective and compact context from
reference frames into the target frame. Considering the different effects of
reference and target frames on the target prediction, we first summarize
contextual features through importance-aware compression. A transformer encoder
is adopted to fuse the compressed context. Then, we leverage an
order-preserving instance embedding to convey the identity-aware information
and correspond the identities to predicted instance masks. We demonstrate that
our robust fusion network achieves the best performance among existing online
VIS methods and is even better than previously published clip-level methods on
the YouTube-VIS 2019 and 2021 benchmarks. In addition, visual objects often
have acoustic signatures that are naturally synchronized with them in
audio-bearing video recordings. By leveraging the flexibility of our context
fusion network on multi-modal data, we further investigate the influence of
audio on this video dense-prediction task, which has not been discussed in
existing works. We build an Audio-Visual Instance Segmentation dataset and
demonstrate that acoustic signals in in-the-wild scenarios can benefit the VIS
task.
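The frame-by-frame pipeline described above, importance-aware compression of reference-frame features followed by transformer-based fusion with the target frame, can be pictured with a minimal PyTorch sketch. Module names, feature sizes and the top-k scoring heuristic below are illustrative assumptions rather than the authors' implementation; the order-preserving instance embedding and the mask heads are omitted.

```python
import torch
import torch.nn as nn


class ImportanceAwareCompressor(nn.Module):
    """Scores reference tokens and keeps a compact, importance-weighted subset."""

    def __init__(self, dim, keep=64):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-token importance score (assumed form)
        self.keep = keep

    def forward(self, tokens):
        # tokens: (N, dim) flattened features of one reference frame
        weights = self.score(tokens).squeeze(-1)           # (N,)
        k = min(self.keep, tokens.shape[0])
        idx = weights.topk(k).indices                      # most salient tokens
        return tokens[idx] * weights[idx].sigmoid().unsqueeze(-1)


class ContextFusion(nn.Module):
    """Fuses compressed reference context into target-frame tokens with a
    standard transformer encoder."""

    def __init__(self, dim=256, heads=8, layers=3):
        super().__init__()
        self.compress = ImportanceAwareCompressor(dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, target, references):
        # target: (T, dim) tokens of the current frame
        # references: list of (N_i, dim) token sets from preceding frames
        context = torch.cat([self.compress(r) for r in references], dim=0)
        fused = self.encoder(torch.cat([target, context], dim=0).unsqueeze(0))
        return fused[0, : target.shape[0]]                 # keep only target tokens


if __name__ == "__main__":
    fusion = ContextFusion()
    tgt = torch.randn(300, 256)                            # current-frame tokens
    refs = [torch.randn(300, 256) for _ in range(3)]       # a few preceding frames
    audio = torch.randn(16, 256)                           # hypothetical projected audio tokens
    print(fusion(tgt, refs + [audio]).shape)               # torch.Size([300, 256])
```

Because the fusion operates on generic token sequences, additional modalities can be appended as extra context in the same way: the usage example passes a hypothetical set of projected audio tokens alongside the visual reference frames, mirroring the multi-modal flexibility mentioned in the abstract.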
Related papers
- Context-Aware Video Instance Segmentation [12.71520768233772]
We introduce Context-Aware Video Instance Segmentation (CAVIS), a novel framework designed to enhance instance association.
We propose the Context-Aware Instance Tracker (CAIT), which merges contextual data surrounding the instances with the core instance features to improve tracking accuracy.
We also introduce the Prototypical Cross-frame Contrastive (PCC) loss, which ensures consistency in object-level features across frames (see the sketch after this entry).
arXiv Detail & Related papers (2024-07-03T11:11:16Z)
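The PCC loss named in the Context-Aware Video Instance Segmentation entry above is, in spirit, a contrastive objective over per-instance prototypes computed in neighboring frames. The snippet below is a rough, assumption-based illustration (an InfoNCE-style loss over cross-frame prototype similarities), not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F


def prototypical_cross_frame_contrastive(protos_t, protos_t1, temperature=0.1):
    """Pulls same-identity prototypes across frames together, pushes others apart.
    protos_t, protos_t1: (K, D) instance prototypes in frames t and t+1, where
    row i of both tensors belongs to the same instance identity."""
    p = F.normalize(protos_t, dim=-1)
    q = F.normalize(protos_t1, dim=-1)
    logits = p @ q.t() / temperature           # (K, K) cross-frame similarities
    targets = torch.arange(p.shape[0])         # diagonal entries are the positives
    return F.cross_entropy(logits, targets)
```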
- Collaboratively Self-supervised Video Representation Learning for Action Recognition [58.195372471117615]
We design a Collaboratively Self-supervised Video Representation learning framework specific to action recognition.
Our method achieves state-of-the-art performance on the UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2024-01-15T10:42:04Z)
- DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively (see the sketch after this entry).
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
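For the RefSAM entry above, the sketch below illustrates one plausible form of a lightweight cross-modal projection: a pooled text embedding of the referring expression is mapped to a few sparse prompt tokens plus one dense, broadcastable prompt that a SAM-style mask decoder could consume. Dimensions, token counts and module names are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn


class CrossModalMLP(nn.Module):
    """Projects a text embedding into sparse and dense prompt embeddings
    (illustrative sizes; not RefSAM's released architecture)."""

    def __init__(self, text_dim=768, prompt_dim=256, num_sparse=4):
        super().__init__()
        self.num_sparse = num_sparse
        self.prompt_dim = prompt_dim
        self.sparse_proj = nn.Sequential(
            nn.Linear(text_dim, prompt_dim * num_sparse),
            nn.ReLU(),
            nn.Linear(prompt_dim * num_sparse, prompt_dim * num_sparse),
        )
        self.dense_proj = nn.Linear(text_dim, prompt_dim)

    def forward(self, text_emb):
        # text_emb: (B, text_dim) pooled embedding of the referring expression
        sparse = self.sparse_proj(text_emb).view(-1, self.num_sparse, self.prompt_dim)
        dense = self.dense_proj(text_emb)[:, :, None, None]   # broadcastable over H x W
        return sparse, dense
```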
- RefineVIS: Video Instance Segmentation with Temporal Attention Refinement [23.720986152136785]
RefineVIS learns two separate representations on top of an off-the-shelf frame-level image instance segmentation model.
A Temporal Attention Refinement (TAR) module learns discriminative segmentation representations by exploiting temporal relationships.
It achieves state-of-the-art video instance segmentation accuracy on the YouTube-VIS 2019 (64.4 AP), YouTube-VIS 2021 (61.4 AP), and OVIS (46.1 AP) datasets.
arXiv Detail & Related papers (2023-06-07T20:45:15Z)
- Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS benchmark, as well as on YouTube-VIS, OVIS and BDD100K MOTS.
arXiv Detail & Related papers (2022-07-28T11:13:37Z)
- Siamese Network with Interactive Transformer for Video Object Segmentation [34.202137199782804]
We propose a network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical to current frames.
We employ the same backbone to extract features of both past and current frames, which enables feature reuse and is more efficient than existing methods.
arXiv Detail & Related papers (2021-12-28T03:38:17Z)
- Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm, Propose-Reduce, to generate complete sequences for input videos in a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z)
- End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences.