InstanceFormer: An Online Video Instance Segmentation Framework
- URL: http://arxiv.org/abs/2208.10547v1
- Date: Mon, 22 Aug 2022 18:54:18 GMT
- Title: InstanceFormer: An Online Video Instance Segmentation Framework
- Authors: Rajat Koner, Tanveer Hannan, Suprosanna Shit, Sahand Sharifzadeh,
Matthias Schubert, Thomas Seidl, Volker Tresp
- Abstract summary: We propose a single-stage transformer-based efficient online VIS framework named InstanceFormer.
We propose three novel components to model short-term and long-term dependency and temporal coherence.
The proposed InstanceFormer outperforms previous online benchmark methods by a large margin across multiple datasets.
- Score: 21.760243214387987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent transformer-based offline video instance segmentation (VIS) approaches
achieve encouraging results and significantly outperform online approaches.
However, their reliance on the whole video and the immense computational
complexity of full spatio-temporal attention limit their use in real-life
applications such as processing lengthy videos. In this paper, we propose a
single-stage transformer-based efficient online VIS framework named
InstanceFormer, which is especially suitable for long and challenging videos.
We propose three novel components to model short-term and long-term dependency
and temporal coherence. First, we propagate the representation, location, and
semantic information of prior instances to model short-term changes. Second, we
propose a novel memory cross-attention in the decoder, which allows the network
to look into earlier instances within a certain temporal window. Finally, we
employ a temporal contrastive loss to impose coherence in the representation of
an instance across all frames. Memory attention and temporal coherence are
particularly beneficial to long-range dependency modeling, including
challenging scenarios like occlusion. The proposed InstanceFormer outperforms
previous online benchmark methods by a large margin across multiple datasets.
Most importantly, InstanceFormer surpasses offline approaches for challenging
and long datasets such as YouTube-VIS-2021 and OVIS. Code is available at
https://github.com/rajatkoner08/InstanceFormer.
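To make the third component concrete, the temporal contrastive loss can be thought of as an InfoNCE-style objective: embeddings of the same instance in different frames act as positives, while embeddings of other instances act as negatives. The following minimal NumPy sketch illustrates the idea; the function name, the InfoNCE form, and the temperature value are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def temporal_contrastive_loss(embeddings, instance_ids, temperature=0.1):
    """InfoNCE-style loss over per-frame instance embeddings.

    embeddings:   (N, D) array, one embedding per detected instance per frame.
    instance_ids: (N,) array, identity of the instance each embedding belongs to.
    Embeddings of the same instance in different frames are positives;
    all other embeddings serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T / temperature          # (N, N) similarity logits
    np.fill_diagonal(sim, -np.inf)           # exclude trivial self-pairs

    losses = []
    for i in range(len(emb)):
        pos = instance_ids == instance_ids[i]
        pos[i] = False
        if not pos.any():
            continue                         # instance visible in one frame only
        log_denom = np.log(np.exp(sim[i]).sum())
        # average -log p(positive) over all positives of instance i
        losses.append(np.mean(log_denom - sim[i][pos]))
    return float(np.mean(losses))

# Toy example: two instances, each tracked across three frames
rng = np.random.default_rng(0)
base = rng.normal(size=(2, 8))
emb = np.concatenate([base + 0.05 * rng.normal(size=(2, 8)) for _ in range(3)])
ids = np.tile([0, 1], 3)
loss = temporal_contrastive_loss(emb, ids)
```

Minimizing this loss pulls per-frame representations of the same instance together, which is what imposes the cross-frame coherence the abstract describes.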
Related papers
- Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams [78.72965584414368]
We present Flash-VStream, a video-language model that simulates the memory mechanism of human.
Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption.
We propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding.
arXiv Detail & Related papers (2024-06-12T11:07:55Z)
- TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking [33.75267864844047]
Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings.
We propose a novel, clip-based DETR-style encoder-decoder architecture, which focuses on systematically analyzing and addressing aforementioned challenges.
Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations.
arXiv Detail & Related papers (2023-12-13T21:02:03Z)
- DVIS: Decoupled Video Instance Segmentation Framework [15.571072365208872]
Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing.
Existing methods often underperform on complex and long videos in the real world, primarily due to two factors.
We propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement.
arXiv Detail & Related papers (2023-06-06T05:24:15Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation stems from the observation that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Robust Online Video Instance Segmentation with Track Queries [15.834703258232002]
We propose a fully online transformer-based video instance segmentation model that performs comparably to top offline methods on the YouTube-VIS 2019 benchmark.
We show that, when combined with a strong enough image segmentation architecture, track queries can exhibit impressive accuracy while not being constrained to short videos.
arXiv Detail & Related papers (2022-11-16T18:50:14Z)
- Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z)
- Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video instance segmentation (VIS) aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z)
- Hybrid Instance-aware Temporal Fusion for Online Video Instance Segmentation [23.001856276175506]
We propose an online video instance segmentation framework with a novel instance-aware temporal fusion method.
Our model achieves the best performance among all online VIS methods.
arXiv Detail & Related papers (2021-12-03T03:37:57Z)
- Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video instance segmentation (VIS) is a new and inherently multi-task problem that aims to detect, segment, and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z)
- 1st Place Solution for YouTubeVOS Challenge 2021: Video Instance Segmentation [0.39146761527401414]
Video instance segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously.
We propose two modules, named Temporally Correlated Instance Segmentation (TCIS) and Bidirectional Tracking (BiTrack).
By combining these techniques with a bag of tricks, the network performance is significantly boosted compared to the baseline.
arXiv Detail & Related papers (2021-06-12T00:20:38Z)
- End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.