Video Instance Segmentation via Multi-scale Spatio-temporal Split
Attention Transformer
- URL: http://arxiv.org/abs/2203.13253v1
- Date: Thu, 24 Mar 2022 17:59:20 GMT
- Title: Video Instance Segmentation via Multi-scale Spatio-temporal Split
Attention Transformer
- Authors: Omkar Thawakar, Sanath Narayan, Jiale Cao, Hisham Cholakkal, Rao
Muhammad Anwer, Muhammad Haris Khan, Salman Khan, Michael Felsberg and Fahad
Shahbaz Khan
- Abstract summary: Video instance segmentation (VIS) approaches typically utilize either single-scale spatio-temporal features or per-frame multi-scale features during the attention computation.
We propose a transformer-based VIS framework, named MS-STS VIS, that comprises a novel multi-scale spatio-temporal split (MS-STS) attention module in the encoder.
The MS-STS module effectively captures spatio-temporal feature relationships at multiple scales across frames in a video.
- Score: 77.95612004326055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art transformer-based video instance segmentation (VIS)
approaches typically utilize either single-scale spatio-temporal features or
per-frame multi-scale features during the attention computations. We argue that
such an attention computation ignores the multi-scale spatio-temporal feature
relationships that are crucial to tackle target appearance deformations in
videos. To address this issue, we propose a transformer-based VIS framework,
named MS-STS VIS, that comprises a novel multi-scale spatio-temporal split
(MS-STS) attention module in the encoder. The proposed MS-STS module
effectively captures spatio-temporal feature relationships at multiple scales
across frames in a video. We further introduce an attention block in the
decoder to enhance the temporal consistency of the detected instances in
different frames of a video. Moreover, an auxiliary discriminator is introduced
during training to ensure better foreground-background separability within the
multi-scale spatio-temporal feature space. We conduct extensive experiments on
two benchmarks: Youtube-VIS (2019 and 2021). Our MS-STS VIS achieves
state-of-the-art performance on both benchmarks. When using the ResNet50
backbone, our MS-STS VIS achieves a mask AP of 50.1%, outperforming the best
reported results in the literature by 2.7% and by 4.8% at the higher overlap
threshold of AP_75, while being comparable in model size and speed on the
Youtube-VIS 2019 val. set. When using the Swin Transformer backbone, MS-STS VIS
achieves a mask AP of 61.0% on the Youtube-VIS 2019 val. set. Our code and models
are available at https://github.com/OmkarThawakar/MSSTS-VIS.
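To make the split-attention idea concrete, below is a minimal, hypothetical sketch of split spatio-temporal attention applied per scale of a feature pyramid: temporal attention at each spatial location, followed by spatial attention within each frame. The two-step factorization, the shapes, and the `SplitSpatioTemporalAttention` name are illustrative assumptions, not the authors' exact MS-STS module.

```python
import torch
import torch.nn as nn


class SplitSpatioTemporalAttention(nn.Module):
    """Factorizes full spatio-temporal attention into two cheaper passes."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -- one spatial scale of a T-frame clip, N = H*W.
        B, T, N, C = x.shape

        # Pass 1: temporal attention -- each spatial location attends
        # across the T frames, capturing appearance changes over time.
        t = self.norm1(x).permute(0, 2, 1, 3).reshape(B * N, T, C)
        t, _ = self.temporal_attn(t, t, t)
        x = x + t.reshape(B, N, T, C).permute(0, 2, 1, 3)

        # Pass 2: spatial attention -- each frame attends over its own
        # N locations, refining per-frame structure.
        s = self.norm2(x).reshape(B * T, N, C)
        s, _ = self.spatial_attn(s, s, s)
        return x + s.reshape(B, T, N, C)


if __name__ == "__main__":
    # Apply the block independently at each scale of a small feature
    # pyramid; a cross-scale fusion step (omitted here) would mix scales.
    block = SplitSpatioTemporalAttention(dim=256)
    pyramid = [torch.randn(2, 5, n, 256) for n in (64, 256)]  # two scales
    print([block(level).shape for level in pyramid])
```

Factorizing joint attention over T*H*W tokens into a temporal pass and a spatial pass reduces the quadratic cost from O((T*N)^2) to O(N*T^2 + T*N^2) per scale (N = H*W), which is what makes attending at multiple scales affordable.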
Related papers
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for
Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on HQ-YTVIS, as well as on Youtube-VIS, OVIS and BDD100K MOTS.
arXiv Detail & Related papers (2022-07-28T11:13:37Z)
- DeVIS: Making Deformable Transformers Work for Video Instance Segmentation [4.3012765978447565]
Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences.
Transformers recently made it possible to cast the entire VIS task as a single set-prediction problem.
Deformable attention provides a more efficient alternative, but its application to the temporal domain or the segmentation task has not yet been explored.
arXiv Detail & Related papers (2022-07-22T14:27:45Z) - Temporally Efficient Vision Transformer for Video Instance Segmentation [40.32376033054237]
We propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS).
TeViT is nearly convolution-free, consisting of a transformer backbone and a query-based video instance segmentation head.
On three widely adopted VIS benchmarks, TeViT obtains state-of-the-art results and maintains high inference speed.
arXiv Detail & Related papers (2022-04-18T17:09:20Z) - Deformable VisTR: Spatio temporal deformable attention for video
instance segmentation [79.76273774737555]
The video instance segmentation (VIS) task requires segmenting, classifying, and tracking object instances over all frames in a clip.
Recently, VisTR has been proposed as an end-to-end transformer-based VIS framework, demonstrating state-of-the-art performance.
We propose Deformable VisTR, which leverages a spatio-temporal deformable attention module that only attends to a small fixed set of key spatio-temporal sampling points (see the sketch below).
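As a rough illustration of deformable attention (relevant to both DeVIS above and Deformable VisTR), here is a minimal single-scale sketch in which each query predicts a small fixed set of sampling offsets and soft weights; the module name, shapes, and offset scaling are hypothetical, and real spatio-temporal variants also sample across frames and feature scales.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableSampling(nn.Module):
    """Each query attends to a few predicted locations, not the whole map."""

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Linear(dim, num_points * 2)  # (dx, dy) per point
        self.weights = nn.Linear(dim, num_points)      # soft weight per point
        self.value_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat):
        # queries: (B, Q, C); ref_points: (B, Q, 2) in [-1, 1] grid coords;
        # feat: (B, C, H, W) value feature map.
        B, Q, C = queries.shape
        K = self.num_points

        # Predict K small offsets around each query's reference point.
        offsets = self.offsets(queries).view(B, Q, K, 2).tanh() * 0.1
        locs = (ref_points[:, :, None, :] + offsets).clamp(-1, 1)  # (B,Q,K,2)
        w = self.weights(queries).softmax(dim=-1)                  # (B,Q,K)

        value = self.value_proj(feat.flatten(2).transpose(1, 2))   # (B,HW,C)
        value = value.transpose(1, 2).reshape(B, C, *feat.shape[-2:])

        # Bilinearly sample values at the K locations and mix by weight.
        sampled = F.grid_sample(value, locs, align_corners=False)  # (B,C,Q,K)
        return (sampled * w[:, None]).sum(-1).transpose(1, 2)      # (B,Q,C)


if __name__ == "__main__":
    attn = DeformableSampling(dim=256)
    q = torch.randn(1, 10, 256)          # 10 instance queries
    refs = torch.rand(1, 10, 2) * 2 - 1  # reference points in [-1, 1]
    print(attn(q, refs, torch.randn(1, 256, 32, 32)).shape)
```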
arXiv Detail & Related papers (2022-03-12T02:27:14Z)
- Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
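One plausible, heavily simplified reading of pixel-level aggregation between adjacent-frame feature pyramids is a learned per-pixel gate over (already aligned) previous-frame features. The alignment step and the routing machinery of TPR are omitted, so treat this purely as an assumption-laden sketch.

```python
import torch
import torch.nn as nn


class GatedTemporalAggregation(nn.Module):
    """Per-pixel gate deciding how much previous-frame feature to mix in."""

    def __init__(self, channels: int):
        super().__init__()
        # Gate is conditioned on both frames' features at each pixel.
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, cur, prev):
        # cur, prev: (B, C, H, W) -- the same pyramid level of two
        # adjacent frames, with prev assumed already aligned to cur.
        g = torch.sigmoid(self.gate(torch.cat([cur, prev], dim=1)))
        return cur + g * prev  # route previous-frame evidence where useful


if __name__ == "__main__":
    agg = GatedTemporalAggregation(channels=256)
    pyr_t = [torch.randn(1, 256, s, s) for s in (64, 32, 16)]
    pyr_prev = [torch.randn(1, 256, s, s) for s in (64, 32, 16)]
    print([agg(c, p).shape for c, p in zip(pyr_t, pyr_prev)])
```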
arXiv Detail & Related papers (2021-07-28T03:57:12Z)
- 1st Place Solution for YouTubeVOS Challenge 2021: Video Instance Segmentation [0.39146761527401414]
Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously.
We propose two modules, named Temporally Correlated Instance Segmentation (TCIS) and Bidirectional Tracking (BiTrack).
By combining these techniques with a bag of tricks, the network performance is significantly boosted compared to the baseline.
arXiv Detail & Related papers (2021-06-12T00:20:38Z)
- Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm -- Propose-Reduce -- which generates complete sequences for input videos in a single step (sketched below).
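A schematic reading of a propose-reduce style pipeline: first propose several complete sequence-level mask hypotheses (e.g., by propagating instances from different key frames), then reduce redundancy with a sequence-level NMS. The `sequence_iou` helper and the 0.5 IoU threshold are illustrative assumptions, not details taken from the paper.

```python
import torch


def sequence_iou(masks_a, masks_b):
    # masks_*: (T, H, W) boolean masks for one instance across T frames;
    # IoU is computed over the whole sequence, not per frame.
    inter = (masks_a & masks_b).sum()
    union = (masks_a | masks_b).sum()
    return inter.float() / union.clamp(min=1)


def propose_reduce(proposals, scores, iou_thresh=0.5):
    # proposals: list of (T, H, W) bool tensors (complete sequence-level
    # masks); scores: (N,) confidence per proposal. Greedy sequence-level
    # NMS: keep the best-scoring sequences, drop overlapping duplicates.
    keep = []
    for i in scores.argsort(descending=True).tolist():
        if all(sequence_iou(proposals[i], proposals[j]) < iou_thresh
               for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    props = [torch.rand(4, 16, 16) > 0.5 for _ in range(6)]
    print(propose_reduce(props, torch.rand(6)))
```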
arXiv Detail & Related papers (2021-03-25T10:58:36Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance in both speed and accuracy on the DAVIS benchmark without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
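To make "dynamic time-evolving template matching" concrete, here is a minimal sketch in which a target template is cross-correlated with each frame's features and then updated as a running average so it can follow appearance drift. The EMA update rule and all shapes are assumptions for illustration, not the paper's mechanism.

```python
import torch
import torch.nn.functional as F


def match_and_update(template, frame_feat, momentum=0.9):
    # template: (C, h, w) target exemplar; frame_feat: (C, H, W) features
    # of the current frame (h <= H, w <= W).
    # Cross-correlate the template with the frame to get a response map.
    response = F.conv2d(frame_feat[None], template[None])  # (1,1,H-h+1,W-w+1)

    # Crop the best-matching region and fold it into the template with an
    # exponential moving average, so the template evolves with the target.
    xs = response.shape[-1]
    y, x = divmod(response.view(-1).argmax().item(), xs)
    patch = frame_feat[:, y:y + template.shape[1], x:x + template.shape[2]]
    new_template = momentum * template + (1 - momentum) * patch
    return response[0, 0], new_template


if __name__ == "__main__":
    tmpl = torch.randn(64, 8, 8)
    resp, tmpl = match_and_update(tmpl, torch.randn(64, 32, 32))
    print(resp.shape, tmpl.shape)
```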
arXiv Detail & Related papers (2020-07-11T05:44:16Z)