Deformable VisTR: Spatio temporal deformable attention for video
instance segmentation
- URL: http://arxiv.org/abs/2203.06318v1
- Date: Sat, 12 Mar 2022 02:27:14 GMT
- Title: Deformable VisTR: Spatio temporal deformable attention for video
instance segmentation
- Authors: Sudhir Yarram, Jialian Wu, Pan Ji, Yi Xu, Junsong Yuan
- Abstract summary: Video instance segmentation (VIS) task requires segmenting, classifying, and tracking object instances over all frames in a clip.
Recently, VisTR has been proposed as an end-to-end transformer-based VIS framework, demonstrating state-of-the-art performance.
We propose Deformable VisTR, leveraging a spatio-temporal deformable attention module that only attends to a small fixed set of key spatio-temporal sampling points.
- Score: 79.76273774737555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The video instance segmentation (VIS) task requires classifying, segmenting, and
tracking object instances over all frames in a video clip. Recently, VisTR has
been proposed as an end-to-end transformer-based VIS framework, demonstrating
state-of-the-art performance. However, VisTR is slow to converge during
training, requiring around 1000 GPU hours due to the high computational cost of
its transformer attention module. To improve the training efficiency, we
propose Deformable VisTR, leveraging a spatio-temporal deformable attention
module that only attends to a small fixed set of key spatio-temporal sampling
points around a reference point. This enables Deformable VisTR to achieve
computation that is linear in the size of the spatio-temporal feature maps.
Moreover, it achieves on-par performance with the original VisTR with
10$\times$ fewer GPU training hours. We validate the effectiveness of our
method on the Youtube-VIS benchmark. Code is available at
https://github.com/skrya/DefVIS.
Related papers
- Two-Level Temporal Relation Model for Online Video Instance Segmentation [3.9349485816629888]
We propose an online method that is on par with the performance of the offline counterparts.
We introduce a message-passing graph neural network that encodes objects and relates them through time.
Our model, trained end-to-end, achieves state-of-the-art performance on the YouTube-VIS dataset.
arXiv Detail & Related papers (2022-10-30T10:01:01Z) - Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS, as well as the Youtube-VIS, OVIS and BDD100K MOTS.
arXiv Detail & Related papers (2022-07-28T11:13:37Z) - DeVIS: Making Deformable Transformers Work for Video Instance
Segmentation [4.3012765978447565]
Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences.
Transformers have recently allowed casting the entire VIS task as a single set-prediction problem.
Deformable attention provides a more efficient alternative, but its application to the temporal domain and the segmentation task has not yet been explored.
arXiv Detail & Related papers (2022-07-22T14:27:45Z) - Temporally Efficient Vision Transformer for Video Instance Segmentation [40.32376033054237]
We propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS).
TeViT is nearly convolution-free, containing a transformer backbone and a query-based video instance segmentation head.
On three widely adopted VIS benchmarks, TeViT obtains state-of-the-art results and maintains high inference speed.
arXiv Detail & Related papers (2022-04-18T17:09:20Z) - Video Instance Segmentation via Multi-scale Spatio-temporal Split
Attention Transformer [77.95612004326055]
Video instance segmentation (VIS) approaches typically utilize either single-scale spatio-temporal features or per-frame multi-scale features during attention computation.
We propose a transformer-based VIS framework, named MS-STS VIS, that comprises a novel multi-scale spatio-temporal split (MS-STS) attention module in the encoder.
The MS-STS module effectively captures spatio-temporal feature relationships at multiple scales across frames in a video.
arXiv Detail & Related papers (2022-03-24T17:59:20Z) - Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video instance segmentation (VIS) aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z) - End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z) - Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance in both speed and accuracy on the DAVIS benchmark without complicated bells and whistles, with a speed of 0.14 seconds per frame and a J&F measure of 75.9%, respectively.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)