Temporally Efficient Vision Transformer for Video Instance Segmentation
- URL: http://arxiv.org/abs/2204.08412v1
- Date: Mon, 18 Apr 2022 17:09:20 GMT
- Title: Temporally Efficient Vision Transformer for Video Instance Segmentation
- Authors: Shusheng Yang, Xinggang Wang, Yu Li, Yuxin Fang, Jiemin Fang, Wenyu
Liu, Xun Zhao, Ying Shan
- Abstract summary: We propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS)
TeViT is nearly convolution-free, which contains a transformer backbone and a query-based video instance segmentation head.
On three widely adopted VIS benchmarks, TeViT obtains state-of-the-art results and maintains high inference speed.
- Score: 40.32376033054237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, vision transformers have achieved tremendous success on image-level
visual recognition tasks. To effectively and efficiently model the crucial
temporal information within a video clip, we propose a Temporally Efficient
Vision Transformer (TeViT) for video instance segmentation (VIS). Different
from previous transformer-based VIS methods, TeViT is nearly convolution-free,
consisting of a transformer backbone and a query-based video instance
segmentation head. In the backbone stage, we propose a nearly parameter-free
messenger shift mechanism for early temporal context fusion. In the head
stages, we propose a parameter-shared spatiotemporal query interaction
mechanism to build the one-to-one correspondence between video instances and
queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal
context information and obtains strong temporal modeling capacity with
negligible extra computational cost. On three widely adopted VIS benchmarks,
i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains
state-of-the-art results and maintains high inference speed, e.g., 46.6 AP with
68.9 FPS on YouTube-VIS-2019. Code is available at
https://github.com/hustvl/TeViT.
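To make the two mechanisms named in the abstract more concrete, the following is a minimal PyTorch-style sketch. The tensor shapes and the names messenger_shift and SharedQueryInteraction are assumptions made for illustration, not the paper's actual interfaces; the official implementation is in the repository linked above.

```python
import torch
import torch.nn as nn

def messenger_shift(msg_tokens: torch.Tensor, shift: int = 1) -> torch.Tensor:
    """Nearly parameter-free early temporal fusion (illustrative sketch).

    msg_tokens: (T, M, C) messenger tokens, M per frame for a clip of T frames.
    Half of the messengers are rolled forward in time and half backward, so each
    frame's backbone stage sees context from neighbouring frames without adding
    any learned parameters.
    """
    half = msg_tokens.shape[1] // 2
    fused = msg_tokens.clone()
    fused[:, :half] = torch.roll(msg_tokens[:, :half], shifts=shift, dims=0)
    fused[:, half:] = torch.roll(msg_tokens[:, half:], shifts=-shift, dims=0)
    return fused

class SharedQueryInteraction(nn.Module):
    """Parameter-shared spatiotemporal query interaction (illustrative sketch).

    A single set of instance queries attends to every frame's features with the
    same attention weights, so the i-th query corresponds to the same instance
    in all frames (the one-to-one correspondence described in the abstract).
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # queries:     (N, C)      instance queries shared across the clip
        # frame_feats: (T, HW, C)  flattened features of T frames
        T = frame_feats.shape[0]
        q = queries.unsqueeze(0).expand(T, -1, -1)       # reuse identical queries per frame
        out, _ = self.attn(q, frame_feats, frame_feats)  # same parameters for every frame
        return out                                       # (T, N, C) per-frame instance embeddings
```

Because the shift is a plain tensor roll and the query-interaction weights are reused for every frame, the extra temporal modeling cost over a per-frame baseline stays small, consistent with the efficiency claim in the abstract.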
Related papers
- TDViT: Temporal Dilated Video Transformer for Dense Video Tasks [35.16197118579414]
The Temporal Dilated Video Transformer (TDViT) can efficiently extract video representations and effectively alleviate the negative effect of temporal redundancy.
Experiments are conducted on two different dense video benchmarks, i.e., ImageNet VID for video object detection and YouTube VIS for video instance segmentation.
arXiv Detail & Related papers (2024-02-14T15:41:07Z)
- DeVIS: Making Deformable Transformers Work for Video Instance Segmentation [4.3012765978447565]
Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences.
Transformers recently made it possible to cast the entire VIS task as a single set-prediction problem.
Deformable attention provides a more efficient alternative, but its application to the temporal domain or the segmentation task has not yet been explored.
arXiv Detail & Related papers (2022-07-22T14:27:45Z)
- Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer [77.95612004326055]
Video instance segmentation (VIS) approaches typically utilize either single-scale spatio-temporal features or per-frame multi-scale features during attention computation.
We propose a transformer-based VIS framework, named MS-STS VIS, that comprises a novel multi-scale spatio-temporal split (MS-STS) attention module in the encoder.
The MS-STS module effectively captures spatio-temporal feature relationships at multiple scales across frames in a video.
arXiv Detail & Related papers (2022-03-24T17:59:20Z)
- Deformable VisTR: Spatio temporal deformable attention for video instance segmentation [79.76273774737555]
The video instance segmentation (VIS) task requires segmenting, classifying, and tracking object instances over all frames in a clip.
Recently, VisTR has been proposed as an end-to-end transformer-based VIS framework, demonstrating state-of-the-art performance.
We propose Deformable VisTR, leveraging a spatio-temporal deformable attention module that attends only to a small, fixed set of key spatio-temporal sampling points (a rough sketch of this sampling idea appears after this list).
arXiv Detail & Related papers (2022-03-12T02:27:14Z)
- STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation [47.28515170195206]
Video Instance Segmentation (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video.
Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions.
We present a simple and efficient single-stage VIS framework based on the instance segmentation method CondInst.
arXiv Detail & Related papers (2022-02-08T09:34:26Z)
- Self-supervised Video Transformer [46.295395772938214]
From a given video, we create local and global views with varying spatial sizes and frame rates.
Our self-supervised objective seeks to match the features of different views representing the same video, making them invariant to spatiotemporal variations.
Our approach performs well on four action benchmarks and converges faster with small batch sizes.
arXiv Detail & Related papers (2021-12-02T18:59:02Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- Crossover Learning for Fast Online Video Instance Segmentation [53.5613957875507]
We present a novel crossover learning scheme that uses the instance features in the current frame to localize the same instance pixel-wise in other frames.
To our knowledge, the resulting CrossVIS achieves state-of-the-art performance among all online VIS methods and shows a decent trade-off between latency and accuracy.
arXiv Detail & Related papers (2021-04-13T06:47:40Z)
- End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)
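As referenced in the Deformable VisTR entry above, several of these works replace dense attention with deformable attention that samples only a few spatio-temporal points per query. The sketch below is a hedged, simplified reading of that sampling idea; the class name, offset scaling, and tensor shapes are assumptions for illustration, not the released code of Deformable VisTR or DeVIS.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalDeformableSampling(nn.Module):
    """Deformable attention over a small fixed set of sampling points (sketch).

    For each query, a linear layer predicts a handful of (x, y) offsets around a
    reference point plus a weight per point; features are then gathered from every
    frame with bilinear sampling and combined, instead of attending densely to all
    spatio-temporal locations.
    """
    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Linear(dim, 2 * num_points)  # (x, y) offset per sampling point
        self.weights = nn.Linear(dim, num_points)      # attention weight per sampling point

    def forward(self, queries: torch.Tensor, frame_feats: torch.Tensor,
                ref_points: torch.Tensor) -> torch.Tensor:
        # queries:     (N, C)        query embeddings
        # frame_feats: (T, C, H, W)  per-frame feature maps
        # ref_points:  (N, 2)        reference locations in [-1, 1] grid coordinates
        T = frame_feats.shape[0]
        N, _ = queries.shape
        offsets = self.offsets(queries).view(N, self.num_points, 2)
        weights = self.weights(queries).softmax(dim=-1)                  # (N, P)
        grid = (ref_points[:, None, :] + 0.05 * offsets).clamp(-1, 1)    # (N, P, 2)
        grid = grid.unsqueeze(0).expand(T, -1, -1, -1)                   # shared across frames
        sampled = F.grid_sample(frame_feats, grid, align_corners=False)  # (T, C, N, P)
        out = (sampled * weights[None, None]).sum(dim=-1)                # (T, C, N)
        return out.permute(0, 2, 1)                                      # (T, N, C)
```

The per-query cost scales with the number of sampling points rather than with T x H x W, which is what makes this style of attention attractive for video-length inputs.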