End-to-End Video Instance Segmentation with Transformers
- URL: http://arxiv.org/abs/2011.14503v4
- Date: Sun, 25 Apr 2021 09:43:28 GMT
- Title: End-to-End Video Instance Segmentation with Transformers
- Authors: Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng,
Hao Shen, Huaxia Xia
- Abstract summary: Video instance segmentation (VIS) is the task of simultaneously classifying, segmenting, and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
- Score: 84.17794705045333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video instance segmentation (VIS) is the task of simultaneously
classifying, segmenting, and tracking object instances of interest in video.
Recent methods typically develop sophisticated pipelines to tackle this task.
Here, we propose a new video instance segmentation framework built upon
Transformers, termed VisTR, which views the VIS task as a direct end-to-end
parallel sequence decoding/prediction problem. Given a video clip consisting of
multiple image frames as input, VisTR directly outputs, in order, the sequence
of masks for each instance in the video. At its core is a new, effective
instance sequence matching and segmentation strategy, which supervises and
segments instances at the sequence level as a whole. VisTR frames instance
segmentation and tracking under the same perspective of similarity learning,
which considerably simplifies the overall pipeline and differs significantly
from existing approaches. Without bells and whistles, VisTR achieves the
highest speed among all existing VIS models, and achieves the best result among
single-model methods on the YouTube-VIS dataset. For the first time, we
demonstrate a much simpler and faster video instance segmentation framework
built upon Transformers, achieving competitive accuracy. We hope that VisTR can
motivate future research on more video understanding tasks.
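To make the sequence-level supervision concrete, below is a minimal sketch of bipartite matching between predicted instance sequences and ground-truth sequences, in the spirit of VisTR's instance sequence matching. The cost terms, their equal weighting, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Sketch of sequence-level bipartite matching (VisTR-style idea).
# Cost terms, weights, and shapes are assumptions for illustration.
import torch
from scipy.optimize import linear_sum_assignment

def match_instance_sequences(pred_logits, pred_masks, gt_labels, gt_masks):
    """Match predicted instance sequences to ground-truth sequences.

    pred_logits: (num_queries, num_classes) class scores per instance query
    pred_masks:  (num_queries, T, H, W) mask logits over all T frames
    gt_labels:   (num_gt,) ground-truth class indices
    gt_masks:    (num_gt, T, H, W) ground-truth masks
    """
    prob = pred_logits.softmax(-1)            # (num_queries, num_classes)
    cost_class = -prob[:, gt_labels]          # (num_queries, num_gt)

    # Sequence-level mask cost: mean per-pixel L1 over the whole clip,
    # so each instance is matched as a sequence rather than per frame.
    pm = pred_masks.sigmoid().flatten(1)      # (num_queries, T*H*W)
    gm = gt_masks.flatten(1).float()          # (num_gt, T*H*W)
    cost_mask = torch.cdist(pm, gm, p=1) / pm.shape[1]

    cost = cost_class + cost_mask             # assumed equal weighting
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(rows.tolist(), cols.tolist()))
```

Because the mask cost is summed over all T frames before matching, a query is assigned to a ground-truth instance only if it tracks that instance well across the whole clip, which is what lets one assignment supervise classification, segmentation, and tracking together.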
Related papers
- Robust Online Video Instance Segmentation with Track Queries [15.834703258232002]
We propose a fully online transformer-based video instance segmentation model that performs comparably to top offline methods on the YouTube-VIS 2019 benchmark.
We show that, when combined with a strong enough image segmentation architecture, track queries can exhibit impressive accuracy while not being constrained to short videos.
arXiv Detail & Related papers (2022-11-16T18:50:14Z) - DeVIS: Making Deformable Transformers Work for Video Instance
Segmentation [4.3012765978447565]
Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences.
Transformers recently made it possible to cast the entire VIS task as a single set-prediction problem.
Deformable attention provides a more efficient alternative, but its application to the temporal domain or the segmentation task has not yet been explored.
arXiv Detail & Related papers (2022-07-22T14:27:45Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple, end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the lowest run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video Instance Segmentation aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z) - Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z) - Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm -- Propose-Reduce, to generate complete sequences for input videos by a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z) - Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance in both speed and accuracy on the DAVIS benchmark without complicated bells and whistles, with a speed of 0.14 seconds per frame and a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)