DeVIS: Making Deformable Transformers Work for Video Instance
Segmentation
- URL: http://arxiv.org/abs/2207.11103v1
- Date: Fri, 22 Jul 2022 14:27:45 GMT
- Title: DeVIS: Making Deformable Transformers Work for Video Instance
Segmentation
- Authors: Adrià Caelles and Tim Meinhardt and Guillem Brasó and Laura
Leal-Taixé
- Abstract summary: Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences.
Transformers recently made it possible to cast the entire VIS task as a single set-prediction problem.
Deformable attention provides a more efficient alternative, but its application to the temporal domain or the segmentation task has not yet been explored.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Instance Segmentation (VIS) jointly tackles multi-object detection,
tracking, and segmentation in video sequences. In the past, VIS methods
mirrored the fragmentation of these subtasks in their architectural design,
hence missing out on a joint solution. Transformers recently made it possible
to cast the entire VIS task as a single set-prediction problem. Nevertheless,
the quadratic complexity of existing Transformer-based methods requires long
training times, high memory requirements, and processing of low-resolution,
single-scale feature maps. Deformable attention provides a more efficient
alternative, but its application to the temporal domain or the segmentation
task has not yet been explored.
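To make the complexity contrast concrete, here is a minimal, hedged sketch of single-head, single-scale deformable attention in the spirit of Deformable DETR: each query attends to only K sampled points instead of all H*W keys, so the cost scales with the number of queries and sampling points rather than quadratically with the feature-map size. All names, shapes, and hyperparameters below are illustrative assumptions, not the DeVIS implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttnSketch(nn.Module):
    """Single-head, single-scale deformable attention (illustrative only)."""

    def __init__(self, dim: int = 256, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        self.offsets = nn.Linear(dim, n_points * 2)  # K 2-D offsets per query
        self.weights = nn.Linear(dim, n_points)      # K attention weights per query
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat):
        # queries:    (B, Q, C)    query embeddings
        # ref_points: (B, Q, 2)    reference points in [0, 1], (x, y)
        # feat:       (B, C, H, W) a single feature map acting as "values"
        B, Q, _ = queries.shape
        H, W = feat.shape[-2:]
        value = self.value_proj(feat.flatten(2).transpose(1, 2))      # (B, H*W, C)
        value = value.transpose(1, 2).reshape(B, -1, H, W)            # (B, C, H, W)

        offsets = self.offsets(queries).view(B, Q, self.n_points, 2)
        attn = self.weights(queries).softmax(dim=-1)                  # (B, Q, K)

        # Sample K locations per query around its reference point;
        # grid_sample expects coordinates in [-1, 1].
        loc = (ref_points[:, :, None, :] + offsets).clamp(0.0, 1.0)
        grid = 2.0 * loc - 1.0                                        # (B, Q, K, 2)
        sampled = F.grid_sample(value, grid, align_corners=False)     # (B, C, Q, K)

        out = (sampled * attn[:, None]).sum(dim=-1).transpose(1, 2)   # (B, Q, C)
        return self.out_proj(out)
```

Attending to a handful of sampled points per query is what makes multi-scale feature maps affordable; full self-attention over the same maps would be prohibitively expensive.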
In this work, we present Deformable VIS (DeVIS), a VIS method which
capitalizes on the efficiency and performance of deformable Transformers. To
reason about all VIS subtasks jointly over multiple frames, we present temporal
multi-scale deformable attention with instance-aware object queries. We further
introduce a new image and video instance mask head with multi-scale features,
and perform near-online video processing with multi-cue clip tracking. DeVIS
reduces memory as well as training time requirements, and achieves
state-of-the-art results on the YouTube-VIS 2021 benchmark as well as on the
challenging OVIS dataset.
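The abstract names temporal multi-scale deformable attention without spelling out its formulation; the core idea, letting a query sample a few points from every frame of the clip and aggregate them, can be sketched as below. The shapes, the joint weight normalization, and the function name are assumptions for illustration, not the DeVIS code.

```python
import torch.nn.functional as F

def temporal_deformable_sample(values, grids, attn):
    # values: (T, B, C, H, W)  per-frame feature maps of a clip
    # grids:  (T, B, Q, K, 2)  per-frame sampling locations in [-1, 1]
    # attn:   (T, B, Q, K)     weights, assumed normalized jointly over T * K
    out = 0.0
    for t in range(values.shape[0]):
        # Each query samples K points in frame t and accumulates the result.
        s = F.grid_sample(values[t], grids[t], align_corners=False)  # (B, C, Q, K)
        out = out + (s * attn[t][:, None]).sum(dim=-1)               # (B, C, Q)
    return out.transpose(1, 2)                                       # (B, Q, C)
```

Under this scheme each query touches T * K sampled locations instead of the full T * H * W spatio-temporal key set, which is what keeps clip-level attention tractable.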
Code is available at https://github.com/acaelles97/DeVIS.
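"Multi-cue clip tracking" is likewise only named in the abstract. A common way to realize near-online tracking over overlapping clips is to fuse several affinity cues, e.g. mask IoU and query-embedding similarity, into a single cost matrix and solve a bipartite assignment; the sketch below shows that generic scheme with assumed cue names and weights, not the DeVIS tracker.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clips(mask_iou, query_sim, w_iou=0.5, w_sim=0.5):
    # mask_iou:  (N, M) mask IoU between instances of two overlapping clips
    # query_sim: (N, M) cosine similarity of their query embeddings
    # Higher fused affinity = better match; the Hungarian solver minimizes,
    # so the affinity is negated to form a cost matrix.
    cost = -(w_iou * mask_iou + w_sim * query_sim)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy example: two instances per clip, cleanly matched by both cues.
iou = np.array([[0.8, 0.1], [0.2, 0.7]])
sim = np.array([[0.9, 0.3], [0.1, 0.6]])
print(match_clips(iou, sim))  # [(0, 0), (1, 1)]
```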
Related papers
- UVIS: Unsupervised Video Instance Segmentation [65.46196594721545]
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames.
We propose UVIS, a novel Unsupervised Video Instance Segmentation framework that can perform video instance segmentation without any video annotations or dense label-based pretraining.
Our framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking.
arXiv Detail & Related papers (2024-06-11T03:05:50Z)
- Robust Online Video Instance Segmentation with Track Queries [15.834703258232002]
We propose a fully online transformer-based video instance segmentation model that performs comparably to top offline methods on the YouTube-VIS 2019 benchmark.
We show that, when combined with a strong enough image segmentation architecture, track queries can exhibit impressive accuracy while not being constrained to short videos.
arXiv Detail & Related papers (2022-11-16T18:50:14Z)
- MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training [84.81566912372328]
MinVIS is a minimal video instance segmentation framework.
It achieves state-of-the-art VIS performance with neither video-based architectures nor training procedures.
arXiv Detail & Related papers (2022-08-03T17:50:42Z)
- Temporally Efficient Vision Transformer for Video Instance Segmentation [40.32376033054237]
We propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS).
TeViT is nearly convolution-free, which contains a transformer backbone and a query-based video instance segmentation head.
On three widely adopted VIS benchmarks, TeViT obtains state-of-the-art results and maintains high inference speed.
arXiv Detail & Related papers (2022-04-18T17:09:20Z)
- Deformable VisTR: Spatio temporal deformable attention for video instance segmentation [79.76273774737555]
The video instance segmentation (VIS) task requires segmenting, classifying, and tracking object instances over all frames in a clip.
Recently, VisTR was proposed as an end-to-end transformer-based VIS framework, demonstrating state-of-the-art performance.
We propose Deformable VisTR, leveraging a spatio-temporal deformable attention module that attends only to a small fixed set of key spatio-temporal sampling points.
arXiv Detail & Related papers (2022-03-12T02:27:14Z)
- Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video Instance Segmentation aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z)
- 1st Place Solution for YouTubeVOS Challenge 2021: Video Instance Segmentation [0.39146761527401414]
Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously.
We propose two modules, named Temporally Correlated Instance Segmentation (TCIS) and Bidirectional Tracking (BiTrack).
By combining these techniques with a bag of tricks, the network performance is significantly boosted compared to the baseline.
arXiv Detail & Related papers (2021-06-12T00:20:38Z)
- Crossover Learning for Fast Online Video Instance Segmentation [53.5613957875507]
We present a novel crossover learning scheme that uses the instance feature in the current frame to localize the same instance pixel-wise in other frames.
To our knowledge, CrossVIS achieves state-of-the-art performance among all online VIS methods and shows a decent trade-off between latency and accuracy.
arXiv Detail & Related papers (2021-04-13T06:47:40Z)
- End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)