VITA: Video Instance Segmentation via Object Token Association
- URL: http://arxiv.org/abs/2206.04403v1
- Date: Thu, 9 Jun 2022 10:33:18 GMT
- Title: VITA: Video Instance Segmentation via Object Token Association
- Authors: Miran Heo, Sukjun Hwang, Seoung Wug Oh, Joon-Young Lee, Seon Joo Kim
- Abstract summary: VITA is a simple structure built on top of an off-the-shelf Transformer-based image instance segmentation model.
It accomplishes video-level understanding by associating frame-level object tokens without using spatio-temporal backbone features.
VITA achieves the state-of-the-art on VIS benchmarks with a ResNet-50 backbone: 49.8 AP and 45.7 AP on YouTube-VIS 2019 & 2021, and 19.6 AP on OVIS.
- Score: 56.17453513956142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a novel paradigm for offline Video Instance Segmentation (VIS),
based on the hypothesis that explicit object-oriented information can be a
strong clue for understanding the context of the entire sequence. To this end,
we propose VITA, a simple structure built on top of an off-the-shelf
Transformer-based image instance segmentation model. Specifically, we use an
image object detector as a means of distilling object-specific contexts into
object tokens. VITA accomplishes video-level understanding by associating
frame-level object tokens without using spatio-temporal backbone features. By
effectively building relationships between objects using the condensed
information, VITA achieves the state-of-the-art on VIS benchmarks with a
ResNet-50 backbone: 49.8 AP, 45.7 AP on YouTube-VIS 2019 & 2021 and 19.6 AP on
OVIS. Moreover, thanks to its object token-based structure that is disjoint
from the backbone features, VITA shows several practical advantages that
previous offline VIS methods have not explored - handling long and
high-resolution videos with a common GPU and freezing a frame-level detector
trained on image domain. Code will be made available at
https://github.com/sukjunhwang/VITA.
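The token-association design described above lends itself to a compact sketch. The following is a minimal, hypothetical PyTorch illustration of the idea, not the authors' implementation: a frozen frame-level detector is assumed to emit a fixed number of object tokens per frame, and a small video-level transformer builds relationships among the condensed tokens and decodes clip-level instance tokens. All module names, dimensions, and the encoder/decoder arrangement here are assumptions.

```python
import torch
import torch.nn as nn

class ObjectTokenAssociator(nn.Module):
    """Hypothetical video-level module that sees only object tokens,
    never spatio-temporal backbone features (dimensions are assumed)."""
    def __init__(self, dim=256, num_video_queries=100, depth=3):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
        dec_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=depth)
        self.video_queries = nn.Embedding(num_video_queries, dim)

    def forward(self, frame_tokens):
        # frame_tokens: (T, N, C) -- N object tokens from each of T frames,
        # produced by a frozen image-level detector.
        T, N, C = frame_tokens.shape
        tokens = frame_tokens.reshape(1, T * N, C)    # one flat token set
        tokens = self.encoder(tokens)                 # relate objects across frames
        queries = self.video_queries.weight.unsqueeze(0)
        return self.decoder(queries, tokens)          # (1, num_video_queries, C)

# Usage: e.g. 100 tokens per frame over a 36-frame clip.
video_tokens = ObjectTokenAssociator()(torch.randn(36, 100, 256))
print(video_tokens.shape)  # torch.Size([1, 100, 256])
```

Because such a video-level module consumes only a few hundred condensed tokens per clip rather than dense feature maps, its memory footprint grows slowly with clip length, which is consistent with the abstract's claims about handling long, high-resolution videos on a common GPU with a frozen frame-level detector.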
Related papers
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying which relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z)
- UVIS: Unsupervised Video Instance Segmentation [65.46196594721545]
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames.
We propose UVIS, a novel Unsupervised Video Instance Segmentation framework that can perform video instance segmentation without any video annotations or dense label-based pretraining.
Our framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking.
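As a rough illustration of the query-based tracking step, the sketch below links instance query embeddings across consecutive frames by bipartite matching on cosine similarity; this is an assumption based on the one-line description above, not the authors' code, and the function name and similarity-based cost are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_queries(prev_queries, curr_queries):
    """Hypothetical tracker step: match (N, C) instance query embeddings
    between two frames; returns (prev_idx, curr_idx) track pairs."""
    a = prev_queries / np.linalg.norm(prev_queries, axis=1, keepdims=True)
    b = curr_queries / np.linalg.norm(curr_queries, axis=1, keepdims=True)
    cost = -a @ b.T                           # negative cosine similarity
    rows, cols = linear_sum_assignment(cost)  # Hungarian matching
    return list(zip(rows.tolist(), cols.tolist()))

# Usage: 10 query embeddings of dimension 256 per frame.
matches = link_queries(np.random.randn(10, 256), np.random.randn(10, 256))
```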
arXiv Detail & Related papers (2024-06-11T03:05:50Z)
- DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z)
- Online Video Instance Segmentation via Robust Context Fusion [36.376900904288966]
Video instance segmentation (VIS) aims at classifying, segmenting and tracking object instances in video sequences.
Recent transformer-based neural networks have demonstrated powerful modeling capabilities for the VIS task.
We propose a robust context fusion network that tackles VIS in an online fashion, predicting instance segmentation frame by frame using a few preceding frames.
arXiv Detail & Related papers (2022-07-12T15:04:50Z)
- Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens [93.98605636451806]
StructureViT shows how utilizing the structure of a small number of images, available only during training, can improve a video model.
SViT shows strong performance improvements on multiple video understanding tasks and datasets.
arXiv Detail & Related papers (2022-06-13T17:45:05Z)
- Temporally Efficient Vision Transformer for Video Instance Segmentation [40.32376033054237]
We propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS).
TeViT is nearly convolution-free, which contains a transformer backbone and a query-based video instance segmentation head.
On three widely adopted VIS benchmarks, TeViT obtains state-of-the-art results and maintains high inference speed.
arXiv Detail & Related papers (2022-04-18T17:09:20Z)
- HODOR: High-level Object Descriptors for Object Re-segmentation in Video Learned from Static Images [123.65233334380251]
We propose HODOR: a novel method that effectively leverages annotated static images for understanding object appearance and scene context.
As a result, HODOR achieves state-of-the-art performance on the DAVIS and YouTube-VOS benchmarks.
Without any architectural modification, HODOR can also learn from video context around single annotated video frames.
arXiv Detail & Related papers (2021-12-16T18:59:53Z)
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
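A minimal sketch of what such a tracklet-language grounding module could look like follows; it is an assumption based on the description above, not the paper's code, and all names and dimensions are hypothetical. Self-attention among tracklet features models instance-level visual relations, while cross-attention to the sentence tokens models cross-modal interaction.

```python
import torch
import torch.nn as nn

class TrackletGrounder(nn.Module):
    """Hypothetical grounding head: cross-attend K tracklet features to
    the referring sentence's token embeddings and score each tracklet."""
    def __init__(self, dim=256, depth=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.score = nn.Linear(dim, 1)

    def forward(self, tracklet_feats, text_feats):
        # tracklet_feats: (1, K, C); text_feats: (1, L, C)
        fused = self.decoder(tracklet_feats, text_feats)  # cross-modal attention
        return self.score(fused).squeeze(-1)              # (1, K) scores

# Usage: pick the tracklet best matching a 12-token referring expression.
scores = TrackletGrounder()(torch.randn(1, 5, 256), torch.randn(1, 12, 256))
best = scores.argmax(dim=-1)
```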
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
- Occluded Video Instance Segmentation [133.80567761430584]
We collect a large scale dataset called OVIS for occluded video instance segmentation.
OVIS consists of 296k high-quality instance masks from 25 semantic categories.
The highest AP achieved by state-of-the-art algorithms is only 14.4, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario.
arXiv Detail & Related papers (2021-02-02T15:35:43Z)