TCOVIS: Temporally Consistent Online Video Instance Segmentation
- URL: http://arxiv.org/abs/2309.11857v1
- Date: Thu, 21 Sep 2023 07:59:15 GMT
- Title: TCOVIS: Temporally Consistent Online Video Instance Segmentation
- Authors: Junlong Li, Bingyao Yu, Yongming Rao, Jie Zhou, Jiwen Lu
- Abstract summary: We propose a novel online method for video instance segmentation called TCOVIS.
The core of our method consists of a global instance assignment strategy and a spatio-temporal enhancement module.
We evaluate our method on four VIS benchmarks and achieve state-of-the-art performance on all benchmarks without bells-and-whistles.
- Score: 98.29026693059444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, significant progress has been made in video instance
segmentation (VIS), with many offline and online methods achieving
state-of-the-art performance. While offline methods have the advantage of
producing temporally consistent predictions, they are not suitable for
real-time scenarios. Conversely, online methods are more practical, but
maintaining temporal consistency remains a challenging task. In this paper, we
propose a novel online method for video instance segmentation, called TCOVIS,
which fully exploits the temporal information in a video clip. The core of our
method consists of a global instance assignment strategy and a spatio-temporal
enhancement module, which improve the temporal consistency of the features from
two aspects. Specifically, we perform global optimal matching between the
predictions and ground truth across the whole video clip, and supervise the
model with the global optimal objective. We also capture the spatial feature
and aggregate it with the semantic feature between frames, thus realizing the
spatio-temporal enhancement. We evaluate our method on four widely adopted VIS
benchmarks, namely YouTube-VIS 2019/2021/2022 and OVIS, and achieve
state-of-the-art performance on all benchmarks without bells-and-whistles. For
instance, on YouTube-VIS 2021, TCOVIS achieves 49.5 AP and 61.3 AP with
ResNet-50 and Swin-L backbones, respectively. Code is available at
https://github.com/jun-long-li/TCOVIS.
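The global instance assignment described in the abstract can be illustrated with a small sketch. This is a hedged reading of the abstract, not the authors' released code: per-frame matching costs between predictions and ground-truth instances are assumed to be averaged over the whole clip, and a single one-to-one assignment is then solved, so each prediction is supervised by the same instance in every frame. The `frame_costs` layout and the brute-force solver (standing in for the Hungarian algorithm commonly used for such matching) are illustrative assumptions.

```python
# Illustrative sketch of clip-level (global) instance assignment.
# Assumption: frame_costs[t][i][j] is the matching cost between prediction i
# and ground-truth instance j in frame t (lower is better), with the same
# N predictions and M ground-truth instances in every frame.
from itertools import permutations

def global_assignment(frame_costs):
    """Average per-frame costs over the clip, then solve one optimal
    one-to-one assignment of ground-truth instances to predictions.
    Returns (match, cost), where match[j] is the prediction index
    assigned to ground-truth instance j."""
    T = len(frame_costs)
    N, M = len(frame_costs[0]), len(frame_costs[0][0])
    clip_cost = [[sum(frame_costs[t][i][j] for t in range(T)) / T
                  for j in range(M)] for i in range(N)]
    best_match, best_cost = None, float("inf")
    # Brute force over all injective assignments; fine for tiny N and M.
    for perm in permutations(range(N), M):
        cost = sum(clip_cost[perm[j]][j] for j in range(M))
        if cost < best_cost:
            best_match, best_cost = list(perm), cost
    return best_match, best_cost

# Toy clip: 2 frames, 3 predictions, 2 ground-truth instances.
frame_costs = [
    [[0.9, 0.1],   # frame 0: costs of prediction 0 vs GT 0 and GT 1
     [0.2, 0.8],
     [0.5, 0.5]],
    [[0.8, 0.3],   # frame 1
     [0.1, 0.9],
     [0.6, 0.4]],
]
match, cost = global_assignment(frame_costs)
print(match)  # [1, 0]: GT 0 matches prediction 1, GT 1 matches prediction 0
```

Because the assignment is computed once from costs pooled over all frames, a prediction cannot switch which ground-truth instance it is matched to mid-clip, which is the temporal-consistency benefit the abstract attributes to the global optimal objective.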
Related papers
- NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation [22.200700685751826]
The Video Instance Segmentation (VIS) community has operated under the common belief that offline methods are generally superior to frame-by-frame online processing.
We present a detailed analysis of different processing paradigms and a new end-to-end Video Instance Segmentation method.
Our NOVIS represents the first near-online VIS approach that avoids any handcrafted tracking heuristics.
arXiv Detail & Related papers (2023-08-29T12:51:04Z) - OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation [75.07460026246582]
Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction.
Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with text embedding.
We propose a simple yet effective online model using explicit query propagation, named OnlineRefer.
arXiv Detail & Related papers (2023-07-18T15:43:35Z) - DVIS: Decoupled Video Instance Segmentation Framework [15.571072365208872]
Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing.
Existing methods often underperform on complex and long real-world videos, primarily due to two factors.
We propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement.
arXiv Detail & Related papers (2023-06-06T05:24:15Z) - A Generalized Framework for Video Instance Segmentation [49.41441806931224]
The handling of long videos with complex and occluded sequences has emerged as a new challenge in the video instance segmentation (VIS) community.
We propose a Generalized framework for VIS, namely GenVIS, that achieves state-of-the-art performance on challenging benchmarks.
We evaluate our approach on popular VIS benchmarks, achieving state-of-the-art results on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS)
arXiv Detail & Related papers (2022-11-16T11:17:19Z) - Two-Level Temporal Relation Model for Online Video Instance Segmentation [3.9349485816629888]
We propose an online method whose performance is on par with that of its offline counterparts.
We introduce a message-passing graph neural network that encodes objects and relates them through time.
Our model is trained end-to-end and achieves state-of-the-art performance on the YouTube-VIS dataset.
arXiv Detail & Related papers (2022-10-30T10:01:01Z) - In Defense of Online Models for Video Instance Segmentation [70.16915119724757]
We propose an online framework based on contrastive learning that is able to learn more discriminative instance embeddings for association.
Despite its simplicity, our method outperforms all online and offline methods on three benchmarks.
The proposed method won first place in the video instance segmentation track of the 4th Large-scale Video Object Segmentation Challenge.
arXiv Detail & Related papers (2022-07-21T17:56:54Z) - Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video Instance Segmentation (VIS) aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z) - STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation [47.28515170195206]
Video Instance Segmentation (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video.
Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions.
We present a simple and efficient single-stage VIS framework based on the instance segmentation method CondInst.
arXiv Detail & Related papers (2022-02-08T09:34:26Z) - Crossover Learning for Fast Online Video Instance Segmentation [53.5613957875507]
We present a novel crossover learning scheme that uses the instance feature in the current frame to pixel-wisely localize the same instance in other frames.
To our knowledge, CrossVIS achieves state-of-the-art performance among all online VIS methods and shows a decent trade-off between latency and accuracy.
arXiv Detail & Related papers (2021-04-13T06:47:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.