Two-Level Temporal Relation Model for Online Video Instance Segmentation
- URL: http://arxiv.org/abs/2210.16795v1
- Date: Sun, 30 Oct 2022 10:01:01 GMT
- Title: Two-Level Temporal Relation Model for Online Video Instance Segmentation
- Authors: Çağan Selim Çoban, Oğuzhan Keskin, Jordi Pont-Tuset, Fatma Güney
- Abstract summary: We propose an online method that is on par with the performance of the offline counterparts.
We introduce a message-passing graph neural network that encodes objects and relates them through time.
Our model, trained end-to-end, achieves state-of-the-art performance on the YouTube-VIS dataset.
- Score: 3.9349485816629888
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Video Instance Segmentation (VIS), current approaches either focus on the
quality of the results, by taking the whole video as input and processing it
offline; or on speed, by handling it frame by frame at the cost of competitive
performance. In this work, we propose an online method that is on par with the
performance of the offline counterparts. We introduce a message-passing graph
neural network that encodes objects and relates them through time. We
additionally propose a novel module to fuse features from the feature pyramid
network with residual connections. Our model, trained end-to-end, achieves
state-of-the-art performance on the YouTube-VIS dataset within the online
methods. Further experiments on DAVIS demonstrate the generalization capability
of our model to the video object segmentation task. Code is available at:
\url{https://github.com/caganselim/TLTM}
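The core mechanism in the abstract, a graph neural network that relates object instances through time via message passing, can be illustrated with a minimal, framework-free sketch. All names below are hypothetical; the actual TLTM model learns these updates end-to-end, whereas this uses fixed mean aggregation and a residual-style mix-in.

```python
# Hypothetical sketch of one message-passing round over object instances
# detected in two consecutive frames. Each instance is a feature vector
# (a graph node); edges connect every previous-frame node to every
# current-frame node.

def message_passing_step(prev_nodes, curr_nodes):
    """Each current-frame node aggregates (mean) the embeddings of all
    previous-frame nodes and mixes them into its own embedding."""
    updated = []
    for node in curr_nodes:
        # Aggregate messages from every node in the previous frame.
        messages = [sum(p[i] for p in prev_nodes) / len(prev_nodes)
                    for i in range(len(node))]
        # Residual-style update: keep the node's own features and add
        # the aggregated temporal context.
        updated.append([n + 0.5 * m for n, m in zip(node, messages)])
    return updated

# Two frames, two object embeddings each (2-d for readability).
frame_t0 = [[1.0, 0.0], [0.0, 1.0]]
frame_t1 = [[1.0, 1.0], [2.0, 0.0]]
print(message_passing_step(frame_t0, frame_t1))  # → [[1.25, 1.25], [2.25, 0.25]]
```

In the paper's setting, the aggregation and update functions are learned layers and the resulting node embeddings drive instance association across frames.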
Related papers
- TCOVIS: Temporally Consistent Online Video Instance Segmentation [98.29026693059444]
We propose a novel online method for video instance segmentation called TCOVIS.
The core of our method consists of a global instance assignment strategy and a video-temporal enhancement module.
We evaluate our method on four VIS benchmarks and achieve state-of-the-art performance on all benchmarks without bells-and-whistles.
arXiv Detail & Related papers (2023-09-21T07:59:15Z)
- Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA)
It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z)
- OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation [75.07460026246582]
Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction.
Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with text embedding.
We propose a simple yet effective online model using explicit query propagation, named OnlineRefer.
arXiv Detail & Related papers (2023-07-18T15:43:35Z)
- Robust Online Video Instance Segmentation with Track Queries [15.834703258232002]
We propose a fully online transformer-based video instance segmentation model that performs comparably to top offline methods on the YouTube-VIS 2019 benchmark.
We show that, when combined with a strong enough image segmentation architecture, track queries can exhibit impressive accuracy while not being constrained to short videos.
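The track-query idea above can be caricatured as online association: each track carries an embedding, and new-frame detections attach to the most similar track. This is only an illustrative sketch; the paper's actual method propagates learned transformer queries, and all names here are hypothetical.

```python
# Hypothetical sketch of online track association by embedding
# similarity: detections are greedily matched to the most similar
# track (cosine similarity); unmatched detections start new tracks.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def associate(tracks, detections, threshold=0.5):
    """Return det_index -> track_id; updates each matched track's embedding."""
    assignment = {}
    used = set()
    for di, det in enumerate(detections):
        best_id, best_sim = None, threshold
        for tid, emb in tracks.items():
            if tid in used:
                continue
            sim = cosine(det, emb)
            if sim > best_sim:
                best_id, best_sim = tid, sim
        if best_id is None:
            best_id = max(tracks, default=-1) + 1  # start a new track
        used.add(best_id)
        assignment[di] = best_id
        tracks[best_id] = det  # propagate: update the track's embedding
    return assignment

tracks = {0: [1.0, 0.0], 1: [0.0, 1.0]}
dets = [[0.9, 0.1], [0.2, 0.8]]
print(associate(tracks, dets))  # → {0: 0, 1: 1}
```

Greedy matching is the simplest choice; transformer-based trackers instead let queries attend to new-frame features, which handles occlusion and appearance change far more robustly.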
arXiv Detail & Related papers (2022-11-16T18:50:14Z)
- Online Video Instance Segmentation via Robust Context Fusion [36.376900904288966]
Video instance segmentation (VIS) aims at classifying, segmenting and tracking object instances in video sequences.
Recent transformer-based neural networks have demonstrated powerful modeling capability for the VIS task.
We propose a robust context fusion network to tackle VIS in an online fashion, which predicts instance segmentation frame-by-frame with a few preceding frames.
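The online setup described above, predicting each frame from a few preceding frames, can be sketched as a sliding window of cached features. The paper's fusion network is learned; the fixed exponential weighting below is a hypothetical stand-in that only illustrates the frame-by-frame regime.

```python
# Hypothetical sketch of fusing a short window of preceding-frame
# features with the current frame, weighting recent frames more heavily.
from collections import deque

class ContextFusion:
    def __init__(self, window=3, decay=0.5):
        self.buffer = deque(maxlen=window)  # features of preceding frames
        self.decay = decay

    def fuse(self, curr):
        feats = list(self.buffer) + [curr]
        # Older frames get exponentially smaller weights.
        weights = [self.decay ** (len(feats) - 1 - i) for i in range(len(feats))]
        total = sum(weights)
        fused = [sum(w * f[d] for w, f in zip(weights, feats)) / total
                 for d in range(len(curr))]
        self.buffer.append(curr)
        return fused

cf = ContextFusion(window=2)
print(cf.fuse([1.0, 0.0]))  # first frame: no context to fuse yet
print(cf.fuse([0.0, 1.0]))  # second frame: blended with the first
```

The bounded `deque` is what keeps the method online: memory and latency stay constant regardless of video length, unlike clip-level methods that must buffer the whole sequence.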
arXiv Detail & Related papers (2022-07-12T15:04:50Z)
- End-to-end video instance segmentation via spatial-temporal graph neural networks [30.748756362692184]
Video instance segmentation is a challenging task that extends image instance segmentation to the video domain.
Existing methods either rely only on single-frame information for the detection and segmentation subproblems or handle tracking as a separate post-processing step.
We propose a novel graph-neural-network (GNN) based method to handle the aforementioned limitation.
arXiv Detail & Related papers (2022-03-07T05:38:08Z)
- Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video Instance Segmentation (VIS) aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
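A temporal consistency loss of the kind mentioned above can be written in a few lines: penalize disagreement between segmentation probabilities of consecutive frames. Real methods typically first warp the previous frame's prediction with optical flow before comparing; that warping is omitted in this hypothetical sketch.

```python
# Hypothetical sketch of a temporal consistency loss: the mean squared
# per-pixel difference between probability maps of consecutive frames.

def temporal_consistency_loss(prev_probs, curr_probs):
    """Mean squared difference between two per-pixel probability maps."""
    n = len(prev_probs)
    return sum((p - c) ** 2 for p, c in zip(prev_probs, curr_probs)) / n

# Identical predictions incur zero loss; divergent ones are penalized.
print(temporal_consistency_loss([0.9, 0.1], [0.9, 0.1]))  # → 0.0
print(temporal_consistency_loss([1.0, 0.0], [0.0, 1.0]))  # → 1.0
```

Because the loss is applied only during training, per-frame inference stays unchanged, which is exactly how the paper avoids adding inference overhead.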
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- 1st Place Solution for YouTubeVOS Challenge 2021: Video Instance Segmentation [0.39146761527401414]
Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously.
We propose two modules, named Temporally Correlated Instance (TCIS) and Bidirectional Tracking (BiTrack)
By combining these techniques with a bag of tricks, the network performance is significantly boosted compared to the baseline.
arXiv Detail & Related papers (2021-06-12T00:20:38Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles: 0.14 seconds per frame and a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.