Learning to Associate Every Segment for Video Panoptic Segmentation
- URL: http://arxiv.org/abs/2106.09453v1
- Date: Thu, 17 Jun 2021 13:06:24 GMT
- Title: Learning to Associate Every Segment for Video Panoptic Segmentation
- Authors: Sanghyun Woo, Dahun Kim, Joon-Young Lee, In So Kweon
- Abstract summary: We learn coarse segment-level matching and fine pixel-level matching together.
We show that our per-frame inference model achieves new state-of-the-art results on the Cityscapes-VPS and VIPER datasets.
- Score: 123.03617367709303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal correspondence - linking pixels or objects across frames - is a
fundamental supervisory signal for video models. For the panoptic
understanding of dynamic scenes, we further extend this concept to every
segment. Specifically, we aim to learn coarse segment-level matching and fine
pixel-level matching together. We implement this idea by designing two novel
learning objectives. To validate our proposals, we adopt a deep siamese model
and train the model to learn the temporal correspondence on two different
levels (i.e., segment and pixel) along with the target task. At inference time,
the model processes each frame independently, without any extra computation or
post-processing. We show that our per-frame inference model achieves new
state-of-the-art results on the Cityscapes-VPS and VIPER datasets. Moreover,
thanks to its high efficiency, the model runs roughly 3x faster than the
previous state-of-the-art approach.
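The abstract names two correspondence objectives (coarse segment-level and fine pixel-level) but does not spell out their form. Below is a minimal PyTorch sketch assuming an InfoNCE-style contrastive formulation at both levels; the function names, the mask-pooling scheme, and the way positive pairs are supplied are all assumptions for illustration, not the paper's actual objectives.

```python
# Hypothetical sketch: dual-level temporal correspondence losses for a
# siamese encoder applied to two frames. NOT the paper's exact objectives.
import torch
import torch.nn.functional as F

def info_nce(query, keys, pos_idx, tau=0.07):
    # InfoNCE: row i of `query` should match row pos_idx[i] of `keys`.
    logits = (query @ keys.t()) / tau        # (Nq, Nk) similarity logits
    return F.cross_entropy(logits, pos_idx)

def segment_level_loss(feat_t, feat_t1, masks_t, masks_t1, pos_idx):
    # Coarse matching: mask-pooled segment embeddings across two frames.
    # pos_idx[i] = index in frame t+1 of the segment sharing a ground-truth
    # track id with segment i in frame t (assumed given by the annotations).
    emb_t  = F.normalize(torch.einsum('chw,nhw->nc', feat_t,  masks_t),  dim=1)
    emb_t1 = F.normalize(torch.einsum('chw,nhw->nc', feat_t1, masks_t1), dim=1)
    return info_nce(emb_t, emb_t1, pos_idx)

def pixel_level_loss(feat_t, feat_t1, match_idx):
    # Fine matching: each pixel in frame t vs. its known correspondence in
    # frame t+1, given as a flat index (e.g. derived from ground-truth flow).
    q = F.normalize(feat_t.flatten(1).t(),  dim=1)   # (H*W, C)
    k = F.normalize(feat_t1.flatten(1).t(), dim=1)   # (H*W, C)
    return info_nce(q, k, match_idx)
```

During training, these two terms would simply be added to the panoptic task loss with weighting coefficients; nothing here is needed at test time, which is consistent with the abstract's claim of per-frame inference without extra computation.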
Related papers
- Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose a training strategy, Masked Video Consistency, which enhances spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z)
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations (a toy sketch of the matching stage appears after this list).
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Merging Tasks for Video Panoptic Segmentation [0.0]
Video panoptic segmentation (VPS) is a recently introduced computer vision task that requires classifying and tracking every pixel in a given video.
To understand video panoptic segmentation, we first review its earlier-introduced constituent tasks, which address semantics and tracking separately.
We then select two data-driven approaches that do not require training on a tailored dataset to solve it.
arXiv Detail & Related papers (2021-07-10T08:46:42Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations (a minimal sketch of one message-passing step appears after this list).
We show that our method more effectively models relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
- Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency [114.02182755620784]
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Our framework is shown to outperform state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z)
- Learning Video Instance Segmentation with Recurrent Graph Neural Networks [39.06202374530647]
We propose a novel learning formulation, where the entire video instance segmentation problem is modelled jointly.
We fit a flexible model to our formulation that, with the help of a graph neural network, processes all available new information in each frame.
Our approach, operating at over 25 FPS, outperforms previous real-time video methods.
arXiv Detail & Related papers (2020-12-07T18:41:35Z)
- Learning Fast and Robust Target Models for Video Object Segmentation [83.3382606349118]
Video object segmentation (VOS) is a highly challenging problem since the initial mask, defining the target object, is only given at test-time.
Most previous approaches fine-tune segmentation networks on the first frame, resulting in impractical frame rates and a risk of overfitting.
We propose a novel VOS architecture consisting of two network components.
arXiv Detail & Related papers (2020-02-27T21:58:06Z)
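For the TAPIR entry above, here is a toy sketch of the stage-1 matching idea: correlate a query point's descriptor against every frame's feature map and take the per-frame argmax as the initial track. The function name and tensor layout are assumptions, and the stage-2 refinement is omitted entirely.

```python
# Hypothetical sketch of per-frame point matching via feature correlation.
import torch
import torch.nn.functional as F

def match_query_point(feats, t0, x0, y0):
    # feats: (T, C, H, W) per-frame feature maps; (t0, x0, y0): query point.
    T, C, H, W = feats.shape
    q = F.normalize(feats[t0, :, y0, x0], dim=0)   # (C,) query descriptor
    fmap = F.normalize(feats, dim=1)               # unit-norm channels
    corr = torch.einsum('c,tchw->thw', q, fmap)    # (T, H, W) heatmaps
    flat = corr.flatten(1).argmax(dim=1)           # best cell in each frame
    xs, ys = flat % W, flat // W
    return torch.stack([xs, ys], dim=1)            # (T, 2) initial trajectory
```

TAPIR then refines this coarse trajectory using local correlations around each estimate; that stage is not sketched here.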
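For the graph-structured video models entry, the following is a minimal sketch of one message-passing step over a spatio-temporal scene graph: nodes hold entity features, and edges connect entities within a frame (spatial) and across adjacent frames (temporal). The module and the edge construction are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical sketch of one spatio-temporal message-passing step.
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)  # message from (sender, receiver)
        self.upd = nn.GRUCell(dim, dim)     # node update from aggregated msgs

    def forward(self, x, edges):
        # x: (N, D) entity features; edges: (E, 2) long tensor of
        # (src, dst) pairs mixing spatial and temporal connections.
        src, dst = edges[:, 0], edges[:, 1]
        m = torch.relu(self.msg(torch.cat([x[src], x[dst]], dim=1)))
        agg = torch.zeros_like(x).index_add_(0, dst, m)  # sum messages per node
        return self.upd(agg, x)                          # updated node states
```

Stacking a few such steps lets information about one entity propagate to related entities in the same and neighbouring frames.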