Towards Robust Video Object Segmentation with Adaptive Object Calibration
- URL: http://arxiv.org/abs/2207.00887v1
- Date: Sat, 2 Jul 2022 17:51:29 GMT
- Title: Towards Robust Video Object Segmentation with Adaptive Object Calibration
- Authors: Xiaohao Xu, Jinglu Wang, Xiang Ming, Yan Lu
- Abstract summary: Video object segmentation (VOS) aims at segmenting objects in all target frames of a video, given annotated object masks of reference frames.
We propose a new deep network that adaptively constructs object representations and calibrates object masks to achieve stronger robustness.
Our model achieves state-of-the-art performance among published works and exhibits superior robustness against perturbations.
- Score: 18.094698623128146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As video content booms, video segmentation attracts increasing research
attention in the multimedia community. Semi-supervised video object
segmentation (VOS) aims at segmenting objects in all target frames of a video,
given annotated object masks of reference frames. Most existing methods build
pixel-wise reference-target correlations and then perform pixel-wise tracking
to obtain target masks. Because they neglect object-level cues, pixel-level
approaches leave tracking vulnerable to perturbations and even unable to
discriminate among similar objects. Towards robust VOS, the key insight is to
calibrate the representation and mask of each specific object to be expressive
and discriminative. Accordingly, we propose a new deep network, which can
adaptively construct object representations and calibrate object masks to
achieve stronger robustness. First, we construct the object representations by
applying an adaptive object proxy (AOP) aggregation method, where the proxies
represent arbitrarily shaped reference segments at multiple levels. Then,
prototype masks are initially generated from the reference-target correlations
based on AOP. Afterwards, such proto-masks are further calibrated through
network modulation, conditioning on the object proxy representations. We
consolidate this conditional mask calibration process in a progressive manner,
where the object representations and proto-masks iteratively evolve to become
more discriminative. Extensive experiments are conducted on the standard VOS
benchmarks, YouTube-VOS-18/19 and DAVIS-17. Our model achieves
state-of-the-art performance among published works and also exhibits
superior robustness against perturbations. Our project repo is at
https://github.com/JerryX1110/Robust-Video-Object-Segmentation
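The abstract describes a three-step pipeline (proxy aggregation, proxy-target correlation into a proto-mask, proxy-conditioned calibration) but gives no implementation details. The following PyTorch sketch is only a rough illustration of that flow under stated assumptions: the soft-assignment proxy construction, the FiLM-style modulation, and all module names are inventions of this sketch, not the authors' design.

```python
import torch
import torch.nn as nn

class ProxyCalibrationSketch(nn.Module):
    """Hypothetical sketch of adaptive-object-proxy segmentation:
    (1) pool reference features into a few object proxies,
    (2) correlate proxies with target features -> proto-mask,
    (3) calibrate via proxy-conditioned (FiLM-style) modulation."""

    def __init__(self, dim: int, num_proxies: int = 16):
        super().__init__()
        self.assign = nn.Conv2d(dim, num_proxies, kernel_size=1)
        self.modulator = nn.Linear(dim, 2 * dim)  # per-channel scale/shift
        self.head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, ref_feat, ref_mask, tgt_feat):
        # ref_feat / tgt_feat: (B, C, H, W), equal sizes assumed here;
        # ref_mask: (B, 1, H, W) with values in [0, 1].
        B, C, H, W = ref_feat.shape
        # Soft-assign object pixels to proxies, then average-pool features.
        a = torch.softmax(self.assign(ref_feat), dim=1) * ref_mask
        a = a.flatten(2)                                      # (B, P, HW)
        f = ref_feat.flatten(2).transpose(1, 2)               # (B, HW, C)
        proxies = (a @ f) / (a.sum(-1, keepdim=True) + 1e-6)  # (B, P, C)
        # Proxy-target correlation gives a coarse proto-mask.
        corr = proxies @ tgt_feat.flatten(2)                  # (B, P, HW)
        proto = corr.max(dim=1).values.view(B, 1, H, W)
        # Calibration: modulate target features on a proxy descriptor.
        scale, shift = self.modulator(proxies.mean(1)).chunk(2, dim=-1)
        mod = tgt_feat * (1 + scale[..., None, None]) + shift[..., None, None]
        return torch.sigmoid(self.head(mod) + proto)
```

The paper applies this calibration progressively; a loop that feeds each output back in as the next mask would mimic that flavor, though the exact iteration scheme is not given in the abstract.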
Related papers
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
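For intuition on the selection mechanism above, here is a deliberately crude sketch: score each frame's proposal mask by IoU with its temporal neighbors and keep the most consistent frames as exemplars. The actual method is sequence-level and flow-aware; this toy version skips flow warping entirely and is not the authors' algorithm.

```python
import torch

def select_exemplars(masks: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Pick the k frames whose binary masks agree best with their
    temporal neighbors. masks: (T, H, W) float tensor in {0, 1}."""
    T = masks.shape[0]
    flat = masks.flatten(1)                        # (T, HW)
    inter = flat @ flat.t()                        # pairwise intersections
    area = flat.sum(dim=1)                         # (T,)
    union = area[:, None] + area[None, :] - inter
    iou = inter / union.clamp(min=1e-6)            # (T, T) pairwise IoU
    # Consistency = mean IoU with the adjacent frames.
    prev_iou = torch.diagonal(iou, offset=-1)      # IoU(t, t-1)
    next_iou = torch.diagonal(iou, offset=1)       # IoU(t, t+1)
    score = torch.zeros(T)
    score[1:] += prev_iou
    score[:-1] += next_iou
    score[1:-1] /= 2                               # interior frames saw both
    return score.topk(min(k, T)).indices
```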
- Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation [76.40565872257709]
We develop a unified framework that jointly models mask embedding and cross-frame dense correspondence for locally discriminative feature learning.
It is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos.
Our algorithm sets the state of the art on two standard benchmarks (DAVIS17 and YouTube-VOS).
arXiv Detail & Related papers (2023-03-17T16:23:36Z)
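In correspondence-learning methods of this kind, mask-guided inference is typically implemented as label transport through a softmax feature affinity. A minimal single-reference-frame sketch, omitting the paper's unified embedding, top-k filtering, and multi-frame memory:

```python
import torch
import torch.nn.functional as F

def propagate_mask(ref_feat, tgt_feat, ref_mask, temperature: float = 0.07):
    """Carry a reference mask to a target frame via dense feature affinity.
    ref_feat / tgt_feat: (C, H, W); ref_mask: (K, H, W) one-hot over K classes."""
    C, H, W = ref_feat.shape
    ref = F.normalize(ref_feat.flatten(1), dim=0)   # (C, HW)
    tgt = F.normalize(tgt_feat.flatten(1), dim=0)   # (C, HW)
    # Affinity of every target pixel to every reference pixel.
    aff = torch.softmax(tgt.t() @ ref / temperature, dim=1)  # (HW_tgt, HW_ref)
    labels = ref_mask.flatten(1).t()                # (HW_ref, K)
    out = aff @ labels                              # (HW_tgt, K)
    return out.t().reshape(-1, H, W)                # (K, H, W) soft mask
```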
- Unsupervised Video Object Segmentation via Prototype Memory Network [5.612292166628669]
Unsupervised video object segmentation aims to segment a target object in the video without a ground truth mask in the initial frame.
This challenge requires extracting features for the most salient common objects within a video sequence.
We propose a novel prototype memory network architecture to solve this problem.
arXiv Detail & Related papers (2022-09-08T11:08:58Z)
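The summary does not specify the prototype memory itself, but the standard building block for prototypes is masked average pooling followed by pixel-prototype matching. A minimal sketch of just that block, with shapes assumed and the memory bank omitted:

```python
import torch
import torch.nn.functional as F

def masked_prototype(feat, mask):
    """Masked average pooling: compress the object region of a feature map
    into one prototype vector. feat: (C, H, W); mask: (1, H, W) in [0, 1]."""
    m = F.interpolate(mask[None], size=feat.shape[-2:], mode="bilinear")[0]
    return (feat * m).sum(dim=(1, 2)) / m.sum().clamp(min=1e-6)  # (C,)

def prototype_similarity(feat, proto):
    """Cosine similarity between every pixel and the prototype,
    usable as a coarse object-likelihood map. Returns (H, W)."""
    return F.cosine_similarity(feat, proto[:, None, None], dim=0)
```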
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment the object instances referred to by a language expression across all frames of a given video.
This work proposes several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference.
The improved ReferFormer ranks 2nd in the CVPR2022 Referring YouTube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
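Two of the listed tricks map directly onto standard tooling: PyTorch ships a CyclicLR scheduler, and flip-based test-time augmentation is a few lines. The learning rates and the stand-in model below are placeholders, not the challenge entry's actual settings.

```python
import torch
import torch.nn as nn

# Cyclical learning rate with PyTorch's built-in scheduler; the exact
# base/max values used by the challenge entry are not given, so these
# are guesses. Momentum is required because cycle_momentum defaults to True.
model = nn.Conv2d(3, 1, 3, padding=1)   # stand-in for the real model
opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
sched = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=1e-5, max_lr=1e-3, step_size_up=2000, mode="triangular")

def predict_with_flip_tta(model, frame):
    """Simplest test-time augmentation: average the prediction with the
    un-flipped prediction of a horizontally flipped input. frame: (B, 3, H, W)."""
    with torch.no_grad():
        logits = model(frame)
        flipped = model(torch.flip(frame, dims=[-1]))
        return (logits + torch.flip(flipped, dims=[-1])) / 2
```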
- Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation.
Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z)
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
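As a toy illustration of the second stage above, candidate tracklet embeddings can be scored against language tokens with cross-attention; all layer sizes and the scoring head below are invented for the sketch and are not the paper's module.

```python
import torch
import torch.nn as nn

class TrackletGroundingSketch(nn.Module):
    """Toy stand-in for tracklet-language grounding: tracklet embeddings
    attend to language tokens, then each fused tracklet gets a score."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, tracklets, lang_tokens):
        # tracklets: (B, N, D) one embedding per candidate tracklet
        # lang_tokens: (B, L, D) token embeddings of the referring expression
        fused, _ = self.cross_attn(tracklets, lang_tokens, lang_tokens)
        return self.score(fused).squeeze(-1)  # (B, N) grounding scores
```

Taking `scores.argmax(-1)` would then pick the tracklet grounded by the expression.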
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data yields substantially better generalization, leading to state-of-the-art results in both VOS and the more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)