Modular Interactive Video Object Segmentation: Interaction-to-Mask,
Propagation and Difference-Aware Fusion
- URL: http://arxiv.org/abs/2103.07941v2
- Date: Tue, 16 Mar 2021 03:02:55 GMT
- Title: Modular Interactive Video Object Segmentation: Interaction-to-Mask,
Propagation and Difference-Aware Fusion
- Authors: Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang
- Abstract summary: We present a modular interactive VOS framework which decouples interaction-to-mask and mask propagation.
We show that our method outperforms current state-of-the-art algorithms while requiring fewer frame interactions.
- Score: 68.45737688496654
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present the Modular interactive VOS (MiVOS) framework, which decouples
interaction-to-mask and mask propagation, allowing for higher generalizability
and better performance. Trained separately, the interaction module converts
user interactions to an object mask, which is then temporally propagated by our
propagation module using a novel top-$k$ filtering strategy in reading the
space-time memory. To effectively take the user's intent into account, a novel
difference-aware module is proposed to learn how to properly fuse the masks
before and after each interaction, which are aligned with the target frames by
employing the space-time memory. We evaluate our method both qualitatively and
quantitatively with different forms of user interactions (e.g., scribbles,
clicks) on DAVIS to show that our method outperforms current state-of-the-art
algorithms while requiring fewer frame interactions, with the additional
advantage of generalizing to different types of user interactions. We
contribute a large-scale synthetic VOS dataset with pixel-accurate segmentation
of 4.8M frames to accompany our source code and facilitate future research.
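For a concrete picture of the top-$k$ memory read described in the abstract, the following is a minimal PyTorch sketch of the idea: compute the affinity between the query frame's keys and all space-time memory keys, keep only the $k$ strongest matches per query location, and aggregate the corresponding memory values. The function name, tensor shapes, and default $k$ are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a top-k space-time memory read (assumed shapes and names;
# not the authors' released code). Memory keys/values are flattened over
# time and space into a single dimension of size THW.
import torch
import torch.nn.functional as F

def topk_memory_read(query_key, memory_key, memory_value, k=50):
    """
    query_key:    (B, Ck, HW)   keys of the frame being segmented
    memory_key:   (B, Ck, THW)  keys of all memorized frames
    memory_value: (B, Cv, THW)  values of all memorized frames
    returns:      (B, Cv, HW)   memory readout for the query frame
    """
    B, Cv, THW = memory_value.shape
    HW = query_key.shape[-1]

    # Affinity between every query location and every memory location.
    affinity = torch.einsum('bcq,bcm->bqm', query_key, memory_key)   # (B, HW, THW)

    # Keep only the k most similar memory entries per query location;
    # the softmax is taken over this reduced set, suppressing the long
    # tail of low-affinity (noisy) matches.
    topk_vals, topk_idx = affinity.topk(min(k, THW), dim=-1)         # (B, HW, k)
    weights = F.softmax(topk_vals, dim=-1)                           # (B, HW, k)

    # Gather the selected memory values and take their weighted sum.
    idx = topk_idx.unsqueeze(1).expand(-1, Cv, -1, -1)               # (B, Cv, HW, k)
    vals = memory_value.unsqueeze(2).expand(-1, -1, HW, -1)          # view, no copy
    readout = (vals.gather(3, idx) * weights.unsqueeze(1)).sum(-1)   # (B, Cv, HW)
    return readout
```

In a MiVOS-style pipeline, a decoder would turn this readout (together with query-frame features) into the propagated mask; restricting the read to the top-$k$ affinities is what the abstract refers to as the top-$k$ filtering strategy.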
Related papers
- Learning from Exemplars for Interactive Image Segmentation [15.37506525730218]
We introduce novel interactive segmentation frameworks for both a single object and multiple objects in the same category.
Our model reduces users' labor by around 15%, requiring two fewer clicks to reach target IoUs of 85% and 90%.
arXiv Detail & Related papers (2024-06-17T12:38:01Z)
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse point and box tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- Explore Synergistic Interaction Across Frames for Interactive Video Object Segmentation [70.93295323156876]
We propose a framework that can accept multiple frames simultaneously and explore synergistic interaction across frames (SIAF).
Our SwinB-SIAF achieves new state-of-the-art performance on DAVIS 2017 (89.6%, J&F@60).
Our R50-SIAF is more than 3× faster than the state-of-the-art competitor under challenging multi-object scenarios.
arXiv Detail & Related papers (2024-01-23T04:19:15Z)
- Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
arXiv Detail & Related papers (2023-09-18T11:06:42Z)
- InterFormer: Real-time Interactive Image Segmentation [80.45763765116175]
Interactive image segmentation enables annotators to efficiently perform pixel-level annotation for segmentation tasks.
The existing interactive segmentation pipeline suffers from inefficient computations of interactive models.
We propose a method named InterFormer that follows a new pipeline to address these issues.
arXiv Detail & Related papers (2023-04-06T08:57:00Z)
- Holistic Interaction Transformer Network for Action Detection [15.667833703317124]
"HIT" network is a comprehensive bi-modal framework that comprises an RGB stream and a pose stream.
Our method significantly outperforms previous approaches on the J-HMDB, UCF101-24, and MultiSports datasets.
arXiv Detail & Related papers (2022-10-23T10:19:37Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, the EndoVis18 Challenge and the CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z) - Revisiting Click-based Interactive Video Object Segmentation [24.114405100879278]
CiVOS builds on decoupled modules reflecting user interaction and mask propagation.
The approach is extensively evaluated on the popular interactive DAVIS dataset.
The presented CiVOS pipeline achieves competitive results while requiring a lower user workload.
arXiv Detail & Related papers (2022-03-03T15:55:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.