MUNet: Motion Uncertainty-aware Semi-supervised Video Object
Segmentation
- URL: http://arxiv.org/abs/2111.14646v1
- Date: Mon, 29 Nov 2021 16:01:28 GMT
- Authors: Jiadai Sun, Yuxin Mao, Yuchao Dai, Yiran Zhong, Jianyuan Wang
- Abstract summary: We advocate the return of the \emph{motion information} and propose a motion uncertainty-aware framework (MUNet) for semi-supervised video object segmentation.
We introduce a motion-aware spatial attention module to effectively fuse the motion feature with the semantic feature.
We achieve ${76.5\%}$ $\mathcal{J} \& \mathcal{F}$ using only DAVIS17 for training, which significantly outperforms the \textit{SOTA} methods under the low-data protocol.
- Score: 31.100954335785026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of semi-supervised video object segmentation (VOS) has been greatly
advanced, with state-of-the-art performance achieved by dense matching-based
methods. The recent methods leverage space-time memory (STM) networks and learn
to retrieve relevant information from all available sources, where the past
frames with object masks form an external memory and the current frame as the
query is segmented using the mask information in the memory. However, when
forming the memory and performing matching, these methods only exploit the
appearance information while ignoring the motion information. In this paper, we
advocate the return of the \emph{motion information} and propose a motion
uncertainty-aware framework (MUNet) for semi-supervised VOS. First, we propose
an implicit method to learn the spatial correspondences between neighboring
frames, building upon a correlation cost volume. To handle the challenging
cases of occlusion and textureless regions when constructing dense
correspondences, we incorporate uncertainty into the dense matching and obtain
a motion uncertainty-aware feature representation. Second, we introduce a
motion-aware spatial attention module to effectively fuse the motion feature
with the semantic feature. Comprehensive experiments on challenging benchmarks
show that \textbf{\textit{using a small amount of data and combining it with
powerful motion information can bring a significant performance boost}}. We
achieve ${76.5\%}$ $\mathcal{J} \& \mathcal{F}$ using only DAVIS17 for
training, which significantly outperforms the \textit{SOTA} methods under the
low-data protocol. \textit{The code will be released.}
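The abstract's two core ideas, a correlation cost volume over neighboring frames and an uncertainty measure for ambiguous matches, can be sketched in plain Python. This is a minimal illustrative sketch, not the paper's implementation: the function names, the local search window, and the entropy-based uncertainty are assumptions of this example.

```python
import math

def cost_volume(feat1, feat2, radius=1):
    """Correlation cost volume between two per-pixel feature maps.

    feat1, feat2: H x W grids of feature vectors (nested lists).
    For each pixel in feat1, correlates (dot product) against feat2
    within a (2*radius+1)^2 displacement window, as in flow-style
    dense matching. Returns per-pixel dicts: (dy, dx) -> score.
    """
    H, W = len(feat1), len(feat1[0])
    vol = [[{} for _ in range(W)] for _ in range(H)]
    for y in range(H):
        for x in range(W):
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        score = sum(a * b for a, b in
                                    zip(feat1[y][x], feat2[ny][nx]))
                        vol[y][x][(dy, dx)] = score
    return vol

def matching_uncertainty(scores):
    """Entropy of the softmax over a pixel's matching scores.

    A flat score distribution (textureless region, occlusion) yields
    high entropy, i.e. an uncertain correspondence; a single sharp
    peak yields low entropy.
    """
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    probs = [e / Z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)
```

In this spirit, a per-pixel uncertainty map derived from the cost volume could gate how strongly the motion feature contributes downstream, which is the role the abstract assigns to the motion-aware spatial attention module.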
Related papers
- Global Motion Understanding in Large-Scale Video Object Segmentation [0.499320937849508]
We show that transferring knowledge from other domains of video understanding, combined with large-scale learning, can improve the robustness of Video Object Segmentation (VOS) under complex circumstances.
Namely, we focus on integrating scene global motion knowledge to improve large-scale semi-supervised Video Object Segmentation.
We present WarpFormer, an architecture for semi-supervised Video Object Segmentation that exploits existing knowledge in motion understanding to conduct smoother propagation and more accurate matching.
arXiv Detail & Related papers (2024-05-11T15:09:22Z) - Video Object Segmentation with Dynamic Query Modulation [23.811776213359625]
We propose a query modulation method, termed QMVOS, for object and multi-object segmentation.
Our method can bring significant improvements to the memory-based SVOS method and achieve competitive performance on standard SVOS benchmarks.
arXiv Detail & Related papers (2024-03-18T07:31:39Z) - Joint Modeling of Feature, Correspondence, and a Compressed Memory for
Video Object Segmentation [52.11279360934703]
Current prevailing Video Object Segmentation (VOS) methods usually perform dense matching between the current and reference frames after extracting features.
We propose a unified VOS framework, coined as JointFormer, for joint modeling of the three elements of feature, correspondence, and a compressed memory.
arXiv Detail & Related papers (2023-08-25T17:30:08Z) - Event-Free Moving Object Segmentation from Moving Ego Vehicle [88.33470650615162]
Moving object segmentation (MOS) in dynamic scenes is an important, challenging, but under-explored research topic for autonomous driving.
Most segmentation methods leverage motion cues obtained from optical flow maps.
We propose to exploit event cameras for better video understanding, which provide rich motion cues without relying on optical flow.
arXiv Detail & Related papers (2023-04-28T23:43:10Z) - Video Semantic Segmentation with Inter-Frame Feature Fusion and
Inner-Frame Feature Refinement [39.06589186472675]
We propose a spatial-temporal fusion (STF) module to model dense pairwise relationships among multi-frame features.
Besides, we propose a novel memory-augmented refinement (MAR) module to tackle difficult predictions among semantic boundaries.
arXiv Detail & Related papers (2023-01-10T07:57:05Z) - Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z) - FlowVOS: Weakly-Supervised Visual Warping for Detail-Preserving and
Temporally Consistent Single-Shot Video Object Segmentation [4.3171602814387136]
We introduce a new foreground-targeted visual warping approach that learns flow fields from VOS data.
We train a flow module to capture detailed motion between frames using two weakly-supervised losses.
Our approach produces segmentations with high detail and temporal consistency.
arXiv Detail & Related papers (2021-11-20T16:17:10Z) - Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised
Video Object Segmentation [27.559093073097483]
Current approaches for semi-supervised Video Object Segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame.
We exploit this observation by using temporal information to quickly identify frames with minimal change.
We propose a novel dynamic network that estimates change across frames and decides which path -- computing a full network or reusing previous frame's feature -- to choose.
arXiv Detail & Related papers (2020-12-21T19:40:17Z) - Spatiotemporal Graph Neural Network based Mask Reconstruction for Video
Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting.
We propose a novel graph neural network (TG-Net), which captures local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z) - DS-Net: Dynamic Spatiotemporal Network for Video Salient Object
Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance compared to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z) - Learning Spatio-Appearance Memory Network for High-Performance Visual
Tracking [79.80401607146987]
Existing object tracking methods usually learn a bounding-box based template to match visual targets across frames, which cannot accurately learn a pixel-wise representation.
This paper presents a novel segmentation-based tracking architecture, equipped with a spatio-appearance memory network to learn accurate spatio-temporal correspondence.
arXiv Detail & Related papers (2020-09-21T08:12:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.