Implicit Motion-Compensated Network for Unsupervised Video Object
Segmentation
- URL: http://arxiv.org/abs/2204.02791v2
- Date: Tue, 16 Jan 2024 11:45:07 GMT
- Title: Implicit Motion-Compensated Network for Unsupervised Video Object
Segmentation
- Authors: Lin Xi, Weihai Chen, Xingming Wu, Zhong Liu, and Zhengguo Li
- Abstract summary: Unsupervised video object segmentation (UVOS) aims at automatically separating the primary foreground object(s) from the background in a video sequence.
Existing UVOS methods either lack robustness when there are visually similar surroundings (appearance-based) or suffer from deterioration in the quality of their predictions because of dynamic background and inaccurate flow (flow-based).
We propose an implicit motion-compensated network (IMCNet) combining complementary cues ($\textit{i.e.}$, appearance and motion) with aligned motion information from the adjacent frames to the current frame at the feature level.
- Score: 25.41427065435164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised video object segmentation (UVOS) aims at automatically
separating the primary foreground object(s) from the background in a video
sequence. Existing UVOS methods either lack robustness when there are visually
similar surroundings (appearance-based) or suffer from deterioration in the
quality of their predictions because of dynamic background and inaccurate flow
(flow-based). To overcome the limitations, we propose an implicit
motion-compensated network (IMCNet) combining complementary cues
($\textit{i.e.}$, appearance and motion) with aligned motion information from
the adjacent frames to the current frame at the feature level without
estimating optical flows. The proposed IMCNet consists of an affinity computing
module (ACM), an attention propagation module (APM), and a motion compensation
module (MCM). The light-weight ACM extracts commonality between neighboring
input frames based on appearance features. The APM then transmits global
correlation in a top-down manner. Through coarse-to-fine iterative refinement,
the APM sharpens object regions at multiple resolutions so as to
avoid losing details. Finally, the MCM aligns motion information
from temporally adjacent frames to the current frame which achieves implicit
motion compensation at the feature level. We perform extensive experiments on
$\textit{DAVIS}_{\textit{16}}$ and $\textit{YouTube-Objects}$. Our network
achieves favorable performance while running at a faster speed compared to the
state-of-the-art methods.
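To make the idea of flow-free alignment concrete, the sketch below shows one plausible form of affinity-based motion compensation at the feature level, in the spirit of the ACM/MCM described above: an affinity matrix between current- and adjacent-frame features is used to gather (warp) the neighbor's features to the current frame, with no optical flow estimated. This is an illustrative sketch only, not the authors' implementation; the function names and the dot-product/softmax formulation are our assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def implicit_motion_compensation(feat_cur, feat_adj):
    """Align adjacent-frame features to the current frame via an
    affinity (correlation) matrix instead of explicit optical flow.

    feat_cur, feat_adj: (C, H, W) appearance feature maps.
    Returns: (C, H, W) adjacent-frame features aligned to the current frame.
    """
    C, H, W = feat_cur.shape
    cur = feat_cur.reshape(C, H * W)   # (C, N) current-frame features
    adj = feat_adj.reshape(C, H * W)   # (C, N) adjacent-frame features
    # Pairwise affinity between every current-frame location and every
    # adjacent-frame location (scaled dot-product similarity).
    affinity = cur.T @ adj / np.sqrt(C)          # (N, N)
    weights = softmax(affinity, axis=1)          # rows sum to 1
    # Each current-frame location gathers adjacent-frame features
    # weighted by affinity -- an implicit, flow-free warp.
    aligned = adj @ weights.T                    # (C, N)
    return aligned.reshape(C, H, W)
```

The same affinity matrix could, in principle, also serve the commonality-extraction role of the ACM, since it already encodes which locations in neighboring frames attend to each other.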
Related papers
- Motion-Aware Video Frame Interpolation [49.49668436390514]
We introduce a Motion-Aware Video Frame Interpolation (MA-VFI) network, which directly estimates intermediate optical flow from consecutive frames.
It not only extracts global semantic relationships and spatial details from input frames with different receptive fields, but also effectively reduces the required computational cost and complexity.
arXiv Detail & Related papers (2024-02-05T11:00:14Z)
- Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning [16.094271750354835]
Motion information is critical to a robust and generalized video representation.
Recent works have adopted frame difference as the source of motion information in video contrastive learning.
We present a framework capable of introducing well-aligned and significant motion information.
arXiv Detail & Related papers (2023-09-01T07:03:27Z)
- Co-attention Propagation Network for Zero-Shot Video Object Segmentation [91.71692262860323]
Zero-shot video object segmentation (ZS-VOS) aims to segment objects in a video sequence without prior knowledge of these objects.
Existing ZS-VOS methods often struggle to distinguish between foreground and background or to keep track of the foreground in complex scenarios.
We propose an encoder-decoder-based hierarchical co-attention propagation network (HCPN) capable of tracking and segmenting objects.
arXiv Detail & Related papers (2023-04-08T04:45:48Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous respectable works have made decent success, but they only focus on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Hierarchical Feature Alignment Network for Unsupervised Video Object Segmentation [99.70336991366403]
We propose a concise, practical, and efficient architecture for appearance and motion feature alignment.
The proposed HFAN reaches a new state-of-the-art performance on DAVIS-16, achieving 88.7 $\mathcal{J}\&\mathcal{F}$ Mean, i.e., a relative improvement of 3.5% over the best published result.
arXiv Detail & Related papers (2022-07-18T10:10:14Z)
- Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z)
- FAMINet: Learning Real-time Semi-supervised Video Object Segmentation with Steepest Optimized Optical Flow [21.45623125216448]
Semi-supervised video object segmentation (VOS) aims to segment a few moving objects in a video sequence, where these objects are specified by annotations in the first frame.
The optical flow has been considered in many existing semi-supervised VOS methods to improve the segmentation accuracy.
A FAMINet, which consists of a feature extraction network (F), an appearance network (A), a motion network (M), and an integration network (I), is proposed in this study to address the abovementioned problem.
arXiv Detail & Related papers (2021-11-20T07:24:33Z)
- Feature Flow: In-network Feature Flow Estimation for Video Object Detection [56.80974623192569]
Optical flow is widely used in computer vision tasks to provide pixel-level motion information.
A common approach is to forward optical flow to a neural network and fine-tune this network on the task dataset.
We propose a novel network (IFF-Net) with an \textbf{I}n-network \textbf{F}eature \textbf{F}low estimation module for video object detection.
arXiv Detail & Related papers (2020-09-21T07:55:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.