Hierarchical Feature Alignment Network for Unsupervised Video Object
Segmentation
- URL: http://arxiv.org/abs/2207.08485v2
- Date: Tue, 19 Jul 2022 09:23:22 GMT
- Title: Hierarchical Feature Alignment Network for Unsupervised Video Object
Segmentation
- Authors: Gensheng Pei, Fumin Shen, Yazhou Yao, Guo-Sen Xie, Zhenmin Tang,
Jinhui Tang
- Abstract summary: We propose a concise, practical, and efficient architecture for appearance and motion feature alignment.
The proposed HFAN reaches a new state-of-the-art performance on DAVIS-16, achieving 88.7 $\mathcal{J}\&\mathcal{F}$ Mean, i.e., a relative improvement of 3.5% over the best published result.
- Score: 99.70336991366403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optical flow is an intuitive and valuable cue for advancing
unsupervised video object segmentation (UVOS). Most previous methods
directly extract and fuse the motion and appearance features for segmenting
target objects in the UVOS setting. However, optical flow intrinsically
encodes the instantaneous velocity of all pixels between consecutive
frames, so the motion features are often poorly aligned with the primary
objects in the corresponding frames. To address this challenge, we propose a concise,
practical, and efficient architecture for appearance and motion feature
alignment, dubbed hierarchical feature alignment network (HFAN). Specifically,
the key merits in HFAN are the sequential Feature AlignMent (FAM) module and
the Feature AdaptaTion (FAT) module, which are leveraged for processing the
appearance and motion features hierarchically. FAM aligns the appearance
and motion features, respectively, with the semantic representations of the
primary objects. Further, FAT is explicitly designed for the
adaptive fusion of appearance and motion features to achieve a desirable
trade-off between cross-modal features. Extensive experiments demonstrate the
effectiveness of the proposed HFAN, which reaches a new state-of-the-art
performance on DAVIS-16, achieving 88.7 $\mathcal{J}\&\mathcal{F}$ Mean, i.e.,
a relative improvement of 3.5% over the best published result.
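The abstract describes FAM and FAT only at a high level. As a rough illustration of the align-then-adapt idea, here is a minimal PyTorch sketch of one pyramid level that first aligns both modalities to a shared coarse object estimate and then fuses them with per-pixel adaptive weights. All module and variable names are illustrative assumptions, not the paper's actual FAM/FAT implementation.

```python
# Minimal sketch (not the authors' code) of the align-then-fuse idea the
# abstract describes: align appearance and motion features to a shared
# object estimate, then adaptively weight the two modalities per pixel.
import torch
import torch.nn as nn

class AlignFuseBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Coarse object logits from each modality (stand-in for FAM).
        self.app_head = nn.Conv2d(channels, 1, kernel_size=1)
        self.mot_head = nn.Conv2d(channels, 1, kernel_size=1)
        # Per-pixel modality weights (stand-in for FAT's adaptive fusion).
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=1),
        )

    def forward(self, app_feat: torch.Tensor, mot_feat: torch.Tensor):
        # "Align": modulate each feature map by a shared coarse object mask
        # so both modalities emphasize the same primary-object regions.
        shared_mask = torch.sigmoid(self.app_head(app_feat) + self.mot_head(mot_feat))
        app_aligned = app_feat * shared_mask + app_feat
        mot_aligned = mot_feat * shared_mask + mot_feat
        # "Adapt": a softmax gate trades off the two modalities per pixel.
        weights = torch.softmax(self.gate(torch.cat([app_aligned, mot_aligned], dim=1)), dim=1)
        return weights[:, 0:1] * app_aligned + weights[:, 1:2] * mot_aligned

# Toy usage: one pyramid level with 64-channel features.
block = AlignFuseBlock(channels=64)
app = torch.randn(2, 64, 32, 32)   # appearance features (RGB encoder)
mot = torch.randn(2, 64, 32, 32)   # motion features (optical-flow encoder)
out = block(app, mot)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

In a hierarchical network, one such block would be applied per feature-pyramid level, with the fused output feeding a segmentation decoder.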
Related papers
- SimulFlow: Simultaneously Extracting Feature and Identifying Target for
Unsupervised Video Object Segmentation [28.19471998380114]
Unsupervised video object segmentation (UVOS) aims at detecting the primary objects in a given video sequence without any human intervention.
Most existing methods rely on two-stream architectures that separately encode the appearance and motion information before fusing them to identify the target and generate object masks.
We propose a novel UVOS model called SimulFlow that simultaneously performs feature extraction and target identification.
arXiv Detail & Related papers (2023-11-30T06:44:44Z)
- Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation [47.39455910191075]
Video amodal segmentation is a challenging task in computer vision.
Recent studies have achieved promising performance by using motion flow to integrate information across frames under a self-supervised setting.
This paper rethinks previous works by leveraging supervised signals with an object-centric representation.
arXiv Detail & Related papers (2023-09-23T04:12:02Z)
- Adaptive Multi-source Predictor for Zero-shot Video Object Segmentation [68.56443382421878]
We propose a novel adaptive multi-source predictor for zero-shot video object segmentation (ZVOS).
In the static object predictor, the RGB source is converted to depth and static saliency sources, simultaneously.
Experiments show that the proposed model outperforms the state-of-the-art methods on three challenging ZVOS benchmarks.
arXiv Detail & Related papers (2023-03-18T10:19:29Z)
- Motion-inductive Self-supervised Object Discovery in Videos [99.35664705038728]
We propose a model for processing consecutive RGB frames, and infer the optical flow between any pair of frames using a layered representation.
We demonstrate superior performance over previous state-of-the-art methods on three public video segmentation datasets.
arXiv Detail & Related papers (2022-10-01T08:38:28Z)
- Implicit Motion-Compensated Network for Unsupervised Video Object Segmentation [25.41427065435164]
Unsupervised video object segmentation (UVOS) aims at automatically separating the primary foreground object(s) from the background in a video sequence.
Existing UVOS methods either lack robustness when there are visually similar surroundings (appearance-based) or suffer from deterioration in the quality of their predictions because of dynamic backgrounds and inaccurate flow (flow-based).
We propose an implicit motion-compensated network (IMCNet) combining complementary cues (i.e., appearance and motion), with motion information aligned from the adjacent frames to the current frame at the feature level (see the warping sketch after this list).
arXiv Detail & Related papers (2022-04-06T13:03:59Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- FAMINet: Learning Real-time Semi-supervised Video Object Segmentation with Steepest Optimized Optical Flow [21.45623125216448]
Semi-supervised video object segmentation (VOS) aims to segment a few moving objects in a video sequence, where these objects are specified by the annotation of the first frame.
Optical flow has been incorporated in many existing semi-supervised VOS methods to improve segmentation accuracy.
A FAMINet, which consists of a feature extraction network (F), an appearance network (A), a motion network (M), and an integration network (I), is proposed in this study to address the above-mentioned problem.
arXiv Detail & Related papers (2021-11-20T07:24:33Z)
- Feature Flow: In-network Feature Flow Estimation for Video Object Detection [56.80974623192569]
Optical flow is widely used in computer vision tasks to provide pixel-level motion information.
A common approach is to forward optical flow to a neural network and fine-tune this network on the task dataset.
We propose a novel network (IFF-Net) with an In-network Feature Flow estimation module for video object detection.
arXiv Detail & Related papers (2020-09-21T07:55:50Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
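Several entries above (e.g., the IMCNet summary) hinge on aligning features from adjacent frames to the current frame using optical flow. The sketch below illustrates that generic building block, bilinear backward warping of a feature map along a flow field; it is a hedged, self-contained recipe, not code from any listed paper.

```python
# Illustrative sketch of flow-based feature warping, the alignment
# primitive several of the papers above build on. Generic recipe only;
# not code from IMCNet or any other listed method.
import torch
import torch.nn.functional as F

def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp `feat` (N, C, H, W) by `flow` (N, 2, H, W) in pixel units,
    so output[n, :, y, x] is sampled from (x + flow_x, y + flow_y)."""
    n, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat.dtype, device=feat.device),
        torch.arange(w, dtype=feat.dtype, device=feat.device),
        indexing="ij",
    )
    x_new = xs.unsqueeze(0) + flow[:, 0]   # (N, H, W)
    y_new = ys.unsqueeze(0) + flow[:, 1]
    # Normalize to [-1, 1] as grid_sample expects (align_corners=True).
    grid = torch.stack(
        [2.0 * x_new / max(w - 1, 1) - 1.0,
         2.0 * y_new / max(h - 1, 1) - 1.0],
        dim=-1,
    )  # (N, H, W, 2), last dim ordered (x, y)
    return F.grid_sample(feat, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Toy usage: shift features one pixel to the right.
feat = torch.randn(1, 8, 16, 16)
flow = torch.zeros(1, 2, 16, 16)
flow[:, 0] = 1.0  # horizontal displacement of +1 pixel
aligned = warp_features(feat, flow)
```

Two-stream methods typically fuse such motion-aligned features with appearance features before decoding object masks.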